If you have a few years of experience in Computer Science or research, and you’re interested in sharing that experience with the community, have a look at our Contribution Guidelines.


1. Introduction

In this tutorial, we describe how to use the silhouette plot in cluster analysis.

Clustering is one of the unsupervised learning methods. First, we explain what silhouette values measure and how to calculate and interpret them. Then, we show how to determine the number of clusters using the mean silhouette value.

2. Silhouette Plots in Cluster Analysis

A silhouette plot is a graphical tool depicting how well our data points fit into the clusters they’ve been assigned to. We call this quality of fit cohesion.

At the same time, a silhouette plot shows the quality of separation: this metric conveys the degree to which the points that don’t belong to the same cluster have been assigned to different ones.

To analyze clusters, we need to consider both criteria, which silhouette plots allow us to do.

3. Silhouette Values

A silhouette value is a combination of two scores: cohesion and separation.

3.1. Cohesion

Cohesion measures the similarity of the points in the same cluster. So, we can call it an intra-cluster metric.

Let C be a cluster and x_i, x_j \in C two points in it. We can interpret the distance between them as a measure of their dissimilarity: the smaller the distance, the more similar the points. From there, we define the cohesion of point x_i in its cluster C as the mean distance between x_i and the other points x_j in C:

    \[a_i = \mathrm{mean}_{x_j \in C, \, j \neq i}(distance(x_i, x_j))\]
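The cohesion formula can be sketched in a few lines of Python. This is only an illustration: it assumes Euclidean distance, the function name `cohesion` and the sample cluster are made up for the example, and returning 0 for a one-point cluster is our own convention.

```python
import math

def cohesion(i, cluster):
    """Mean Euclidean distance from cluster[i] to the other points of the cluster."""
    others = cluster[:i] + cluster[i + 1:]  # leave out the point itself
    if not others:
        return 0.0  # assumption: a one-point cluster gets cohesion 0
    return sum(math.dist(cluster[i], p) for p in others) / len(others)

# Hypothetical cluster, made up for illustration
cluster = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# The distances from (0, 0) are 5 and 10, so the mean is 7.5
print(cohesion(0, cluster))
```

A lower score means the point sits closer to the rest of its cluster.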

3.2. Separation

On the other hand, separation refers to the degree to which the clusters don’t overlap. So, it’s an inter-cluster metric.

Intuitively, the distance between the clusters tells us how well they are separated. So, we define the separation of x_i \in C_1 as the minimum, taken over all the other clusters C_2, of the mean distance between x_i and the points in C_2:

    \[b_i = \min_{C_2 \neq C_1}(\mathrm{mean}_{x_j \in C_2}(distance(x_i, x_j)))\]
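As a rough sketch of the separation formula, we can compute the mean distance to each of the other clusters and keep the smallest one. The names `mean_distance` and `separation`, the Euclidean distance, and the sample points are assumptions made for this example.

```python
import math

def mean_distance(point, cluster):
    """Mean Euclidean distance from a point to all the points of a cluster."""
    return sum(math.dist(point, p) for p in cluster) / len(cluster)

def separation(point, other_clusters):
    """Smallest mean distance from the point to any cluster it doesn't belong to."""
    return min(mean_distance(point, cluster) for cluster in other_clusters)

# Hypothetical data, made up for illustration
point = (0.0, 0.0)
near = [(3.0, 4.0), (6.0, 8.0)]  # distances 5 and 10, mean 7.5
far = [(30.0, 40.0)]             # distance 50

print(separation(point, [near, far]))  # the nearer cluster decides: 7.5
```

Only the closest neighboring cluster matters, so adding more distant clusters doesn't change the score.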

3.3. Combining Cohesion and Separation into a Silhouette Value

Then, the silhouette value of a point x_i is:

    \[s_i = \frac{b_i - a_i}{\max(a_i, b_i)}\]

Its range is [-1, 1]. The higher the silhouette value of a point, the more confident we can be that the point has been assigned to the right cluster. So, a high mean silhouette value over all the points indicates a good clustering.
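To see the formula and its range in action, here is a minimal sketch that takes a cohesion score and a separation score and combines them. The function name `silhouette` and the handling of the case where both scores are zero are our own assumptions.

```python
def silhouette(a, b):
    """Silhouette value from a cohesion score a and a separation score b."""
    largest = max(a, b)
    if largest == 0:
        return 0.0  # assumption: both scores are zero, so neither cluster is favored
    return (b - a) / largest

# Sweep a grid of non-negative scores and check that we stay within [-1, 1]
grid = [i / 4 for i in range(41)]
values = [silhouette(a, b) for a in grid for b in grid]
print(min(values), max(values))
```

The extremes are reached when one of the two scores is zero and the other isn't.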

3.4. How to Calculate the Silhouette Value?

Let’s take a look at clusters A, B, and C in the following image:

Example with three clusters

Let’s compute the silhouette value of the point (9.5, 15.5) \in A. To do so, we need to calculate its cohesion and separation scores.

The cohesion of point (9.5, 15.5) \in A is its mean distance to the other points in A. The distance between the point (9.5, 15.5) and point (11, 14.5) is:

    \[\sqrt{(9.5-11)^2 + (15.5-14.5)^2} = \sqrt{2.25 + 1} = \sqrt{3.25} \approx 1.8\]

and the distance to (11.5, 16.5) is:

    \[\sqrt{(9.5-11.5)^2 + (15.5-16.5)^2} = \sqrt{4.0 + 1.0} = \sqrt{5.0} \approx 2.2\]

So, the cohesion of point (9.5, 15.5) \in A is:

    \[cohesion_{(9.5, 15.5)} = \frac{1.8 + 2.2}{2} = 2\]

We can compute the mean distance of the point (9.5, 15.5) to cluster B in the same way:

    \[\begin{aligned} dist_{( (9.5, 15.5), (7.5, 4.3) )} &= \sqrt{(9.5-7.5)^2 + (15.5-4.3)^2} = \sqrt{4.0 + 125.44} = \sqrt{129.44} \approx 11.4 \\ dist_{( (9.5, 15.5), (3.7, 7) )} &= \sqrt{(9.5-3.7)^2 + (15.5-7)^2} = \sqrt{33.64 + 72.25} = \sqrt{105.89} \approx 10.3 \\ dist_{( (9.5, 15.5), (5.4, 4.7) )} &= \sqrt{(9.5-5.4)^2 + (15.5-4.7)^2} = \sqrt{16.81 + 116.64} = \sqrt{133.45} \approx 11.6 \\ mean\_distance &= \frac{11.4 + 10.3 + 11.6}{3} \approx 11.1 \end{aligned}\]

Following the same steps, we compute its mean distance to cluster C:

    \[separation_{(9.5, 15.5), C} = \frac{8.9 + 11.5 + 11.6 + 11.5}{4} \approx 10.9\]

Therefore, the separation score of (9.5, 15.5) is 10.9, since that’s the lower of the two mean distances (11.1 to B and 10.9 to C).

From there, we get the silhouette value of (9.5, 15.5):

    \[s_i = \frac{b_i - a_i}{\max(a_i, b_i)} = \frac{10.9 - 2}{\max(10.9, 2)} = \frac{8.9}{10.9} \approx 0.8\]

Since 0.8 is close to the theoretically maximal silhouette value, we can be confident that we assigned (9.5, 15.5) to the cluster it actually belongs to.
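We can double-check this arithmetic with a short script. It uses the coordinates that appear in the distance calculations above. The points of cluster C aren’t listed in the text, so the sketch takes the mean distance to C (10.9) as given rather than recomputing it; the variable names are our own.

```python
import math

def mean_distance(point, points):
    """Mean Euclidean distance from a point to a list of points."""
    return sum(math.dist(point, p) for p in points) / len(points)

x = (9.5, 15.5)                                   # the point we evaluate
a_others = [(11.0, 14.5), (11.5, 16.5)]           # the other points of cluster A
b_points = [(7.5, 4.3), (3.7, 7.0), (5.4, 4.7)]   # the points of cluster B

a = mean_distance(x, a_others)     # cohesion
mean_to_b = mean_distance(x, b_points)
mean_to_c = 10.9                   # taken from the text; C's points aren't listed
b = min(mean_to_b, mean_to_c)      # separation
s = (b - a) / max(a, b)

print(round(a, 2), round(mean_to_b, 2), round(s, 2))
```

Without the intermediate rounding, the cohesion is about 2.02, the mean distance to B is about 11.07, and the silhouette value is about 0.81, which agrees with the hand calculation.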

3.5. Analyzing the Silhouette Values

Let’s say that we have two clusters: A and B, such that B is the cluster closest to A. We’re calculating the silhouette values of the points in A. The silhouette value s_i of any point in A will be closer to 1 when b_i \gg a_i. In that case, the mean distance between x_i and the other points in A is much smaller than its mean distance to the points in B. This indicates that x_i belongs to cluster A:

s(i) approximately 1

Therefore, as the average of the silhouette values of the points in a cluster gets closer to 1, the cohesion and separation of the cluster as a whole increase.

The opposite situation is less clear-cut. When the silhouette value is close to zero, a_i and b_i are similar. Then, the mean distances between x_i and the elements of A and B are similar. Thus, it is not clear if x_i should belong to A or B:

s(i) approximately 0

The worst case occurs when the silhouette value s_i is negative, which happens when a_i is greater than b_i. Then, the point in question lies closer to B than to A. So, it looks misclassified:

s(i) approximately -1

Hence, as the s_i values decrease, the quality of separation deteriorates, and the quality of the clustering gets worse.
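The three cases above follow directly from the definition of s_i. As a short derivation, assuming a_i and b_i aren’t both zero, we can split the formula by which of the two scores is larger:

    \[s_i = \begin{cases} 1 - \dfrac{a_i}{b_i}, & a_i < b_i \\ 0, & a_i = b_i \\ \dfrac{b_i}{a_i} - 1, & a_i > b_i \end{cases}\]

So, s_i approaches 1 as the ratio a_i / b_i goes to zero, and it approaches -1 as the ratio b_i / a_i goes to zero.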

4. Silhouette Plots

The silhouette of a cluster visualizes the silhouette values s_i of all the points in it in decreasing order. A silhouette plot shows the silhouettes of all the clusters, stacked in arbitrary order. Additionally, it inserts blank spaces between consecutive clusters and can color them differently.

For example, here’s a plot for the clusters we got by running the K-Means clustering algorithm on an ad-hoc two-dimensional dataset:

Average silhouette score 0.71

Here, we set k=2. On the left, we have the silhouette plot. The x-axis shows the silhouette values, and the height of each silhouette indicates the number of points in the corresponding cluster. The right subplot visualizes the data points with the same colors as the clusters. The red line shows the average silhouette value over all the points. In this example, the average value is 0.71.

From the right subplot, we conclude that the cohesion of the green points is higher than that of the black ones. This should explain the worse silhouette of the black cluster on the left. However, the two clusters look well separated.
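The way a silhouette plot is assembled can be sketched without any plotting library. The text-only version below groups the values by cluster, sorts each group in decreasing order, and draws one bar per point. The silhouette values and labels are hypothetical, and the function names `silhouette_rows` and `render` are made up for this sketch.

```python
from collections import defaultdict

def silhouette_rows(values, labels):
    """Group silhouette values by cluster, each group sorted in decreasing order."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    return {label: sorted(group, reverse=True)
            for label, group in sorted(groups.items())}

def render(values, labels, width=40):
    """Draw one text bar per point, with a blank line between clusters."""
    lines = []
    for label, group in silhouette_rows(values, labels).items():
        for value in group:
            bar = "#" * round(abs(value) * width)
            sign = "-" if value < 0 else " "
            lines.append(f"{label} {sign}{bar} {value:+.2f}")
        lines.append("")
    mean_value = sum(values) / len(values)
    lines.append(f"mean silhouette value: {mean_value:.2f}")
    return "\n".join(lines)

# Hypothetical silhouette values and cluster labels, made up for illustration
values = [0.82, 0.75, 0.91, 0.40, -0.10, 0.52]
labels = ["A", "A", "A", "B", "B", "B"]
print(render(values, labels))
```

The number of bars in a group plays the role of the silhouette’s height, and the last line corresponds to the red line in the plot.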

5. Choosing the Number of Clusters

By plotting the silhouettes for different values of k, we can see which k best fits the data.

For instance, the following graphic shows the K-Means results for the above data and k = 3. The average silhouette value increases to 0.78:

3 clusters silhouette_score 0.78

Here, we see that the right cluster remains intact, while the left one splits into two smaller ones. These two clusters have better silhouette values than the blue cluster, but all three appear to be well defined.

What happens if we use k=4? The average silhouette value drops slightly to 0.74:

4 clusters silhouette_score 0.74

Here, we see that the green and black clusters have better silhouettes than the other two. It’s probably because their points are better separated than the points of the other two clusters.

Finally, the average silhouette value for k=5 drops to 0.66:

5 clusters silhouette_score 0.66

and for k=6 to 0.53:

6 clusters silhouette_score 0.53

So, k=2,3,4 appear to be good choices, whereas k=5 and k=6 give lower-quality clusters.
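The comparison across different values of k can be sketched end to end on a tiny example. The code below computes the silhouette value of every point and then compares the mean for three candidate labelings. The nine points and the hand-written labelings are hypothetical stand-ins for what a clustering algorithm such as K-Means would return, and assigning 0 to a point in a one-point cluster is a convention we assume here.

```python
import math
from statistics import mean

def silhouette_values(points, labels):
    """Silhouette value of every point, using Euclidean distance."""
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    values = []
    for i, (point, label) in enumerate(zip(points, labels)):
        own = [q for j, (q, m) in enumerate(zip(points, labels))
               if m == label and j != i]
        if not own:
            values.append(0.0)  # assumed convention for one-point clusters
            continue
        a = mean(math.dist(point, q) for q in own)
        b = min(mean(math.dist(point, q) for q in members)
                for other, members in clusters.items() if other != label)
        largest = max(a, b)
        values.append((b - a) / largest if largest > 0 else 0.0)
    return values

# Hypothetical data: three tight groups of three points each
points = [(0, 0), (0, 1), (1, 0),
          (10, 0), (10, 1), (11, 0),
          (5, 10), (5, 11), (6, 10)]

# Hand-written labelings standing in for clustering results
labelings = {
    2: [0, 0, 0, 0, 0, 0, 1, 1, 1],  # the two bottom groups merged
    3: [0, 0, 0, 1, 1, 1, 2, 2, 2],  # the natural grouping
    4: [0, 0, 0, 1, 1, 1, 2, 2, 3],  # the top group split in two
}

means = {k: mean(silhouette_values(points, labels))
         for k, labels in labelings.items()}
best_k = max(means, key=means.get)
for k, value in means.items():
    print(k, round(value, 2))
print("best k:", best_k)
```

Here, the natural grouping with three clusters gets the highest mean. In practice, libraries such as scikit-learn offer ready-made `silhouette_score` and `silhouette_samples` functions in `sklearn.metrics`, so we rarely need to write this by hand.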

5.1. Interpreting the Mean of Silhouette Values

Silhouette values measure the relation between cluster cohesion and cluster separation. Thus, the mean of the silhouette values represents the balance of the overall cohesion and separation in all the clusters.

As we concluded previously, the cohesion and separation of clusters are better when the silhouette values are close to 1. Therefore, we’re looking for the clustering with a higher mean silhouette value, ideally close to 1.

One way to select the number of clusters is to choose the one with the highest overall mean. Kaufman and Rousseeuw (1990) named the overall mean the silhouette coefficient (SC). By their classification, if SC > 0.70, the structure of the clusters is strong. If SC is between 0.51 and 0.70, the structure is reasonable. Lower values indicate poor structure.

Here are the mean values for k=2,3,4,5,6 we got in the above example:

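    \[\begin{array}{c|ccccc} k & 2 & 3 & 4 & 5 & 6 \\ \hline \text{mean silhouette value} & 0.71 & 0.78 & 0.74 & 0.66 & 0.53 \end{array}\]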

Considering the SC criterion, acceptable solutions are those with 2, 3, and 4 clusters. In this particular simulation, we used the data from four blobs, so the best solution should be the one with 4 clusters. However, the solution with 3 clusters has a slightly higher mean silhouette value (0.78 versus 0.74).
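The selection rule can be written down as a short sketch that works on the mean values reported in this example. The thresholds follow the classification quoted above, while the function name `structure` and the dictionary layout are our own.

```python
# Mean silhouette values reported in this example for each k
reported_means = {2: 0.71, 3: 0.78, 4: 0.74, 5: 0.66, 6: 0.53}

def structure(sc):
    """Label a silhouette coefficient using the thresholds quoted above."""
    if sc > 0.70:
        return "strong"
    if sc >= 0.51:
        return "reasonable"
    return "poor"

best_k = max(reported_means, key=reported_means.get)
for k, sc in reported_means.items():
    print(k, sc, structure(sc))
print("highest mean silhouette value at k =", best_k)
```

The highest mean is a useful starting point, but as this example shows, it doesn’t always recover the number of groups the data was generated from, so it’s worth inspecting the silhouette plots as well.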

6. Conclusion

In this tutorial, we talked about silhouette plots and values. A silhouette plot is a graphical tool we use to evaluate the quality of clusters. The silhouette values show the degree of cohesion and separation of the clusters. The mean of the silhouette values helps us choose the number of clusters that best fits the dataset.
