Stop Using Elbow Method in K-Means Clustering

Learn how to find the number of clusters in K-means clustering without using the elbow method.

Written by Anmol Tomar
Published on Aug. 02, 2023

The elbow method is a graphical technique for finding the optimal ‘K’ in k-means clustering. You plot a measure of cluster compactness against K and pick the K-value where the curve bends into an elbow. However, this is often not the best way to find the optimal ‘K’.

Elbow Method Definition

The elbow method is a graphical method for finding the optimal K value in a k-means clustering algorithm. The elbow graph shows the within-cluster sum of squares (WCSS) values on the y-axis corresponding to the different values of K on the x-axis. The optimal K value is the point at which the graph forms an elbow.

In this blog, we will look at the most practical way of finding the number of clusters (or K) for your k-means clustering algorithm and why the elbow method isn’t the answer.

Following are the topics that we will cover in this blog:

  1. What is K-means clustering?
  2. What is the elbow method?
  3. What are the drawbacks of the elbow method?
  4. Why the Silhouette Method is better than the elbow method.
  5. How to do the elbow method in Python.
  6. How to do the Silhouette method in Python.

Let’s get started.

 

What Is K-means Clustering?

K-means clustering is a distance-based unsupervised clustering algorithm in which data points that are close to each other are grouped into a given number of clusters.

It’s one of the most used clustering algorithms in the field of data science. To successfully implement the k-means algorithm, we first need to decide how many clusters we want it to create.

The following are the steps followed by the k-means algorithm:

  1. Initialize K, i.e., the number of clusters to be created.
  2. Randomly pick K points as the initial centroids.
  3. Assign each data point to its nearest centroid to create K clusters.
  4. Re-calculate the centroids using the newly created clusters.
  5. Repeat steps 3 and 4 until the centroids stop changing.
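
The five steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a production implementation: the function name `kmeans_sketch`, the choice of picking initial centroids from the data points, and the toy data are all assumptions made for this example. In practice you would use `sklearn.cluster.KMeans`, as the rest of this article does.

```python
import numpy as np

def kmeans_sketch(X, k, max_iter=100, seed=0):
    """Plain k-means following the five steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Steps 1-2: K is given; pick K distinct data points as initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Step 5: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Toy data (an assumption for this example): two blobs of 50 points each.
rng = np.random.default_rng(42)
X_demo = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(5, 0.5, (50, 2))])
labels_demo, centroids_demo = kmeans_sketch(X_demo, k=2)
print(centroids_demo)
```

Because the starting centroids are random, plain k-means can settle on a poor local solution. That is one reason library implementations rerun the algorithm from several initializations and keep the best result.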


 

What Is the Elbow Method?

As I mentioned, the elbow method involves finding the optimal K via a graphical representation. It works by computing the within-cluster sum of squares (WCSS), i.e., the sum of the squared distances between the points in each cluster and that cluster’s centroid.

The elbow graph shows WCSS values on the y-axis corresponding to the different values of K on the x-axis. When we see an elbow shape in the graph, we pick the K-value where the elbow gets created. We can call this the elbow point. Beyond the elbow point, increasing the value of ‘K’ does not lead to a significant reduction in WCSS.
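
To make the definition concrete, here is a small sketch that computes WCSS by hand for several values of K on the Iris data set. The helper name `wcss` and the range of K values are assumptions made for illustration. scikit-learn exposes the same quantity on a fitted `KMeans` model as the `inertia_` attribute, so the hand-computed values should agree with it.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

def wcss(X, labels, centroids):
    """Sum of squared distances from each point to its own cluster centroid."""
    return float(((X - centroids[labels]) ** 2).sum())

wcss_by_k = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    wcss_by_k[k] = wcss(X, km.labels_, km.cluster_centers_)
    # km.inertia_ holds the same quantity as computed by scikit-learn.
    print(f"K={k}: WCSS={wcss_by_k[k]:.1f} (inertia_={km.inertia_:.1f})")
```

Plotting `wcss_by_k` against K gives the elbow graph described above. WCSS shrinks as K grows, so the question is only where the decrease levels off.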

 

What Are the Drawbacks of the Elbow Method?

The elbow curve is expected to look like this:

Expected elbow curve. | Image: Anmol Tomar

But here’s what it typically looks like:

Actual elbow curve with no clear elbow. | Image: Anmol Tomar

So, in most real-world data sets, there’s no clear elbow point from which to read off the right ‘K’. This makes it easy to pick the wrong K.

 

Why the Silhouette Method Is Better Than the Elbow Method

The Silhouette score is a very useful way to find the value of K when the elbow method doesn’t show a clear elbow point.

The value of the Silhouette score ranges from -1 to 1 and is interpreted as follows:

  • 1: Points are perfectly assigned in a cluster and clusters are easily distinguishable.
  • 0: Clusters are overlapping.
  • -1: Points are wrongly assigned in a cluster.
Silhouette scores for two clusters. | Image: Anmol Tomar

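For a single data point, Silhouette Score = (b-a)/max(a,b)

Where:

  • a = average intra-cluster distance, i.e., the average distance between the point and the other points in its own cluster.
  • b = average nearest-cluster distance, i.e., the average distance between the point and the points of the closest cluster it does not belong to.

The Silhouette score of a whole clustering is the average of this value over all data points.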
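
The formula can be checked by hand against scikit-learn. The sketch below is illustrative only: the helper name `silhouette_of_point` and the choice of K=3 on the Iris data set are assumptions. It computes (b-a)/max(a,b) for every point, taking a as the mean distance to the other points in the point's own cluster and b as the mean distance to the points of the nearest other cluster, and compares the result with `sklearn.metrics.silhouette_samples`.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import pairwise_distances, silhouette_samples

X = load_iris().data
labels = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(X)
D = pairwise_distances(X)  # Euclidean distances between all pairs of points

def silhouette_of_point(i, D, labels):
    """Silhouette value (b - a) / max(a, b) for the point at index i."""
    own = labels == labels[i]
    if own.sum() == 1:
        return 0.0  # convention for single-point clusters
    # a: mean distance to the other points in the same cluster
    # (the point's distance to itself is 0, so divide by size - 1).
    a = D[i, own].sum() / (own.sum() - 1)
    # b: smallest mean distance to the points of any other cluster.
    b = min(D[i, labels == c].mean() for c in np.unique(labels) if c != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_of_point(i, D, labels) for i in range(len(X))])
reference = silhouette_samples(X, labels)
print("max abs difference:", np.abs(manual - reference).max())
print("average silhouette:", manual.mean())
```

Averaging the per-point values gives the single score that `sklearn.metrics.silhouette_score` reports for the clustering.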

 

How to Use the Elbow Method in Python

Let’s compare the elbow method and the silhouette score using the Iris data set. We’ll start with creating an elbow curve in Python.

The elbow curve can be created using the following code:

# Install yellowbrick to visualize the elbow curve (notebook syntax;
# run "pip install yellowbrick" in a terminal otherwise)
!pip install yellowbrick

from sklearn import datasets
from sklearn.cluster import KMeans
from yellowbrick.cluster import KElbowVisualizer

# Load the Iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target  # true species labels; not used for clustering

# Instantiate the clustering model and visualizer
# (n_init is set explicitly to avoid version-dependent defaults)
km = KMeans(n_init=10, random_state=42)
visualizer = KElbowVisualizer(km, k=(2,10))  # tries K = 2 through 9

visualizer.fit(X)        # Fit the data to the visualizer
visualizer.show()        # Finalize and render the figure
The elbow plot finds the elbow point at K=4. | Image: Anmol Tomar

The above graph selects an elbow point at K=4, but K=3 also looks like a plausible elbow point. So, it’s not clear which K is the true elbow point.

 

How to Use the Silhouette Method in Python

Let’s validate the value of K with Silhouette plots, using the code below.

from sklearn import datasets
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
from yellowbrick.cluster import SilhouetteVisualizer

# Load the Iris data set
iris = datasets.load_iris()
X = iris.data
y = iris.target  # true species labels; not used for clustering

# One subplot per K on a 2x2 grid
fig, ax = plt.subplots(2, 2, figsize=(15,8))
for i in [2, 3, 4, 5]:
    # Create a KMeans instance for this number of clusters
    km = KMeans(n_clusters=i, init='k-means++', n_init=10, max_iter=100, random_state=42)
    # Map K = 2..5 to grid positions (0,0), (0,1), (1,0), (1,1)
    q, mod = divmod(i, 2)
    # Create a SilhouetteVisualizer with the KMeans instance and fit it
    visualizer = SilhouetteVisualizer(km, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(X)

plt.show()
Silhouette Plot for K = 2 to 5. | Image: Anmol Tomar

The average silhouette score is highest (0.68) for K=2, but that alone is not sufficient to select the optimal K.
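
The average scores can also be computed directly, without reading them off the plots. The following is a minimal sketch using `sklearn.metrics.silhouette_score`. The variable names are illustrative, and the `KMeans` settings (`n_init=10`, `random_state=42`) are assumptions chosen to resemble the plotting code above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score

X = load_iris().data

avg_silhouette = {}
for k in [2, 3, 4, 5]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    avg_silhouette[k] = silhouette_score(X, labels)
    print(f"K={k}: average silhouette = {avg_silhouette[k]:.2f}")

# K with the highest average score; on its own, not a sufficient criterion.
best_k = max(avg_silhouette, key=avg_silhouette.get)
print("Highest average silhouette at K =", best_k)
```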

The following conditions should be checked to pick the right ‘K’ using the Silhouette plots:

  1. For a particular K, all the clusters should have a Silhouette score greater than the average score of the data set represented by the red-dotted line. The x-axis represents the Silhouette score. The clusters with K=4 and 5 get eliminated because they don’t follow this condition.
  2. There shouldn’t be wide fluctuations in the size of the clusters. The width of each cluster in the plot represents its number of data points. For K=2, the blue cluster is almost twice as wide as the green cluster. This blue cluster gets broken down into two sub-clusters for K=3, which gives clusters of more uniform size.
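
The second condition can be checked numerically by counting the points in each cluster. This is a rough sketch under the same assumed `KMeans` settings as the plots. The variable names are illustrative, and the size ratio is only one informal way to express "wide fluctuations".

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

X = load_iris().data

cluster_sizes = {}
for k in [2, 3]:
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    # Number of data points per cluster (the bar widths in the silhouette plot).
    cluster_sizes[k] = np.bincount(labels, minlength=k).tolist()
    ratio = max(cluster_sizes[k]) / min(cluster_sizes[k])
    print(f"K={k}: sizes={cluster_sizes[k]}, largest/smallest={ratio:.2f}")
```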

So, the silhouette plot approach gives us K=3 as the optimal value, and that is the K we should select for the final clustering on the Iris data set.

import plotly.graph_objects as go  # for the 3D plot

# K-means using K = 3 (reuses X and KMeans from the previous code;
# random_state makes the cluster labels reproducible)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# 3D plot of the first three features, colored by cluster label
scene = dict(
    xaxis=dict(title='sepal_length -->'),
    yaxis=dict(title='sepal_width--->'),
    zaxis=dict(title='petal_length-->'),
)
trace = go.Scatter3d(
    x=X[:, 0], y=X[:, 1], z=X[:, 2],
    mode='markers',
    marker=dict(color=labels, size=10, line=dict(color='black', width=10)),
)
layout = go.Layout(margin=dict(l=0, r=0), scene=scene, height=800, width=800)
fig = go.Figure(data=[trace], layout=layout)
fig.show()
3D plot of clusters. | Image: Anmol Tomar

I also validated the output clusters by checking the distribution of the input features within the clusters.
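
One way to run such a check is sketched below with pandas. This is not necessarily the exact validation used for this article. The DataFrame layout and variable names are assumptions, and the cross-tabulation against species is only possible because Iris is a labeled benchmark data set.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
df["cluster"] = KMeans(n_clusters=3, n_init=10, random_state=42).fit_predict(iris.data)

# Mean of every input feature within each cluster.
cluster_means = df.groupby("cluster").mean()
print(cluster_means.round(2))

# Compare clusters with the known species (real clustering problems
# usually have no such labels to compare against).
species = pd.Series(iris.target_names[iris.target], name="species")
species_table = pd.crosstab(df["cluster"], species)
print(species_table)
```

If the clusters are meaningful, the per-cluster feature means should differ clearly from one cluster to the next.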


 

Elbow Method vs. Silhouette Method

The elbow curve and Silhouette plots are both very useful techniques for finding the optimal K for k-means clustering. In real-world data sets, you will find quite a lot of cases where the elbow curve is not sufficient to find the right ‘K’. In such cases, you should use the silhouette plot to figure out the optimal number of clusters for your data set.

I would recommend using both techniques together to figure out the optimal K for k-means clustering.
