Clustering is an unsupervised machine learning technique that groups data points based on the similarity between them. The data points are grouped by finding similar patterns/features such as shape, color, behavior, etc. of the data points.
4 Uses For Clustering
- Market Segmentation
- Anomaly Detection
- Image segmentation
- Geo-Spatial Analysis
For a day-to-day life example of clustering, consider a store such as Walmart, where similar items are grouped together.
There are different types of clustering algorithms, including. centroid-based clustering algorithms, connectivity-based clustering algorithms (hierarchical clustering), distribution-based clustering algorithms and density-based clustering algorithms.
In this article, we will discuss connectivity-based clustering algorithms, also called hierarchical clustering. It is based on the core idea that similar objects lie nearby to each other in a data space while others lie far away. It uses distance functions to find nearby data points and group the data points together as clusters.
There are two major types of approaches in hierarchical clustering:
- Agglomerative clustering: Divide the data points into different clusters and then aggregate them as the distance decreases.
- Divisive clustering: Combine all the data points as a single cluster and divide them as the distance between them increases.
Agglomerative clustering is a bottom-up approach. It starts clustering by treating the individual data points as a single cluster then it is merged continuously based on similarity until it forms one big cluster containing all objects. It is good at identifying small clusters.
The steps for agglomerative clustering are as follows:
- Compute the proximity matrix using a distance metric.
- Use a linkage function to group objects into a hierarchical cluster tree based on the computed distance matrix from the above step.
- Data points with close proximity are merged together to form a cluster.
- Repeat steps 2 and 3 until a single cluster remains.
The pictorial representation of the above steps would be:
In the above figure,
- The data points 1,2,...6 are assigned to each individual cluster.
- After calculating the proximity matrix, based on the similarity the points 2,3 and 4,5 are merged together to form clusters.
- Again, the proximity matrix is computed and clusters with points 4,5 and 6 are merged together.
- And again, the proximity matrix is computed, then the clusters with points 4,5,6 and 2,3 are merged together to form a cluster.
- As a final step, the remaining clusters are merged together to form a single cluster.
Proximity Matrix and Linkage
The proximity matrix is a matrix consisting of the distance between each pair of data points. The distance is computed by a distance function. Euclidean distance is one of the most commonly used distance functions.
The above proximity matrix consists of n points named x, and the d(xi,xj) represents the distance between the points.
In order to group the data points in a cluster, a linkage function is used where the values in the proximity matrix are taken and the data points are grouped based on similarity. The newly formed clusters are linked to each other until it forms a single cluster containing all the data points.
The most common linkage methods are as follows:
- Complete linkage: The maximum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters.
- Single linkage: The minimum of all pairwise distance between elements in each pair of clusters is used to measure the distance between two clusters.
- Average linkage: The average of all pairwise distances between elements in each pair of clusters is used to measure the distance between two clusters.
- Centroid linkage: Before merging, the distance between the two clusters’ centroids are considered.
- Ward’s Method: It uses squared error to compute the similarity of the two clusters for merging.
Agglomerative Clustering Implementation
Agglomerative clustering can be implemented in Python using sklearn and scipy. Let’s implement Agglomerative clustering on the Iris dataset. The dataset can be found here. As a first step, import the necessary libraries and read the dataset.
import pandas as pd import numpy as np data_path = "/content/Iris.csv" #read the data df = pd.read_csv(data_path) #select only feature columns X = df[['SepalLengthCm', 'SepalWidthCm', 'PetalLengthCm', 'PetalWidthCm']]
Now we can use agglomerative clustering class from sklearn to cluster the data points.
#import the class from sklearn.cluster import AgglomerativeClustering #instantiate the model model = AgglomerativeClustering(n_clusters = 3, affinity = 'euclidean', linkage = 'ward') #fit the model and predict the clusters y_pred = model.fit_predict(X)
In the above code, at first we import the agglomerative clustering class and instantiate the model with the required parameters. We use the clusters of three since there are three classes in the Iris dataset and we use the ward linkage function with the euclidean function as a distance metric which is specified in affinity parameter.
Since we know the number of species in the Iris dataset, we were able to specify the number of clusters in the parameter but in most cases of unsupervised learning, we don’t know the number of clusters beforehand. So how can we find the optimal number of clusters for hierarchical clustering?
The way to find the optimal number of clusters in hierarchical clustering is to use a dendrogram chart. Let’s see how we can identify an optimal number of clusters for the Iris dataset. For this, we can use the scipy library in Python.
#import the necessary libraries from scipy.cluster.hierarchy import dendrogram, linkage import matplotlib.pyplot as plt plt.figure(figsize=(15,6)) plt.title('Dendrogram') plt.xlabel('Flowers') plt.ylabel('Euclidean distances') #create linkage matrix link_matrix = linkage(X,method='ward') dendrogram = dendrogram(link_matrix) plt.savefig("iris.png") plt.show()
In the above code, the dendrogram chart is created as follows:
- Import the necessary libraries.
- Create a linkage matrix with the linkage function from scipy. Here we use the ward method.
- Pass the linkage matrix to the dendrogram function to plot the chart.
The chart is shown here:
From the above chart we can visualize the hierarchical technique, so how to find the optimal number of clusters from the above chart?
To find it, draw a horizontal line where there is no overlap in the vertical lines of the bars. The number of bars without the overlap below the line is the optical number of the clusters. Refer to the figure below for a clear illustration.
From the above figure, we have three bars below the horizontal line, so the optimal number of clusters is three. Also, if you recall, the Iris dataset has three classes and we got the same number from the above chart.
Divisive clustering works just the opposite of agglomerative clustering. It starts by considering all the data points into a big single cluster and later on splitting them into smaller heterogeneous clusters continuously until all data points are in their own cluster. Thus, they are good at identifying large clusters. It follows a top-down approach and is more efficient than agglomerative clustering. But, due to its complexity in implementation, it doesn’t have any predefined implementation in any of the major machine learning frameworks.
Steps in Divisive Clustering
Consider all the data points as a single cluster.
- Split into clusters using any flat-clustering method, say K-Means.
- Choose the best cluster among the clusters to split further, choose the one that has the largest Sum of Squared Error (SSE).
- Repeat steps 2 and 3 until a single cluster is formed.
In the above figure,
- The data points 1,2,...6 are assigned to large cluster.
- After calculating the proximity matrix, based on the dissimilarity the points are split up into separate clusters.
- The proximity matrix is again computed until each point is assigned to an individual cluster.
The proximity matrix and linkage function follow the same procedure as agglomerative clustering, As the divisive clustering is not used in many places, there is no predefined class/function in any Python library.
Limits of Hierarchical Clustering
Hierarchical clustering isn’t a fix-all; it does have some limits. Among them:
- It has high time and space computational complexity. For computing proximity matrix, the time complexity is O(N2), since it takes N steps to search, the total time complexity is O(N3)
- There is no objective function for hierarchical clustering.
- Due to high time complexity, it cannot be used for large datasets.
- It is sensitive to noise and outliers since we use distance metrics.
- It has difficulty handling large clusters.
Clustering helps with the analysis of an unlabelled dataset to group the data points based on their similarity. In terms of business needs, clustering helps quickly segment customers and get insightful decisions.
In this article, we discussed hierarchical clustering, which is a type of unsupervised machine learning algorithm that works by grouping clusters based on the distance measures and similarity. We also learned about the types of hierarchical clustering, how it works and implementing the same using Python.