Cluster analysis is a data analysis method that groups (or clusters) objects in a data set that are closely associated with one another. Each cluster is formed around characteristics (or properties) that its members share, so that objects in the same group are more similar to each other than to objects in other groups. In other words, clustering is a process that organizes items into groups using unsupervised machine learning algorithms; no labeled training data is required.
Cluster analysis is a useful and straightforward tool for understanding data patterns. The main goal of clustering is to group objects so that items in the same cluster are more similar to one another than to items in other clusters. We can also use cluster analysis to identify anomalies or outliers: cases that stand out from the rest of the data. Anomalies often point to areas or cases that need further investigation; for example, banks use anomaly detection to fight fraud.
When Is Cluster Analysis Useful?
Cluster analysis helps us understand data and detect patterns. In some cases, it provides a starting point for further analysis; in others, it delivers the key insights on its own. Here are some cases where cluster analysis is more appropriate than other methods, such as standard deviation or correlation.
Should I Use Cluster Analysis?
- If you have a large, unstructured data set, labeling groups manually can be expensive and time-consuming. In this case, cluster analysis provides a practical way to divide your data into groups.
- When you don’t know the number of clusters in advance, cluster analysis can provide the first insight into groups that are available in your data set.
- When you need to detect outliers in your data, cluster analysis provides an effective method compared to traditional outlier detection methods, such as standard deviation.
- Cluster analysis can help you detect anomalies. While outliers are observations that lie far from the mean, they don’t necessarily represent abnormalities. Anomaly detection, on the other hand, focuses on identifying rare events or observations that deviate significantly from expected patterns.
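As a point of comparison, the traditional standard-deviation approach mentioned above can be sketched in a few lines of pure Python. The data values below are made up for illustration; the function simply flags values more than a chosen number of standard deviations from the mean.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]

data = [10, 12, 11, 13, 12, 11, 95]  # 95 sits far from the rest
print(zscore_outliers(data, threshold=2.0))  # [95]
```

This works well for one-dimensional, roughly bell-shaped data, but it struggles when the data forms several groups; that is where cluster-based approaches shine.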
Applications of Cluster Analysis
Cluster analysis has applications in many disparate industries and fields. Here’s a list of some disciplines that make use of this methodology.
- Marketing: Cluster analysis is popular in marketing, especially in customer segmentation. This method of analysis helps to both target customer segments and perform sales analysis by groups.
- Business Operations: Businesses can optimize their processes and reduce costs by analyzing clusters and identifying similarities and differences between data points. For example, you can identify patterns in customer data and improve customer support processes for a particular group that may require special attention.
- Earth Observation: Using a clustering algorithm, you can create a pixel mask for objects in an image. For example, you can use image segmentation to classify vegetation or built-up areas in a satellite image.
- Data Science: We can use cluster analysis for predictive analytics. By applying machine learning techniques to clusters, we can create predictive models to make inferences about a particular data set.
Types of Clustering Methods
Centroid-based clustering and density-based clustering are two of the most widely used clustering methods.
Centroid-based clustering builds clusters around a central point (the centroid), which may or may not be an actual member of the data set. The most common centroid-based algorithm is K-means, which divides the data set into k clusters; each data point belongs to the cluster with the nearest mean (centroid).
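The two alternating steps of K-means (assign each point to its nearest centroid, then move each centroid to the mean of its cluster) can be sketched in pure Python. This is a minimal illustration with made-up points, not a production implementation; real projects would typically use a library such as scikit-learn.

```python
import math
import random

def kmeans(points, k, iterations=100, seed=0):
    """Minimal K-means sketch: alternate assignment and update steps."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # pick k distinct points as initial centroids
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its assigned points.
        centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
    return centroids, clusters

points = [(1, 1), (1.5, 2), (0.5, 1.5), (8, 8), (9, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # the two natural groups: [2, 3]
```

Note that the final clusters can depend on which points are chosen as initial centroids, which is why libraries often restart K-means several times and keep the best result.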
Density-based clustering groups points based on the density of the data. Clusters are defined by a threshold: the minimum number of points that must fall within a given radius. Density-based clustering is an effective way to identify noise and separate it from the clusters. The most widely used density-based algorithm is density-based spatial clustering of applications with noise (DBSCAN).
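To make the radius-and-threshold idea concrete, here is a minimal pure-Python sketch of the DBSCAN idea (the point coordinates are made up for illustration). Core points, those with at least `min_pts` neighbors within radius `eps`, seed clusters; points reachable from no core point are labeled noise with `-1`.

```python
import math

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: grow clusters from core points;
    unreachable points are labeled as noise (-1)."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # All points (including i itself) within radius eps of point i.
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        labels[i] = cluster_id
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            if len(neighbors(j)) >= min_pts:  # j is also a core point: expand
                queue.extend(neighbors(j))
        cluster_id += 1
    return labels

points = [(0, 0), (0.5, 0), (0, 0.5),
          (10, 10), (10.5, 10), (10, 10.5),
          (50, 50)]
print(dbscan(points, eps=1.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

Unlike K-means, this does not require choosing the number of clusters in advance, and the isolated point at (50, 50) is correctly flagged as noise rather than forced into a cluster.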
Example of Cluster Analysis
The following example shows how to use a centroid-based clustering algorithm (K-means) to cluster 30 different points into five groups. You can plot the points on a two-dimensional graph, as shown in the graphs below.
On the left, we have a random distribution of the 30 points. The first iteration of the K-means clustering divides the points into five groups, with each cluster represented by a different color, as shown in the center graph.
The algorithm then iterates, reassigning points between clusters and recomputing each cluster’s centroid, until the assignments stop changing. The end result is five distinct clusters, as shown in the graph on the right.
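The iteration described above can be reproduced end to end in a short script. This sketch generates 30 synthetic points around five hypothetical group centers (the coordinates are made up for illustration) and runs K-means until the assignments stabilize, mirroring the three stages in the graphs: random points, a first partition, and the converged clusters.

```python
import math
import random

rng = random.Random(42)
# 30 synthetic points: 6 drawn around each of five hypothetical centers.
centers = [(0, 0), (10, 0), (0, 10), (10, 10), (5, 5)]
points = [(cx + rng.gauss(0, 1), cy + rng.gauss(0, 1))
          for cx, cy in centers for _ in range(6)]

k = 5
centroids = rng.sample(points, k)
assignment = None
for _ in range(100):  # K-means converges quickly; cap iterations as a safeguard
    # Assign each point to its nearest centroid.
    new_assignment = [min(range(k), key=lambda i: math.dist(p, centroids[i]))
                      for p in points]
    if new_assignment == assignment:  # assignments stable: converged
        break
    assignment = new_assignment
    # Move each centroid to the mean of its assigned points.
    for i in range(k):
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            centroids[i] = (sum(x for x, _ in members) / len(members),
                            sum(y for _, y in members) / len(members))

print(sorted(assignment.count(i) for i in range(k)))  # points per cluster
```

With well-separated centers like these, the converged clusters usually match the five generating groups, though K-means offers no guarantee of that for arbitrary data or initializations.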