A Holistic Approach to Performance Evaluation Metrics for Clustering

Ugur Selim Ozen
Teknasyon Engineering

--


In this article, I will cover evaluation metrics for unsupervised learning, including twin-sample validation, which can be used to compare the performance of unsupervised machine learning models.

Definition of Unsupervised Learning

From a general perspective, we can categorize machine learning problems under three main topics: supervised learning, unsupervised learning, and semi-supervised learning. Unsupervised learning is a method that uses machine learning algorithms to analyze and cluster unlabeled datasets. In this method, algorithms discover hidden patterns and group similar records without the need for additional information. Its ability to find similarities and differences in data makes it an ideal solution for real-life problems such as customer segmentation, spam detection, and recommendation engines.

Figure 1: Supervised vs Unsupervised Learning

(The sample dataset mentioned below was created using real spam data.)

Unsupervised Learning Method — Clustering

Clustering is the most popular unsupervised learning method. It divides the data points into several groups such that objects in the same group are more similar to each other than to objects in other groups.

There are many clustering algorithms, and each comes in two variants: a class that implements the fit method to learn the clusters on train data, and a function that, given train data, returns an array of integer labels corresponding to the different clusters. For the class variant, the labels over the training data can be found in the labels_ attribute.

→ Clustering: new points may be assigned to a cluster and could also be outliers.

→ Partitioning: new points must be assigned to an existing cluster, not a new cluster.

Figure 2: Classification vs Clustering

Clustering Algorithm — KMeans

K-means is a partitioning algorithm that divides the data into k clusters. Data points are assigned to a cluster based on a metric such as the Euclidean distance to the nearest cluster centroid. We can define the workflow of the algorithm as follows:

  1. Choose k centroids
  2. Assign points to clusters based on the nearest centroid
  3. Recompute centroids
  4. Repeat steps 2 and 3 until the algorithm converges
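
A minimal sketch of this workflow with scikit-learn's KMeans is given below; X stands in for the sample dataset's feature matrix and is simulated with random points here, since the real spam data is not shown.

    import numpy as np
    from sklearn.cluster import KMeans

    # Stand-in for the sample dataset's feature matrix; random points
    # are used only to keep the sketch runnable.
    rng = np.random.default_rng(42)
    X = rng.normal(size=(500, 2))

    # Steps 1-4 all happen inside fit(): centroids are initialized,
    # points are assigned to the nearest centroid, and centroids are
    # recomputed until the algorithm converges.
    kmeans_model = KMeans(n_clusters=3, n_init=10, random_state=42)
    labels = kmeans_model.fit_predict(X)
    centroids = kmeans_model.cluster_centers_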

Finding Optimum k Value for the Clustering Algorithm

The K-means algorithm aims to choose centroids that minimize the inertia, or within-cluster sum-of-squares criterion; when points lie close to their cluster centroids, we can say the clustering is good. To achieve this, however, we need to choose the optimum value of k. There are many ways to do this; other methods will be covered in the next section, but one is given below.

Elbow Method

The Elbow Method is an empirical method for finding the optimal number of clusters for a dataset. In this method, we pick a range of candidate values of k and apply K-means clustering for each of them. We then find the average distance of each point in a cluster to its centroid, plot it against k, and pick the value of k where the average distance falls suddenly.

You can see an example implementation of the Elbow Method below.
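
This is a minimal sketch reusing the feature matrix X from the sketch above; the original snippet may differ in detail.

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans

    # Fit K-means for a range of candidate k values and record the
    # inertia (within-cluster sum of squares) of each fitted model.
    inertia_values = []
    k_range = range(1, 11)
    for k in k_range:
        kmeans_model = KMeans(n_clusters=k, n_init=10, random_state=42)
        kmeans_model.fit(X)
        inertia_values.append(kmeans_model.inertia_)

    # Plot inertia against k; the "elbow" marks the optimum k.
    plt.plot(k_range, inertia_values, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Inertia")
    plt.title("Elbow Method")
    plt.show()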

In this code snippet, we evaluate the inertia values of the cluster centroids for the elbow method with kmeans_model.inertia_. A good model is one with low inertia and a low number of clusters (k). However, this is a tradeoff, because inertia decreases as k increases.

Figure 3: Elbow method result

As seen from Figure 3, we can choose 3 as the optimum value of k.

Internal Validation

Most of the methods of internal validation combine cohesion within each cluster and separation between different clusters to estimate the validation score. The approach is to compute the validation score of each cluster and then combine them in a weighted manner to arrive at the final score for the set of clusters.

Figure 4: Cohesion vs Separation

In practice, instead of dealing with the two separate metrics of cohesion and separation, several measures are available that combine both into a single score. Some important examples of such measures are:

Calinski-Harabasz Index

The Calinski-Harabasz index — also known as the Variance Ratio Criterion — can be used to evaluate the model. A higher Calinski-Harabasz score relates to a model with better-defined clusters.

The index is the ratio of the sum of between-clusters dispersion and the sum of within-cluster dispersion for all clusters, where dispersion is defined as the sum of squared distances. The Calinski-Harabasz index visualization is as follows:
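
A minimal sketch of such an evaluation, scanning a range of k values with scikit-learn's calinski_harabasz_score and reusing the feature matrix X from above:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import calinski_harabasz_score

    # Compute the Calinski-Harabasz score for each candidate k.
    ch_scores = []
    k_range = range(2, 11)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        ch_scores.append(calinski_harabasz_score(X, labels))

    # Higher is better: look for the k with the highest score.
    plt.plot(k_range, ch_scores, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Calinski-Harabasz score")
    plt.show()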

Figure 5: Calinski-Harabasz Index result

Davies-Bouldin Index

A lower Davies-Bouldin index relates to a model with better separation between the clusters. The index indicates the average 'similarity' between clusters, where similarity is a measure that compares the distance between clusters with the size of the clusters themselves.

Zero is the lowest possible score, and values closer to zero indicate a better partition. The Davies-Bouldin index visualization is as follows:
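
A minimal sketch, analogous to the one above but using scikit-learn's davies_bouldin_score:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import davies_bouldin_score

    # Compute the Davies-Bouldin score for each candidate k.
    db_scores = []
    k_range = range(2, 11)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        db_scores.append(davies_bouldin_score(X, labels))

    # Lower is better: look for the k with the score closest to zero.
    plt.plot(k_range, db_scores, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Davies-Bouldin score")
    plt.show()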

Figure 6: Davies-Bouldin Index result

Silhouette Coefficient

The Silhouette Coefficient is another such measure; a higher score relates to a model with better-defined clusters. It is defined for each sample and is composed of two scores:

  • a: The mean distance between a sample and all other points in the same cluster.
  • b: The mean distance between a sample and all other points in the next nearest cluster.

The Silhouette Coefficient s for a single sample is then given as:

s = (b - a) / max(a, b)

The Silhouette score for a set of samples is given as the mean of the Silhouette Coefficient of each sample. The implementation and visualization are given below:
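
A minimal sketch, scanning a range of k values with scikit-learn's silhouette_score and reusing X:

    import matplotlib.pyplot as plt
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    # Mean Silhouette Coefficient over all samples for each candidate k.
    sil_scores = []
    k_range = range(2, 11)
    for k in k_range:
        labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        sil_scores.append(silhouette_score(X, labels))

    # Higher is better: pick the k with the highest mean silhouette score.
    plt.plot(k_range, sil_scores, marker="o")
    plt.xlabel("Number of clusters (k)")
    plt.ylabel("Mean silhouette score")
    plt.show()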

Figure 7: Silhouette Score result

In this sample model, silhouette analysis is used to choose an optimal value for n_clusters. The plot in Figure 7 shows that an n_clusters value of 7 is a good pick for the sample dataset, since every cluster has an above-average silhouette score.

Twin-Sample Validation

In this section, we explain how to further validate the results of an unsupervised learning model in the absence of true cluster labels. We will validate the results of the K-means clustering we have already performed. The approach consists of the following four steps.

Creating a Twin-Sample Dataset

This is the most important step in the process of performing twin-sample validation. The key idea is to create a sample of records that are expected to show similar behavior to the sample dataset. This is similar to a validation set for supervised learning, but with some additional constraints. First, the twin sample should come from the same distribution as the sample dataset. Second, it should sufficiently cover most of the patterns observed in the training set. To verify these constraints, we can apply a statistical test such as the Kolmogorov-Smirnov (KS) test, as follows:
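
A minimal sketch of such a test, assuming sample_df and twin_df are hypothetical pandas DataFrames holding the sample dataset and the candidate twin sample, with the column names taken from the output below:

    from scipy.stats import ks_2samp

    # sample_df: the original sample dataset; twin_df: the candidate
    # twin sample. Both are assumed to share the same numeric columns.
    for column in ["answered_ratio", "avg_duration"]:
        result = ks_2samp(sample_df[column], twin_df[column])
        print(f"{column} : {result}")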

It gives the following test results:

answered_ratio : KstestResult(statistic=0.0048444194043421596, pvalue=0.7030589537367535)
avg_duration : KstestResult(statistic=0.0037275378487697797, pvalue=0.9300070148409181)

→ Since the statistic values are very small and the p-values are high, we cannot reject the hypothesis that our sample dataset and the extracted twin sample come from the same distribution.

Performing Unsupervised Learning on Twin-Sample

Now that we have our twin sample, the next step is to perform cluster learning on it. For this, we will use the same parameters that we used on our sample dataset, including the number of clusters, distance metric, etc. We will get a set of cluster labels as the output of this step, which we denote as S. The idea is that we should get similar results on the twin sample as we got on the sample dataset; this similarity will be measured in the subsequent steps. The implementation is given below:
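
A minimal sketch, assuming the twin sample lives in the hypothetical twin_df DataFrame and reusing the hyperparameters chosen for the sample dataset:

    from sklearn.cluster import KMeans

    # Re-run K-means on the twin sample with the same hyperparameters as
    # on the sample dataset (n_clusters=7 from the silhouette analysis).
    twin_kmeans = KMeans(n_clusters=7, n_init=10, random_state=42)
    S = twin_kmeans.fit_predict(twin_df)  # output set S: twin-sample cluster labels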

Importing Results for Twin-Sample from the Training Set

In this step, we will compute another set of cluster labels on the twin sample, this time using the results of the clustering performed on the sample dataset. For each point in the twin sample, we will perform the following two steps:

  • Identify its nearest neighbor in the sample dataset. Make sure the distance metric is the same as the one used in the clustering process.
  • Import the cluster label of its nearest neighbor.

The Python implementation is as follows:
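
A minimal sketch, assuming sample_kmeans is the K-means model fitted on the sample dataset and sample_df / twin_df are the hypothetical DataFrames from before:

    import pandas as pd
    from sklearn.neighbors import NearestNeighbors

    # Fit a 1-nearest-neighbor index on the sample dataset, using the
    # same (Euclidean) metric as the K-means clustering.
    nn_model = NearestNeighbors(n_neighbors=1, metric="euclidean")
    nn_model.fit(sample_df)

    # For each twin-sample point, look up its nearest neighbor in the
    # sample dataset and import that neighbor's cluster label.
    _, nn_indices = nn_model.kneighbors(twin_df)
    P = sample_kmeans.labels_[nn_indices.ravel()]

    results = pd.DataFrame({"S": S, "P": P})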

The results dataframe gives us the S and P sets, which are the predictions from the K-means model and the true values from the NearestNeighbors model, respectively.

Calculating the Similarity between the Two Result Sets

Now that we have two sets of cluster labels, S and P, for the twin sample, we can compute their similarity by using any supervised learning metric such as F1-score, Jaccard similarity, etc.

A set of clusters having high similarity with its twin sample is considered good. The evaluation of similarity is implemented as given below:
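
A minimal sketch using the results dataframe from the previous step; note that label-based metrics assume the cluster IDs of the two runs correspond to each other:

    from sklearn.metrics import f1_score, jaccard_score

    # Treat P (imported labels) as the reference and S (twin-sample
    # clustering) as the prediction; macro averaging aggregates the
    # per-cluster scores.
    print("F1-score :", f1_score(results["P"], results["S"], average="macro"))
    print("Jaccard  :", jaccard_score(results["P"], results["S"], average="macro"))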

Figure 8: Twin-Sample Validation result

Conclusion

  • Twin-sample validation can be used to validate the results of unsupervised learning.
  • It should be used in combination with internal validation.
  • It can prove highly useful in the case of time-series data, where we want to ensure that our results remain consistent across time.
  • In this article, we used K-means clustering as an example to explain the process, but it is a general approach and can be adapted to any unsupervised learning algorithm, such as Isolation Forest or DBSCAN.
