Experience Extended | Clustering Validation Techniques 2019-09-04T11:12:08+00:00

Clustering Validation Techniques

This is the fourth in a series of Segmentation and Clustering articles. It can be difficult to define when a clustering result is acceptable. This article covers several clustering validity techniques and indices.

 

Abstract

Clustering is an unsupervised process in data mining and pattern recognition, and most clustering algorithms are very sensitive to their input parameters. Therefore, it is imperative to evaluate the algorithms’ outcome. Ideally, the resulting clusters should have good statistical properties (compact, well-separated, connected, and stable) and provide practically relevant results.

It is difficult to define when a clustering result is acceptable, so several clustering validity techniques and indices have been developed.

 

Conventional Practices

Analysts have conventionally used statistics such as F-values and silhouette coefficients, among others, for non-hierarchical or semi-hierarchical clustering solutions; they also run classification exercises on the new cluster membership to check whether the clusters are internally homogeneous and externally heterogeneous. But these metrics provide only limited diagnostic information about a cluster solution. To generalize cluster solutions, it is advisable to consider several complementary metrics.
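
As an illustration of one of these conventional statistics, the silhouette coefficient of a point compares its mean distance to the other members of its own cluster (a) with its smallest mean distance to any other cluster (b), giving s = (b - a) / max(a, b). The sketch below is a minimal pure-Python version that assumes Euclidean distance, points stored as tuples, and at least two clusters; the function name `silhouette_scores` is a hypothetical choice for this example, not something taken from a particular library.

```python
from math import dist  # Euclidean distance, Python 3.8+


def silhouette_scores(points, labels):
    """Return the silhouette coefficient s(i) for every point.

    Assumes Euclidean distance and at least two distinct cluster labels.
    """
    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # common convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        # (the point's distance to itself is 0, so it adds nothing to the sum).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(dist(p, q) for q in other) / len(other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return scores


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
good = silhouette_scores(points, [0, 0, 1, 1])  # compact, well-separated
bad = silhouette_scores(points, [0, 1, 0, 1])   # clusters cut across the gap
print(sum(good) / len(good), sum(bad) / len(bad))
```

Values near 1 indicate points that sit well inside their cluster, while negative values suggest a point is closer to a neighboring cluster than to its own. In this toy example the first labeling scores close to 1 and the second scores below 0.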

 

Validation Measures

The following are the most commonly used validity indices:

  • External Measures: Rand Statistic, Jaccard Coefficient, Hubert’s Γ Statistic, Normalized Γ Statistic, Fowlkes-Mallows Index, etc.
  • Internal Measures: Connectivity, Dunn Index, etc.
  • Stability Measures: Average Proportion of Non-Overlap (APN), Average Distance (AD), Average Distance between Means (ADM), Figure of Merit (FOM), etc.
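
To make the first two families concrete, here is a minimal sketch of three external measures (Rand, Jaccard, Fowlkes-Mallows) and one internal measure (the Dunn index). The external measures compare a clustering against a reference labeling by counting pairs of points that are grouped together or apart in each; the Dunn index needs only the data and the cluster labels. The sketch assumes Euclidean distance and the most common form of the Dunn index (smallest between-cluster distance divided by largest within-cluster diameter). All function names are hypothetical, and degenerate inputs that would cause a division by zero (for example, every point in its own cluster) are not handled.

```python
from itertools import combinations
from math import dist, sqrt


def pair_counts(truth, pred):
    """Classify every pair of points by how the two labelings treat it."""
    a = b = c = d = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            a += 1  # together in both labelings
        elif same_truth:
            b += 1  # together in the reference only
        elif same_pred:
            c += 1  # together in the clustering only
        else:
            d += 1  # apart in both labelings
    return a, b, c, d


def external_indices(truth, pred):
    """Rand statistic, Jaccard coefficient, and Fowlkes-Mallows index."""
    a, b, c, d = pair_counts(truth, pred)
    return {
        "rand": (a + d) / (a + b + c + d),
        "jaccard": a / (a + b + c),
        "fowlkes_mallows": a / sqrt((a + b) * (a + c)),
    }


def dunn_index(points, labels):
    """Smallest between-cluster distance over largest within-cluster distance."""
    min_between = float("inf")
    max_within = 0.0
    for (p, lp), (q, lq) in combinations(zip(points, labels), 2):
        d = dist(p, q)
        if lp == lq:
            max_within = max(max_within, d)
        else:
            min_between = min(min_between, d)
    return min_between / max_within


reference = [0, 0, 0, 1, 1, 1]
clustering = [0, 0, 1, 1, 1, 1]  # one point assigned to the wrong group
print(external_indices(reference, clustering))

points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(dunn_index(points, [0, 0, 1, 1]))
```

All three external measures equal 1 when the clustering matches the reference exactly, regardless of how the cluster labels are named. A larger Dunn index indicates clusters that are compact relative to the gaps between them.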

There are many other measures that help validate cluster solutions, including the C-Index, the Cubic Clustering Criterion (CCC), the D index, the SD index, the Point-Biserial index, and the Calinski-Harabasz (CH), Duda, Pseudo t², Gamma, Beale, Gplus, Davies-Bouldin, Frey, Hartigan, Tau, Ratkowsky-Lance, Scott, Marriott, Ball, TrCovW, TraceW, Friedman, McClain-Rao, Rubin, KL, Gap, and SDbw indices.
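
As one example from this longer list, the Calinski-Harabasz index is the ratio of between-cluster dispersion to within-cluster dispersion, each scaled by its degrees of freedom: CH = [B / (k - 1)] / [W / (n - k)], where n is the number of points and k the number of clusters. The following is a minimal sketch, assuming squared Euclidean distances to centroids, at least two clusters, and fewer clusters than points; the function name `calinski_harabasz` is a hypothetical choice for this example.

```python
def calinski_harabasz(points, labels):
    """Calinski-Harabasz index: [B / (k - 1)] / [W / (n - k)]."""
    n = len(points)
    dim = len(points[0])

    def centroid(pts):
        return tuple(sum(p[d] for p in pts) / len(pts) for d in range(dim))

    def sq_dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q))

    # Group the points by cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)

    overall = centroid(points)
    # B: size-weighted squared distances from cluster centroids to the overall centroid.
    between = sum(
        len(pts) * sq_dist(centroid(pts), overall) for pts in clusters.values()
    )
    # W: squared distances from each point to its own cluster centroid.
    within = 0.0
    for pts in clusters.values():
        center = centroid(pts)
        within += sum(sq_dist(p, center) for p in pts)

    return (between / (k - 1)) / (within / (n - k))


points = [(0, 0), (0, 1), (10, 0), (10, 1)]
print(calinski_harabasz(points, [0, 0, 1, 1]))  # well-separated labeling
print(calinski_harabasz(points, [0, 1, 0, 1]))  # poorly separated labeling
```

Higher values indicate denser, better-separated clusters, so the index is typically compared across candidate solutions rather than read against a fixed threshold.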

 


-Authored by Shivli Gupta, Data Scientist at Absolutdata

Technical articles are published from the Absolutdata Labs group, and hail from The Absolutdata Data Science Center of Excellence. These articles also appear in BrainWave, Absolutdata’s quarterly data science digest.
