This is the second article in a series on Segmentation and Clustering. Clustering is a widely used unsupervised learning technique. This article shows how to overcome the limitations of a conventional ‘typing tool’ with the modern approach of scoring a clustering model.
Conventional Approach – Building a Typing Tool
The standard practice after carrying out a clustering exercise is to develop a ‘typing tool’ or classification model on the clustering solution to determine which cluster a new data point belongs to. There are numerous drawbacks to this orthodox practice:
- The classification model is chosen based on the decision boundary, sample size, level of measurement of independent variables, etc. The underlying assumptions which a particular classification model follows also need to be met. Moreover, building a typing tool entails multiple iterations and an in-depth analysis of the degree of differentiation of variables across clusters, which is usually done manually and thus takes a lot of time.
- Typing tools can suffer from lower predictive accuracy, an inability to tolerate missing values, and a lack of robustness when applied to other samples.
Automated Approach – Scoring of Clustering Model
Since the clustering exercise can be carried out using various criteria, such as similarity (e.g. correlation), compactness (e.g. k-Means, mixture models), and connectivity (e.g. spectral clustering), it is very difficult for the analyst to figure out the optimal decision boundary for the classification model to use in order to predict cluster membership.
The best recourse could be a ready-made function capable of scoring new examples into the identified clusters, based on the cluster boundaries formed by a particular clustering algorithm. This alternative also overcomes the aforementioned limitations of typing tools. There are built-in packages available in Python and R which can score clustering models built with algorithms such as k-Means, kNN, and SVM. You can find examples of this approach in the technical article on Spectral Clustering.
Implementation in Python/R
Prediction of k-Means Cluster Membership
One can make predictions on new incoming data by calling the predict function of the k-Means instance and passing in an array of observations. The predict function calculates the distance of each new observation from the k centroids (cluster centers) of the k clusters. It then finds the cluster center the observation is closest to and outputs that center’s index in the centroid array as the predicted cluster label. An implementation of predict in R can be found in the ‘flexclust’ package.
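As a minimal sketch of this workflow in Python, the snippet below uses scikit-learn’s KMeans on a small, made-up 2-D dataset; the data, the choice of k = 2, and the random seed are all illustrative assumptions, not part of the original article.

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data: two well-separated groups of 2-D points (illustrative only)
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# Fit k-Means with k=2; n_init and random_state are illustrative choices
km = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Score new observations: predict() returns, for each point, the index of
# the nearest centroid in km.cluster_centers_ -- i.e. its cluster label
new_points = np.array([[1.1, 0.9], [8.1, 8.0]])
labels = km.predict(new_points)
```

Because scoring reuses the fitted centroids directly, no separate typing tool needs to be built or validated.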
Prediction of kNN (k-Nearest Neighbors) Cluster Membership
kNN makes predictions using the training dataset directly. A prediction for a new instance (x) is made by searching the entire training set for the k most similar instances (neighbors) and summarizing the output variable for those k instances (e.g. taking the mode for classification). To determine which k instances in the training dataset are most similar to the new input, a distance measure is used (e.g. Euclidean distance for real-valued input variables).
Technical articles are published from the Absolutdata Labs group, and hail from The Absolutdata Data Science Center of Excellence. These articles also appear in BrainWave, Absolutdata’s quarterly data science digest.