K-Means Clustering explained


Clustering

Clustering is an unsupervised learning technique used to find groups of similar observations in data.

Unsupervised learning means there is no output variable to guide the learning process (no this or that, no right or wrong); the algorithms explore the data to find patterns. We only observe the features but have no established measurements of the outcomes, since the outcomes are exactly what we want to discover.

As opposed to supervised learning, where your existing data is already labelled and you know which behaviour you want to detect in the new data you obtain, unsupervised learning techniques don't use labelled data, and the algorithms are left to themselves to discover structure in the data.

K-Means clustering

Within the universe of clustering techniques, K-means is probably one of the best known and most frequently used. K-means uses an iterative refinement method to produce its final clustering, based on the number of clusters (represented by the variable K) defined by the user and on the dataset itself. For example, if we set K equal to 3, our dataset will be grouped into 3 clusters.

K-means starts off with arbitrarily chosen data points as the proposed means of the data groups, and iteratively recalculates new means in order to converge to a final clustering of the data points.

But how does the algorithm decide how to group the data if you are only providing a value (K)? When you define the value of K, you are actually telling the algorithm how many means or centroids you want. A centroid is a data point that represents the center of a cluster, and it is not necessarily a member of the dataset.

This is how the algorithm works:

  1. K centroids are created randomly (based on the predefined value of K).
  2. K-means allocates every data point in the dataset to the nearest centroid (minimizing the Euclidean distance between them), meaning that a data point is considered to be in a particular cluster if it is closer to that cluster's centroid than to any other centroid.
  3. K-means then recalculates the centroids by taking the mean of all data points assigned to each centroid's cluster, thereby reducing the total intra-cluster variance relative to the previous step. The "means" in K-means refers to this averaging of the data to find the new centroids.
  4. The algorithm repeats steps 2 and 3 until some stopping criterion is met (e.g. the sum of distances between the data points and their corresponding centroids is minimized, a maximum number of iterations is reached, the centroids stop changing, or no data point changes cluster). A minimal code sketch of these steps follows below.

In general, for an n-dimensional space, the Euclidean distance used in step 2 is: $$ d(\mathbf{p}, \mathbf{q})=\sqrt{\sum_{i=1}^{n}\left(p_{i}-q_{i}\right)^{2}} $$
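
To make these steps concrete, here is a minimal from-scratch sketch in Python using only NumPy. The function name `kmeans` and all parameter choices are mine, for illustration only; this is a sketch of the steps above, not a production implementation:

```python
import numpy as np

def kmeans(X, k, max_iters=100, seed=0):
    """Minimal K-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)

    # Step 1: pick k random data points as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iters):
        # Step 2: assign each point to its nearest centroid
        # (Euclidean distance, computed via broadcasting).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 3: recompute each centroid as the mean of its assigned
        # points (assumes no cluster ends up empty; a real implementation
        # would re-seed empty clusters).
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])

        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids

    return labels, centroids
```

Calling `labels, centers = kmeans(X, k=3)` on a dataset X then reproduces steps 1 through 4 above.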

Another critical question with K-means is: how do we know the correct value of K?

There is no universal answer to this. One commonly used approach is to test different numbers of clusters and measure the resulting sum of squared errors, choosing the value of K at which an increase brings only a very small decrease in the error sum, while a decrease sharply increases it. The point that defines this optimal number of clusters is known as the "elbow point", and it can be used as a visual cue to find the best value of K.
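
As a sketch of this approach, assuming scikit-learn is available (its `inertia_` attribute is exactly this sum of squared distances; the dataset and the range of K values here are arbitrary):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Toy data whose "true" number of clusters we pretend not to know.
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for a range of K and record the sum of squared errors.
ks = range(1, 10)
sse = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_ for k in ks]

# The bend ("elbow") in this curve suggests a good value of K.
plt.plot(ks, sse, marker="o")
plt.xlabel("number of clusters K")
plt.ylabel("sum of squared errors (inertia)")
plt.show()
```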

But everything has a downside

One of the drawbacks of K-means is that it relies on random initialization, which means the outcome of the algorithm depends on the random seed: results may not be comparable and can lack consistency. By default, scikit-learn runs the algorithm 10 times with 10 different random initializations and returns the best result. Further downsides of K-means are the relatively restrictive assumptions it makes about the shape of clusters, and the requirement to specify the number of clusters you are looking for (which may not be known in a real-world application).
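
For example, in scikit-learn you can control both the number of restarts and the seed explicitly (the data and parameter values here are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # placeholder data

# n_init is the number of random initializations to try (the run
# with the lowest inertia is kept); fixing random_state makes the
# result reproducible across runs.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
```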

Implementation in Python
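
As a minimal end-to-end sketch of what such an implementation might look like with scikit-learn (the dataset below is synthetic, generated only for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Generate a toy 2-D dataset with 3 underlying groups.
X, _ = make_blobs(n_samples=500, centers=3, random_state=42)

# Fit K-means with K=3 and get a cluster label for every point.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

print("cluster sizes:", np.bincount(labels))
print("centroids:\n", kmeans.cluster_centers_)
```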



The overall implementation can be found in this GitHub repo.