K-Means Clustering

Phalguni
4 min read · Dec 2, 2020



K-means clustering is an unsupervised learning algorithm that partitions data into K pre-defined, distinct, non-overlapping subgroups (clusters), where each data point belongs to exactly one group. Unsupervised learning means that there is no outcome to be predicted; the algorithm simply tries to find patterns in the data.

Clustering refers to a very broad set of techniques for finding subgroups, or clusters, in a data set. When we cluster the observations of a data set, we seek to partition them into distinct groups so that the observations within each group are quite similar to each other, while observations in different groups are quite different from each other.

Of course, to make this concrete, we must define what it means for two or more observations to be similar or different. Indeed, this is often a domain-specific consideration that must be made based on knowledge of the data being studied.

Here are some examples of clustering problems -

  1. Cluster similar documents
  2. Cluster customers based on features
  3. Market Segmentation
  4. Identify similar physical groups

The overall goal is to form groups such that the observations within each group are similar to each other.

Here is a diagram of unlabeled data and the K-means algorithm trying to cluster the data into 5 differently colored clusters.

K-means clustering

The way the K-means algorithm works is as follows -

1. Randomly assign a number, from 1 to K, to each of the observations. These serve as initial cluster assignments for the observations.

2. Iterate until the cluster assignments stop changing:

(a) For each of the K clusters, compute the cluster centroid by taking the mean vector of the points in the cluster.

(b) Assign each observation to the cluster whose centroid is closest (where closest is defined using Euclidean distance).
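
To make these steps concrete, here is a minimal NumPy sketch of the procedure (the function and variable names are illustrative, not from any particular library):

```python
import numpy as np

def kmeans(X, k, n_iters=100, seed=0):
    """Minimal K-means on an (n_samples, n_features) array X."""
    rng = np.random.default_rng(seed)
    # Step 1: randomly assign each observation to one of the K clusters.
    labels = rng.integers(0, k, size=len(X))
    for _ in range(n_iters):
        # Step 2(a): each centroid is the mean of the points assigned to it.
        # (For simplicity, this sketch assumes no cluster ever becomes empty.)
        centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        # Step 2(b): reassign each point to its nearest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):  # assignments stopped changing
            break
        labels = new_labels
    return labels, centroids
```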

Progression of K-means algorithm

The progress of the K-means algorithm with K=3.

Top left: the observations are shown.

Top center: in Step 1 of the algorithm, each observation is randomly assigned to a cluster.

Top right: in Step 2(a), the cluster centroids are computed. These are shown as large colored disks. Initially the centroids are almost completely overlapping because the initial cluster assignments were chosen at random.

Bottom left: in Step 2(b), each observation is assigned to the nearest centroid.

Bottom center: Step 2(a) is once again performed, leading to new cluster centroids.

Bottom right: the results obtained after ten iterations.

Because the K-means algorithm finds a local rather than a global optimum, the results obtained will depend on the initial (random) cluster assignment of each observation in Step 1 of the algorithm. For this reason, it is important to run the algorithm multiple times from different random initial configurations and then select the best solution, i.e., the one with the lowest total within-cluster variation.
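
In practice, scikit-learn's KMeans handles this via its n_init parameter, which runs the algorithm that many times with different random initializations and keeps the run with the lowest within-cluster sum of squares (its inertia_). A short example, using placeholder data:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(42)
X = rng.normal(size=(300, 2))  # placeholder data; substitute your own

# n_init=10 runs K-means from 10 different random initializations and
# keeps the solution with the lowest within-cluster sum of squares.
km = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = km.fit_predict(X)
print(km.inertia_)  # WCSS of the best of the 10 runs
```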

Now, the question is how to choose the optimal number of clusters (K).

Well, there is no easy answer for selecting the best value of K. One common approach is the elbow method.

ELBOW METHOD

The elbow method is based on a quantity called “Within-Cluster Sum of Squares” (WCSS), which is given as -

WCSS = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2

where C_k is the set of points assigned to cluster k and \mu_k is the centroid of cluster k.

It is the sum of squared distances between each data point and the centroid of its cluster. The more clusters there are, the smaller the value of WCSS. The value of K beyond which increasing K no longer produces a significant decrease in WCSS is taken as the optimal value.
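
WCSS for a given assignment can be computed directly from this definition; a small sketch (the names here are illustrative):

```python
import numpy as np

def wcss(X, labels, centroids):
    """Sum of squared Euclidean distances from each point to its cluster centroid."""
    return sum(
        np.sum((X[labels == j] - centroids[j]) ** 2)
        for j in range(len(centroids))
    )
```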

Steps to be followed -

First, compute the sum of squared errors (SSE) for several values of k (e.g., 2, 4, 6, 8).

The SSE is defined as the sum of the squared distance between each member of the cluster and its centroid.

If we plot k against the SSE, we will see that the error decreases as k increases. This is because as the number of clusters grows, the clusters become smaller, so the distortion within each cluster is smaller too.

The idea is to choose the k at which the SSE stops decreasing abruptly; this bend in the curve produces an “elbow effect”.

Elbow effect

We can see an elbow occurring around k = 6 or 7, beyond which the SSE is almost constant. This indicates that increasing the value of k beyond this point yields little additional benefit.
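
A typical way to produce such a plot, assuming scikit-learn and matplotlib (the data here is a placeholder):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))  # placeholder data; substitute your own

# Fit K-means for each candidate k and record the SSE (inertia_).
ks = range(1, 11)
sse = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

plt.plot(ks, sse, marker="o")
plt.xlabel("Number of clusters k")
plt.ylabel("SSE (WCSS)")
plt.title("Elbow method")
plt.show()
```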

Note, however, that the elbow method does not always work well for selecting the number of clusters, because the error function is monotonically decreasing for all k and a distinct elbow may not appear.

Conclusion -

K-means clustering is one of the most popular clustering algorithms and is usually the first thing practitioners apply when solving clustering tasks, to get an idea of the structure of the dataset. The goal of K-means is to group data points into distinct, non-overlapping subgroups. K-means assumes clusters are roughly spherical (with radius equal to the distance between the centroid and the furthest data point) and does not work well when clusters have other shapes, such as elliptical clusters.
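
To see this limitation in practice, here is a small synthetic sketch (the data and names are illustrative) in which two elongated clusters are typically split incorrectly:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two elongated (elliptical) clusters, stretched along the x-axis and
# offset in y. K-means implicitly favors compact, spherical groups.
blob = rng.normal(size=(200, 2))
stretch = np.array([[4.0, 0.0], [0.0, 0.5]])
X = np.vstack([blob @ stretch, blob @ stretch + [0.0, 3.0]])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
# Inspecting the labels (or plotting them) typically shows the boundary
# cutting across the elongated clusters rather than separating them.
```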
