What is K-Means Clustering
K-means clustering is amongst the most popular unsupervised machine learning algorithms. Machine learning algorithms can be roughly divided into three major classes. These classes are supervised, unsupervised and reinforcement learning. In supervised setting a data along with its label is required in order to have an algorithm work properly.
In unsupervised type of algorithms the data is not required to have labels for the algorithm to find an underlying pattern. In a nutshell, unsupervised algorithms make inferences from the data using only input vectors without considering labels or outcomes. The reinforcement learning algorithms do not fall in these two categories as they work by maximizing a predefined reward function against an action.
Given the fact that k-means clustering is an unsupervised machine learning algorithm, therefore, our discussion will be revolving around the unsupervised machine learning and how k-means work.
K-Means Clustering Formula
The objective of K-means is to group similar data points together. By doing this we come to discover the underlying patterns in the data. More expressively, what we find is that in the whole data which data points relate with each other. On this basis we make groups of such related data points and this is called ‘clustering’.
The number of clusters in which we like to see the data grouped into is provided by us. Let us say we want to see the data grouped/clustered into five distinct groups/clusters. This value is k which is what the algorithm has been named after. The data points are clustered in k groups on the basis of their similarity with each other.
Using the technical jargon the same concept can be understood as follows. The target number k refers to the number of centroids
we need to make in the dataset. A centroid
is a location which represents the center of a cluster. Simply put, the algorithm works by calculating where to put the center of each cluster
in the data and how far each cluster is stretching from this centeroid/center in the problem space.
The K-means algorithm places k number of centroids in the data and allocates every data point to the nearest cluster. While doing so, it tries to keep the centroids as small as possible.
The ‘means’ in the K-means refers to averaging of the data; that is, finding the centroid.
K-Means Pseudo Code
The pseudo code of K-means algorithm is given as follows.
Here ‘K’ is the total number of centroids and ‘k’ is the current centroid under consideration. Whereas x(i) is the current data point under consideration.
The first inner loop assigns the most appropriate centroid to the example under consideration i.e x(i) by giving it the index of the centroid from which the example is at the smallest distance apart. The distance is calculated using a distance formula as shown below:
The ‘k’ for which we get the smallest distance will be the centroid to which this example will be assigned. A more conventional way is to take the square of the distance.
In the second loop of the pseudo code the new position of a centroid is being calculated. It is done by taking average of all the example belonging to this centroid. Consider, if x(1), x(5) and x(7) belong to a centroid (2) then the new position of (2) will be as follows.
We repeat these steps till we have no more position changes occurring
or we run out of maximum iteration limit. A blow by blow working of the algorithm is shown graphically hereunder
in Figure 1. The first figure
shows the spread of data. In the next one that is (b), we are randomly choosing two centroids. In (c) we are assigning a number of examples to each of the two centroids. In (d), the forth one, we are calculating the centroid positions and moving them to their new locations. Finally, in (d), we once again assign the examples to the appropriate centroids.
K-Means Clustering in Data Mining:
K means clustering has wide applications in the data mining. It is used to identify market segments. A noticeable point is that we do not always have data separated in the format as shown in the upper figures. We may have a continuous spread as shown below in the Figure 5.
Consider that it is representing the data of people against their weight and height parameters. The problem is to segment this data into three chunks so that a T-shirt manufacturer can have small, medium and large t-shirt sized for a specific range of weight and height. This problem is a vanilla data mining problem. Here we are mining for a hidden pattern in the data which will segment it with respect to the weight to height ratio of the given population.
Similarly, other applications of K means clustering in data mining can be found in finding similar or like minded people on social media and rearranging a data center by moving servers close together which tend to have similar activity thus reducing the power consumption and other costs.
K-Means Clustering in Python
Scikit-learn is a very handy Python library for implementing machine learning algorithms. Here we shall be performing a hands-on unsupervised clustering routine on a dummy data using this useful Python package. This code can be run using Jupyter Notebook or Pycharm IDE. For a novice it is recommended to use Anaconda which comes with pre-installed Python packages. This could be got from here. We recommend that you choose Anaconda distribution with Python 3.7 version as it is the latest one. Once installed, on running you get a screen like in below in Figure 3.
Click ’Launch’ Jupyter Notebook. This will launch a server and a browser window will open as under in Figure 4. Now, start a new notebook by clicking at New and selecting Python 3. This will open up a new notebook where you can start coding. In order to run each cell press shift+enter. For this example we are clustering the data into two clusters. That is, the value of K is 2.
K-Means Clustering in R
R is the classic language for data manipulation. Here is a hands-on routine for doing K means clustering in R. The code credits go here1 . We are using the famous iris dataset which comes pre loaded wit R-Studio. The number of clusters are set to three as three species are involved.
Launch R-Studio from Anaconda and run the following code in the console line by line as below. In case of having an error saying there is no package called ‘ggplot2’ please click on the adjacent tab of terminal and run this command to install the required package: conda install -c r r-ggplot2. Now run the instruction once again.
Note: Line number 8 and 9 (above) are representing one instruction. You can write them in one line as well. R’s console automatically knows that it is a single instruction and on pasting one line it automatically adds a ‘+’ sign so that next line can be added after it to make the instruction complete.