In this article, an R-Hadoop (rmr2) implementation of Distributed KMeans Clustering on a 2-d dataset is described.
- First, the dataset shown below is horizontally partitioned into 4 data subsets, which are then copied from the local filesystem to HDFS, as shown in the following animation. The dataset is deliberately small, since this is just a proof of concept.
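The horizontal partitioning step can be sketched as follows. This is an illustrative Python sketch, not the article's rmr2 code; the dataset here is randomly generated and the variable names are assumptions.

```python
import numpy as np

# Hypothetical 2-d dataset of 100 points (stand-in for the article's dataset).
rng = np.random.default_rng(0)
points = rng.normal(size=(100, 2))

# Horizontal partitioning: split the rows into 4 data subsets,
# each of which would be copied from local to HDFS.
subsets = np.array_split(points, 4)

print([s.shape for s in subsets])  # 4 subsets of 25 points each
```

Each subset then becomes the input of one map job in the clustering phase.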
- Next, the KMeans clustering algorithm is parallelized. The algorithm consists of 2 key steps:
- Cluster Assignment: each data point is assigned to the nearest cluster center. Since this step is independent for each data point, it can be implemented in the Map function (4 such map jobs are created, one per data subset), so the points in the 4 data subsets are assigned to their nearest cluster centers in parallel. Once the nearest centroid for a data point is found, the map emits the data point with the assigned cluster label as the key.
- Cluster Centroid (Re)Computation: the centroid of each cluster is recomputed from the points belonging to it. This is done in the Reduce function, where each cluster's data points arrive at the reducer as a collection keyed by the cluster label emitted by the map function.
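The two steps above can be sketched as plain map and reduce functions. This is a minimal Python sketch of the logic (the article's actual implementation uses rmr2 in R); the function names and the small example values are illustrative assumptions.

```python
import numpy as np

def assign_map(subset, centroids):
    """Map step: emit (nearest-cluster label, point) pairs for one data subset."""
    pairs = []
    for p in subset:
        # Euclidean distance from this point to every current centroid.
        dists = np.linalg.norm(centroids - p, axis=1)
        pairs.append((int(np.argmin(dists)), p))
    return pairs

def centroid_reduce(label, cluster_points):
    """Reduce step: recompute one cluster's centroid from its assigned points."""
    return label, np.mean(cluster_points, axis=0)

# Tiny illustrative example with 2 centroids and one data subset.
centroids = np.array([[0.0, 0.0], [10.0, 10.0]])
subset = np.array([[0.1, -0.2], [9.8, 10.1]])
pairs = assign_map(subset, centroids)
label, new_centroid = centroid_reduce(0, [np.array([0.0, 0.0]),
                                          np.array([2.0, 2.0])])
```

In the rmr2 setting, the pairs emitted by the map are shuffled by key, so each reducer receives all points of a single cluster.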
Steps 1-2 above are repeated until convergence, so the algorithm becomes a chain of map-reduce jobs. The partitioned dataset is clustered into K=3 clusters, with the 3 initial centroids generated randomly. The next figures show the map-reduce steps, first for a single iteration and then for the algorithm as a whole.
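The chain of map-reduce jobs can be sketched as a simple driver loop. This is an in-memory Python simulation of the iteration logic, not the rmr2 job chain itself; the function name, `max_iter`, and `tol` are illustrative assumptions.

```python
import numpy as np
from collections import defaultdict

def kmeans_mapreduce(subsets, centroids, max_iter=10, tol=1e-6):
    """Repeat the map (assignment) and reduce (recomputation) steps until
    the centroids stop moving, simulating the chain of map-reduce jobs."""
    centroids = np.asarray(centroids, dtype=float)
    for _ in range(max_iter):
        # Map phase: one "map job" per data subset assigns each point
        # to its nearest centroid; pairs are grouped by cluster label.
        grouped = defaultdict(list)
        for subset in subsets:
            for p in subset:
                k = int(np.argmin(np.linalg.norm(centroids - p, axis=1)))
                grouped[k].append(p)
        # Reduce phase: recompute each cluster's centroid from its points.
        new_centroids = centroids.copy()
        for k, pts in grouped.items():
            new_centroids[k] = np.mean(pts, axis=0)
        # Convergence check: stop when no centroid moved more than tol.
        if np.max(np.abs(new_centroids - centroids)) < tol:
            return new_centroids
        centroids = new_centroids
    return centroids
```

Each pass of the loop corresponds to one map-reduce job in the chain, with the reducer output (the updated centroids) fed into the next iteration's map jobs.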
- The next animation shows the first 5 iterations of the map-reduce chain. In each iteration, the cluster assignment for every point in every data subset is obtained from the map job, and the updated cluster centroids are then obtained from the corresponding reduce job.