In this article, an *R-hadoop* (with *rmr2*) implementation of **Distributed KMeans Clustering** will be described with a 2-d dataset.

- First the dataset shown below is *horizontally partitioned* into 4 data subsets and these are copied from *local* to *HDFS*, as shown in the following animation. The dataset chosen is small enough; it's just for *POC* purposes.
- Next, the KMeans clustering algorithm is parallelized. The algorithm consists of 2 key steps:
  1. *Cluster Assignment*: each data point is assigned to the nearest cluster center. This step can be done for each data point independently, so it is designed as the *Map* function (there are *4 such map jobs* created, one per data subset), where the points in each of the 4 data subsets are assigned in parallel to the nearest cluster center. Once the nearest cluster *centroid* for a data point is computed, the map emits that data point with the assigned cluster label as the *key*.
  2. *Cluster Centroid (Re)Computation*: the *centroids* of the clusters are recomputed from the points belonging to each cluster. This is done in the *Reduce* function, where each cluster's data points arrive at the reducer as a collection of all the data points belonging to that cluster (corresponding to the key emitted by the map function).

The steps 1-2 above are repeated till *convergence*, so this becomes a chain of *map-reduce* jobs. The partitioned dataset is clustered into *K=3* clusters and the 3 *initial centroids* are randomly generated. The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm.

- The next animation shows the *first 5 iterations* of the *map-reduce chain*. In every iteration, the cluster assigned to each of the points in each of the data subsets is obtained from the map job, and then the updated cluster *centroids* are obtained from the corresponding reduce job in the same iteration.
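The iteration chain can be sketched as a driver loop that repeats the map and reduce steps until the centroids stop moving (again a plain-Python simulation under assumed names; the tolerance and the convergence test are illustrative choices, not taken from the article):

```python
from collections import defaultdict

def kmeans_chain(partitions, centroids, tol=1e-6, max_iter=100):
    """Chain of map-reduce iterations: repeat cluster assignment (map)
    and centroid re-computation (reduce) until the maximum squared
    centroid shift falls below `tol`."""
    for _ in range(max_iter):
        grouped = defaultdict(list)
        for subset in partitions:          # one map job per data subset
            for p in subset:
                d = [sum((a - b) ** 2 for a, b in zip(p, c))
                     for c in centroids]
                grouped[d.index(min(d))].append(p)
        # reduce: one averaged centroid per cluster-label key
        new_centroids = [[sum(x) / len(pts) for x in zip(*pts)]
                         for _, pts in sorted(grouped.items())]
        # convergence test over the clusters that received points
        shift = max(sum((a - b) ** 2 for a, b in zip(c, n))
                    for c, n in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids
```

In the rmr2 setting each pass of this loop corresponds to one submitted map-reduce job, which is why convergence produces a *chain* of jobs rather than a single one.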
