In this article, an *R-hadoop* (with *rmr2*) implementation of ** Distributed KMeans Clustering** will be described with a 2-d dataset.

- First the dataset shown below is **horizontally partitioned** into 4 data subsets and they are copied from *local* to *HDFS*, as shown in the following animation. The dataset chosen is small enough, since it's just for *POC* purposes.
- The partitioned dataset is to be clustered into *K=3* clusters and the **3 initial centroids** are randomly generated.
- Next, the **KMeans clustering** algorithm is *parallelized*. The algorithm consists of two key steps:
- *Cluster Assignment*: In this step, each data point is assigned to the *nearest cluster center*. This step can be carried out for each data point independently.
- This can be designed using the *Map* function (there are *4 such* **map jobs** created), where the points from each of the 4 data subsets are assigned in parallel to the nearest cluster center (each map job knows the coordinates of the initial cluster centroids).
- Once each data point is assigned to a **cluster centroid**, the map job *emits* each of the data points with the assigned **cluster label** as the *key*.
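Since the post's R/rmr2 code isn't shared here, the map step above can be sketched in plain Python (the function name `kmeans_map` and the sample points are illustrative, not from the post):

```python
import math

# Hypothetical sketch of the Map step: each map job receives one data
# subset plus the current centroids, and emits (cluster_label, point)
# pairs, using the label of the nearest centroid as the key.
def kmeans_map(points, centroids):
    for p in points:
        # assign the point to the nearest centroid (Euclidean distance)
        label = min(range(len(centroids)),
                    key=lambda k: math.dist(p, centroids[k]))
        yield label, p

# Example: one of the 4 data subsets, with K=3 centroids
subset = [(0.0, 0.0), (5.0, 5.1), (9.8, 10.0)]
centroids = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
print(list(kmeans_map(subset, centroids)))
# → [(0, (0.0, 0.0)), (1, (5.0, 5.1)), (2, (9.8, 10.0))]
```

In rmr2 the same logic would live inside the `map` argument of `mapreduce()`, with the centroids broadcast to every map job.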

- *Cluster Centroid (Re-)Computation*: In this step, the *centroids* for each of the **clusters** are **recomputed** from the points assigned to the cluster.
- This is done in the *Reduce* function, where each cluster's data points come to the **reducer** as a collection of all the data points assigned to the cluster (corresponding to the **key** emitted by the **map** function).
- The reducer **recomputes** the *centroid* of each **cluster**, corresponding to each key.
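The reduce step can likewise be sketched in Python (again, `kmeans_reduce` and the sample points are illustrative):

```python
# Hypothetical sketch of the Reduce step: for each cluster label (the
# key emitted by the map function), the reducer receives the collection
# of points assigned to that cluster and recomputes the centroid as
# their coordinate-wise mean.
def kmeans_reduce(label, points):
    n = len(points)
    centroid = tuple(sum(coords) / n for coords in zip(*points))
    return label, centroid

# Example: all points that the map jobs assigned to cluster 1
assigned = [(4.0, 5.0), (6.0, 5.0), (5.0, 6.5)]
print(kmeans_reduce(1, assigned))
# → (1, (5.0, 5.5))
```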


The **steps 1-2** above are **repeated till** *convergence*, so this becomes a **chain of map-reduce jobs**. The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm.
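Putting the two steps together, the chain of jobs can be sketched as a driver loop (a minimal single-process Python simulation; the names `nearest` and `kmeans_chain` are illustrative, and in rmr2 each iteration would instead launch a new `mapreduce()` job):

```python
import math

def nearest(p, centroids):
    # index of the centroid closest to point p
    return min(range(len(centroids)), key=lambda k: math.dist(p, centroids[k]))

def kmeans_chain(subsets, centroids, max_iter=10, tol=1e-6):
    # Repeat the map (assignment) and reduce (recomputation) steps,
    # one "job" per iteration, until the centroids stop moving.
    for _ in range(max_iter):
        # Map phase: one pass per data subset, grouping (label, point) pairs
        groups = {}
        for subset in subsets:
            for p in subset:
                groups.setdefault(nearest(p, centroids), []).append(p)
        # Reduce phase: recompute each cluster's centroid from its points
        new_centroids = list(centroids)
        for label, pts in groups.items():
            new_centroids[label] = tuple(sum(c) / len(pts) for c in zip(*pts))
        # Convergence check: stop when no centroid moved more than tol
        if all(math.dist(a, b) <= tol for a, b in zip(centroids, new_centroids)):
            break
        centroids = new_centroids
    return centroids

# Example: 4 data subsets and K=3 initial centroids
subsets = [[(0, 0), (0, 1)], [(1, 0), (10, 10)],
           [(10, 11), (11, 10)], [(5, 5), (5, 6)]]
init = [(0.0, 0.0), (5.0, 5.0), (10.0, 10.0)]
print(kmeans_chain(subsets, init))
```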

- The next animation shows the *first 5 iterations* of the *map-reduce chain*.
- In every iteration, the cluster labels assigned to each of the points in each of the data subsets are obtained from the corresponding *map job*.
- Then the **updated (recomputed) cluster centroids** for each of the clusters are obtained from the corresponding **reduce job**, in the same iteration.


Thanks for sharing! Could you also share your code? I'd love to reproduce it. And it might be nice to set default axes in future projects; I had to look at the first gif a couple of times before I saw how the different data points related to each other!


Thanks for the suggestion, will share once I get some time. However, it's quite straightforward to implement.
