Distributed K-Means with R-Hadoop

In this article, an R-hadoop (with rmr2) implementation of Distributed KMeans Clustering will be described with a 2-d dataset.

  1. First the dataset shown below is horizontally partitioned into 4 data subsets and they are copied from local to HDFS, as shown in the following animation. The dataset chosen is small enough and it’s just for the POC purpose.


  2. The partitioned dataset is to be clustered into K=3 clusters and the first 3 initial centroids are randomly generated.
  3. Next, the KMeans clustering algorithm is parallelized. The algorithm consists of two key steps:
    1. Cluster Assignment:
      • In this step, each data point is assigned to the nearest cluster center.
      • This step can be carried for each data point independently.
      • This can be designed using the Map function (there are 4 such map jobs created) where the points from each of the 4 data subsets are parallelly assigned to the nearest cluster center (each map job knows the coordinates of the initial cluster centroids created).
      • Once each data point is assigned to a cluster centroid, the map job emits each of the datapoints with the the assigned cluster label as the key.
    2. Cluster Centroid (Re-) Computation:
      • In this step, the centroids for each of the clusters are recomputed from the points assigned to the cluster.
      • This is done in the Reduce function, where each cluster’s data points come to the reducer as a collection of all the data points assigned to the cluster (corresponding to the key emitted by the map function).
      • The reducer recomputes the centroid of each cluster, corresponding to each key.

    The steps 1-2  above are repeated till convergence, so this becomes a chain of map-reduce jobs.

    The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm steps.


  4. The next animation shows the first 5 iterations of the map-reduce chain.
    • Every time the cluster-labels assigned to each of the points in each of the data subsets are obtained from the corresponding map job.
    • Then the updated  (recomputed) cluster centroids are obtained from the corresponding reduce job for each of the clusters, in the same iteration.



2 thoughts on “Distributed K-Means with R-Hadoop

  1. Thanks for sharing! Could you also share your code? I’d love to reproduce. And it might be nice to set default axis in future projects, I had to look at the first gif a couple of times before I saw how the different datapoints related to ea h other!


Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s