Distributed K-Means with R-Hadoop

In this article, an R-hadoop (with rmr2) implementation of Distributed KMeans Clustering will be described with a 2-d dataset.

  1. First the dataset shown below is horizontally partitioned into 4 data subsets and they are copied from local to HDFS, as shown in the following animation. The dataset chosen is small enough and it’s just for the POC purpose.

    part.gif

  2. Next, the KMeans clustering algorithm is parallelized. The algorithm consists of 2 key steps:
    1. Cluster Assignment: where each data point is assigned to the nearest cluster center. This step can be done for each data point independently. This can be designed using the Map function (there are 4 such map jobs created) where the points from each of the points in each of the 4 data subsets will be parallelly assigned to the nearest cluster center. Once the cluster centroid for each data point is computed the map will emit each of the datapoints with the the assigned cluster label as the key.
    2. Cluster Centroid (Re) Computation: where the centroids for each of the clusters will be recomputed from the points belonging to the cluster. This is done in the Reduce function, where each cluster’s data points come to the reducer as a collection of all the data points belong to the cluster (corresponding to the key emitted by the map function).

    The steps 1-2  above are repeated till convergence, so this becomes a chain of map-reduce jobs. The partitioned dataset is clustered into K=3 clusters and the first 3 initial centroids are randomly generated. The next figures show the map-reduce steps, first for a single iteration and then for the entire algorithm steps.

    algo

    algo2

  3. The next animation shows the first 5 iterations of the map-reduce chain. Every time the clusters assigned to each of the points in each of the data subsets are obtained from the map job and then the updated cluster centroids are obtained from the corresponding reduce job in the same iteration.

    dkmeans.GIF

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s