# Computing Closest Pairs and implementing Clustering methods for 2D datasets in Python

The following problem appeared as a project in the coursera course Algorithmic Thinking (by RICE university), a part of Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.

In this article the implementations of two methods for computing closest pairs and two methods for clustering data will be described. Then these two clustering methods will be compared in terms of efficiency, automation, and quality. The following figures show the algorithms to be used.

While many clusterings of the points in P exist, a desired property is that the partition results in clusters with higher similarity of points within a cluster than of points between clusters. Algorithms HierarchicalClustering and KMeansClustering below are two heuristics for generating clustering with this desired property. In both algorithms, we will use k to denote the number of clusters.

### Provided data

As part of the analysis, the clustering methods will be applied to several sets of 2D data that include information about lifetime cancer risk from air toxics. First few rows from the raw version of this data is shown in the following table. Each entry in the data set corresponds to a county in the USA (identified by a unique 5 digit string called a FIPS county code) and includes information on the total population of the county and its per-capita lifetime cancer risk due to air toxics.

1. Let’s implement a function that creates a list of clusters where each cluster in this list corresponds to one randomly generated point in the square with corners [±1,±1][±1,±1]. Use this function to compute the running times of the functions naive slow_closest_pairs and divide and conquer based fast_closest_pair for lists of clusters of size 2 to 200.Theoretically, the running times of the functions slow_closest_pairs and fast_closest_pair should be O(n^2) (as can be seen from the next figure, the timing curve appears to be growing quadratically) and O(nlogn) (the timing curve shown looks almost linear with a very slight upward bend), respectively.

2.  Next let’s create an image of the 15 clusters generated by applying hierarchical clustering to the 3108 county cancer risk dataset. This dataset contains 3108 rows. The table below shows a few rows from the dataset. Each entry in the data set corresponds to a county in the USA (identified by a unique 5 digit string called a FIPS county code) and includes information on the total population of the county and its per-capita lifetime cancer risk due to air toxics.The next figure shows the clustering results.

x y population risk
56005 370.156754 161.446026 33698 0.000018
56009 367.849050 192.164880 12052 0.000014
56027 385.495031 190.747084 2407 0.000011
56031 375.184848 211.478461 8807 0.000014
11001 867.470401 260.460974 572059 0.000077
Loaded 3108 data points
Displaying 15 hierarchical clusters

3.  Next let’s create an image of the 15 clusters generated by applying 5 iterations of k-means clustering to the county cancer risk data set. The next figure shows the clustering results.

Loaded 3108 data points
Displaying 15 k-means clusters

4. Now let answer the following questions: Which clustering method is faster when the number of output clusters is either a small fixed number or a small fraction of the number of input clusters? Provide a short explanation in terms of the asymptotic running times of both methods. You should assume that hierarchical_clustering uses fast_closest_pair and that k-means clustering always uses a small fixed number of iterations.

kmeans clustering will be much faster.

Time complexity of hierarchical_clustering = O(n^2+n.h(n)where h(n)= time complexity of fast_closest_pair = O(nlogn).

Hence, Time complexity of hierarchical_clustering = O(n2+n2logn)=O(n2logn), where number of clusters k is small and fixed.

Time complexity of k_means_clustering = O(qkn), where number of iterations and number of clusters k are small and fixed.

Conclusion: One important reason that hierarchical clustering is slower than k-means clustering is that each call to fast_closest_pair throws away most of the information used in computing the closest pair. Smarter implementations of hierarchical clustering can substantially improve their performance by using an implementation of dynamic closest pairs (http://en.wikipedia.org/wiki/Closest_pair_of_points_problem#Dynamic_closestpair_ problem). Dynamic methods build an initial data structure that can be used to accelerate a sequence of insertions and deletions from the set of points. Using dynamic closest pairs leads to implementations of hierarchical clustering that are more competitive with k-means clustering.

5. Now let’s’ create an image of the 9 clusters generated by applying hierarchical clustering to the 111 county cancer risk dataset. This dataset contains 111 rows. The next table shows a few rows from the dataset. The next figure shows the clustering results.

x y population risk
53011 104.000465 74.018253 345238 0.000064
53033 125.274860 39.149773 1737034 0.000058
55079 664.855001 192.484141 940164 0.000074
54009 799.221538 240.153315 25447 0.000077
11001 867.470401 260.460974 572059 0.000077
Loaded 111 data points
Displaying 9 hierarchical clusters

6. Now let’s create an image of the 9 clusters generated by applying 5 iterations of k-means clustering to the 111 county cancer risk dataset. The initial clusters should correspond to the counties with the largest populations. The next figure shows the clustering results.

Loaded 111 data points
Displaying 9 k-means clusters

The clusterings that are computed and shown in the last few figures illustrate that not all clusterings are equal. In particular, some clusterings are better than others.

One way to make this concept more precise is to formulate a mathematical measure of the error associated with a cluster. Given a cluster C, its error is the sum of the squares of the distances from each county in the cluster to the cluster’s center, weighted by each county’s population. If p_i is the position of the county and w_i is its population, the cluster’s error is: error(C)=∑_{p_iC} w_i.(d(p_i,c))^2, where c the center of the cluster C.

Given a list of clusters L, the distortion of the clustering is the sum of the errors associated with its clusters: distortion(L)=∑_{CL} error(C).

7. Let’s implement a function compute_distortion(cluster_list) that takes a list of clusters and uses cluster_error to compute its distortion. Now, use compute_distortion to compute the distortions of the two clusterings in the previous questions. Enter the values for the distortions (with at least four significant digits) for these two clusterings, clearly indicate the clusterings to which each value corresponds.

Loaded 111 data points
Displaying Distortion for 9 hierarchical clusters: 1.75163886916e+11
Displaying Distortion for 9 k-means clusters: 2.71254226924e+11


8. Now, let’s examine the clusterings generated in the previous figures. In particular, let’s focus on the number and shape of the clusters located on the west coast of the USA. Describe the difference between the shapes of the clusters produced by these two methods on the west coast of the USA. What caused one method to produce a clustering with a much higher distortion? We should consider how k-means clustering generates its initial clustering in this case.

As can be seen from the previous figures, both the clustering techniques produce 3 clusters on the west coast. However, the clusters created using Hierarchical clustering have much smaller intra-cluster distances (counties in the same cluster are geographically nearer to one another) than those created with K-means clustering, in the latter case the counties far apart were grouped in the same cluster, thereby causing high distortion. This is caused due to the initial choice of the cluster centroids of the kmeans clustering, which chooses the counties with higher populations as initial cluster centroids, that causes counties in geographical proximity to fall in different clusters from the very beginning (e.g., the black and the green cluster correspond to different cluster centroids chosen in the first step, so they can’t be merged), unlike hierarchical clustering, which does nto have this limitation.

9. Conclusion: In general, k-means clustering requires a higher level of human supervision since the quality of the output clustering is sensitive to the initial choice of cluster centers. k-means clustering also requires that the size of the output clustering be known beforehand. For hierarchical clustering, the distortion of the clustering can be monitored as closest pairs of clusters are merged with almost no extra cost.

10. Let’s compute the distortion of the list of clusters produced by hierarchical clustering and k-means clustering (using 5 iterations) on the 111, 290 and 896 county data sets, respectively, where the number of output clusters ranges from 6 to 20 (inclusive).

As can be seen from the above figures, none of the methods consistently produce lower distortion clusterings for larger datasets when the number of output clusters is in the range 6 to 20. Hierarchical Clustering method is superior to k-means clustering when considered from cluster distortion perspective for relatively smaller datasets, but it’s slower than k-means most of the times.

Conclusion: On these data sets, neither method dominates in all three areas: efficiency, automation, and quality. In terms of efficiency, k-means clustering is preferable to hierarchical clustering as long as the desired number of output clusters is known beforehand. However, in terms of automation, k-means clustering suffers from the drawback that a reliable method for determining the initial cluster centers needs to be available. Finally, in terms of quality, neither method produces clusterings with consistently lower distortion on larger data sets