The following problem appeared as a project in the Coursera course **Algorithmic Thinking** (by Rice University), part of the *Fundamentals of Computing* specialization. The description of the problem below is taken directly from the project description.

This article describes implementations of two methods for computing the *closest pair* of points and two methods for *clustering data*, and then compares the two clustering methods in terms of *efficiency*, *automation*, and *quality*. The following figures show the algorithms to be used.

While many clusterings of the points in *P* exist, a desired property is that the partition yields clusters in which points are more similar to one another within a cluster than to points in other clusters. The algorithms **HierarchicalClustering** and **KMeansClustering** below are two heuristics for generating clusterings with this property. In both algorithms, **k** denotes the *number of clusters*.
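To make the **KMeansClustering** heuristic concrete, here is a minimal sketch on plain `(x, y)` tuples. It is a simplification of the course's `Cluster`-object version: centers are seeded from the first *k* points (the course seeds from the *k* most populous counties) and refined for a fixed number of iterations rather than to convergence.

```python
import math

def kmeans_clustering(points, k, num_iterations=5):
    """Sketch of KMeansClustering on a list of (x, y) tuples.

    Hypothetical simplification: seed centers from the first k points
    and run a fixed number of assign/recompute iterations.
    """
    centers = list(points[:k])
    clusters = [[] for _ in range(k)]
    for _ in range(num_iterations):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for px, py in points:
            idx = min(range(k),
                      key=lambda i: math.hypot(px - centers[i][0],
                                               py - centers[i][1]))
            clusters[idx].append((px, py))
        # Recompute each center as the centroid of its cluster.
        for i, cluster in enumerate(clusters):
            if cluster:
                centers[i] = (sum(p[0] for p in cluster) / len(cluster),
                              sum(p[1] for p in cluster) / len(cluster))
    return clusters
```

Note that each iteration costs O(nk), so the whole heuristic is O(nkq) for q iterations, which is why k-means tends to be faster than hierarchical clustering on large inputs.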

### Provided data

As part of the analysis, the clustering methods will be applied to several sets of *2D data* that include information about *lifetime cancer risk* from *air toxics*. The first few rows of the raw version of this data are shown in the table below. Each entry in the data set corresponds to a county in the USA (identified by a unique five-digit string called a FIPS county code) and includes the county's total population and its per-capita lifetime cancer risk due to air toxics.

1. Let’s implement a function that creates a list of clusters, where each cluster corresponds to one point generated uniformly at random in the square with corners at [±1, ±1]. Using this function, let’s measure the running times of the *naive slow_closest_pairs* and the *divide-and-conquer-based fast_closest_pair* functions for lists of clusters of size 2 to 200. Theoretically, the running times of *slow_closest_pairs* and *fast_closest_pair* should be O(n^2) and O(n log n), respectively, and the measurements agree: the timing curve for *slow_closest_pairs* in the next figure appears to grow quadratically, while the curve for *fast_closest_pair* looks almost linear with a very slight upward bend.
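The O(n^2) behavior of the naive method follows directly from its structure: it examines every pair of points. A minimal sketch of that brute-force approach (on plain `(x, y)` tuples rather than the course's `Cluster` objects, and returning a single best pair rather than the full set of ties) might look like this:

```python
import itertools
import math

def slow_closest_pair(points):
    """Brute-force closest pair over a list of (x, y) tuples.

    Checks all n*(n-1)/2 pairs, hence O(n^2). Returns a tuple
    (distance, i, j) giving the smallest pairwise distance and the
    indices of the two points achieving it.
    """
    best = (float("inf"), -1, -1)
    for (i, p), (j, q) in itertools.combinations(enumerate(points), 2):
        dist = math.hypot(p[0] - q[0], p[1] - q[1])
        best = min(best, (dist, i, j))
    return best
```

The divide-and-conquer *fast_closest_pair* instead sorts the points, solves each half recursively, and only inspects a narrow vertical strip around the dividing line, which is where the O(n log n) bound comes from.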

2. Next, let’s create an image of the **15** clusters generated by applying **hierarchical clustering** to the *3108-county cancer-risk* dataset. Each of the dataset's **3108** rows corresponds to a county in the USA (identified by a unique five-digit string called a *FIPS county code*) and includes the county's *total population* and its *per-capita lifetime cancer risk* due to air toxics; the table below shows a few rows. The next figure shows the clustering results.

FIPS | x | y | population | risk
---|---|---|---|---
56005 | 370.156754 | 161.446026 | 33698 | 0.000018
56009 | 367.849050 | 192.164880 | 12052 | 0.000014
56027 | 385.495031 | 190.747084 | 2407 | 0.000011
56031 | 375.184848 | 211.478461 | 8807 | 0.000014
11001 | 867.470401 | 260.460974 | 572059 | 0.000077
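The hierarchical approach can be illustrated on just the rows shown above. The sketch below starts with each county as its own cluster and repeatedly merges the pair of clusters whose centroids are closest until *k* clusters remain. It is a simplification of the course's **HierarchicalClustering**: it uses an O(n^2) closest-pair scan for brevity (the course version uses *fast_closest_pair*) and plain tuples instead of `Cluster` objects.

```python
import math

# The five rows shown above, as (fips, x, y, population, risk) tuples.
ROWS = [
    ("56005", 370.156754, 161.446026, 33698, 0.000018),
    ("56009", 367.849050, 192.164880, 12052, 0.000014),
    ("56027", 385.495031, 190.747084, 2407, 0.000011),
    ("56031", 375.184848, 211.478461, 8807, 0.000014),
    ("11001", 867.470401, 260.460974, 572059, 0.000077),
]

def hierarchical_clustering(points, k):
    """Sketch of HierarchicalClustering on (x, y) tuples: start with
    singleton clusters and merge the closest pair of centroids until
    only k clusters remain."""
    clusters = [[p] for p in points]

    def centroid(cluster):
        return (sum(p[0] for p in cluster) / len(cluster),
                sum(p[1] for p in cluster) / len(cluster))

    while len(clusters) > k:
        # O(n^2) scan for the closest pair of cluster centroids.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                ci, cj = centroid(clusters[i]), centroid(clusters[j])
                dist = math.hypot(ci[0] - cj[0], ci[1] - cj[1])
                if best is None or dist < best[0]:
                    best = (dist, i, j)
        _, i, j = best
        # Merge cluster j into cluster i.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

On these five rows, the four Wyoming counties (FIPS 56xxx) sit close together in the plane while the District of Columbia entry (11001) lies far to the right, so clustering with k = 2 separates it into its own cluster.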