# Clustering Algorithms

CPS230 Project, Fall 2010
Instructor: Kamesh Munagala

(Designed with input from Kshipra Bhawalkar and Sudipto Guha)

In this project, we will explore different algorithms to cluster data items. Clustering is the process of automatically detect items that are similar to one another, and group them together. Since the notion of a group is fuzzy, there are various algorithms for clustering that differ in their measure of quality of a clustering, and in their running time. The goal of this project is to implement some of these algorithms and compare their performance on different data sets, both in terms of running time, and solution quality.

## Part I: Agglomerative Clustering

In agglomorative clustering, each data point is initially placed in a cluster by itself. At each step, two clusters with the highest similarity score are merged. The output of agglomerative clustering is a tree, where the leaves are the data items. At any intermediate step, the clusters so far are different trees. When two clusters are merged by the algorithm, the respective trees are merged by making the roots of these trees the left and right children of a new node (so that the total number of trees decreases by 1). In the end, the process yields a single tree. See here for an example.

This algorithm closely resembles Kruskal's algorithm for Minimum Spanning Tree construction. In Kruskal's algorithm, at each step the smallest edge connecting two components is added till there is only one component left. Each new edge added merges two components.  This can be seen as clustering where the similarity score between two components is the negative the length of the smallest edge between any two points in the cluster.

Many different clustering algorithms can be obtained by varying the similarity measures between clusters. Some examples of similarity measures are the following. Here, A and B are two clusters, and sim(A,B) is the similarity measure between them. The similarity measure sim(u,v) between input data items u and v is specified as input. Typically, the similarity is a number between 0 and 1, with larger values implying greater similarity.

Centroid similarity:

Ward's similarity:

Input: The input is a collection of 50 artists, specified in artists50.in. The similarity between each pair of artists is specified in the file dataAm50.in. Of these 50 artists, a collection of 25 are chosen, and the respective artists and similarity measures are in the files artists25.in and dataAm25.in.  These data sets have been generated using the similar artists API here.

Output: Output one line per cluster, which contains the ids of the points belonging to that cluster. See below for an example.

Experiments: Code up agglomerative clustering with each of the above similarity measures as different subroutines. Run these algorithms (5 of them) on the two data sets and output the clustering tree. Rate the algorithms in terms of running time and quality of solutions.  Compare the robustness of the different similarity measures by checking how similar the clusterings are for the 25 items across the two data sets. In the first data set, the 25 items are padded with 25 outliers, which can possibly change the clustering of the 25 items themselves. A similarity measure is robust if the clustering of these items does not change by much. How will you quantitatively measure the robustness of the clustering?

For assessing quality, you can use any available information about the artists. What would be a good quantitative measure for quality of the clustering? For your reference, here is a possible clustering of the 25 artists that was obtained using an entirely different clustering algorithm:

Cluster 1: 2 3 4 10 11 12 13 14 17 18 22
Cluster 2: 15 19 20 21 23 24
Cluster 3: 5 6 7
Cluster 4: 0 1 8 9 16

This is the same clustering as above but using the indices for 50 artists

Cluster 1: 18 25 5 8 20 23 44 10 31 19 34
Cluster 2: 49 26 37 38 48 36
Cluster 3: 12 13 14
Cluster 4: 16 4 3 15 29

In addition to the code, you need to produce a document explaining what you did. You will be graded on two dimensions.
• Efficiency: What is the main bottleneck in the running time? Can you devise data structures or heuristics to avoid these bottlenecks? How do the heuristics affect the quality of the solution produced?
• Experiments and Discussion: How do the different clustering measures compare in terms of quality of solutions and robustness? What kinds of data are desirable for each similarity measure? What are the drawbacks of each similarity measure?

### Part II: Center-based Clustering

The agglomerative clustering method discussed above constructs a tree of clusters, where the leaves are the data items. In center-based clustering, the items are endowed with a distance function instead of a similarity function, so that the more similar two items are, the shorter their distance is. The output of the clustering algorithm is K centers (which are quite often data items themselves). Each center serves as the representative of a cluster. Each data item is placed in the cluster whose corresponding center it is closest to. The goal is to find those centers such that each data item travels as little distance as possible to reach its center: In some sense, this would correspond to each cluster being as tight and closely-knit as possible around the corresponding center.

Formally, the K-medians problem is defined as follows: let X denote the collection of data items, so that |X| = n. For data items u,v, let d(u,v) denote the distance between u and v. The goal of K-medians clustering is to find a subset C of X, so that |C| = K, and the following function is optimized:

Basically, for any collection C of centers, each data point u is assigned to the closest center. The objective sums this distance over all u, and the goal is to find that collection C for which this objective is minimized.

Since solving this problem requires searching over all possible choices of K centers, it is NP-Complete. This is not a formal proof - try reducing it from the set cover problem to obtain a formal proof. In view of the difficulty in finding the "optimal" clustering, the goal is to design heuristics that work reasonably fast and output a clustering of reasonable quality.  We list below several approaches to solving this problem. Your goal is to code these approaches up, and compare their performance in terms of running time and solution quality on several data sets.

### K-Means Algorithm:

1. Choose K centers at random from X. Call this set C
2. Map each point in X to its closest center in C. This yields a set of clusters, one for each center.
3. For each cluster so obtained, compute the 1-median. This is that point in the cluster for which the sum of the distances to the remaining points in the cluster is as small as possible. This yields one 1-median for each of the K clusters. These are the new centers; call this set C.
4. Repeat Step (2) using the new set C.
5. In each iteration, keep track of the objective function for the current choice of C. Stop when the objective does not increase by much from one iteration of Steps (2-4) to the next.   (Why does K-Means eventually terminate?)
6. Repeat for a few random choices in Step (1), and take the best solution so obtained.  (Why is this step important?)

### Local Search Algorithm:

This algorithm is somewhat different from K-means. An analysis of this algorithm is presented in this paper. Let Cost(C) = denote the objective function when the set of centers is C.
1. Initially, C is chosen as in Step (1) above.
2. For each u in X and c in C, compute Cost(C - {c} + {u}). Find (u,c) for which this cost is smallest, and replace C with C - {c} + {u}.
3. Stop when Cost(C) does not change significantly after one iteration of Step (2).
How can you speed up the swap operation by avoiding searching over all (c,u) pairs? Is this scheme superior in practice to simply repeating K-Means with different random starting points? What about in theory?

### Online Algorithm:

The following is based on a clever algorithm due to Adam Meyerson and is presented in Section 2 of this paper. The algorithm assigns a cost f for each center, and finds a set of centers C that optimizes + f |C|. It then tries out different values of f till it finds one that opens around K centers.
1. Start with f  being very small (close to zero), and let Q = 1.1  (or any other number slightly bigger than 1).
2. Order the points randomly.
3. The set C is initially empty.
4. Consider the points in the random order constructed in Step (2).
5. If the distance from the current point to the closest center is D, then with probability min(D/f, 1) add this point to the set C.
6. When all points have been considered, if |C| > K, then increase f by a factor of Q and repeat Steps (3-6); else output C and the associated clustering and STOP.
What heuristic can you implement after Step (6) to improve the quality of the solution produced? What is the running time of the above algorithm? How can you optimize the performance by avoiding repeating all computations for every value of f?

### Greedy Clustering:

Consider the following clustering algorithm:
1. Start with an arbitrary point (maybe randomly chosen) and include it in C.
2. Repeat K-1 times: Pick that data point for which the closest point in C is as far away as possible, and include this point in C.
What performance metric is this algorithm designed to optimize? Note that it does not optimize for the K-medians objective, but something slightly different. What is the running time of the algorithm, and how do the solutions produced compare with the previous algorithms?

### Faster Implementations:

One can obtain faster implementations of all the above algorithms by randomly sampling a smaller "representative" set of points and clustering just these. An algorithm based on sampling and clustering is presented in Section 4 of this paper. After Step (6), you can run a K-medians algorithm on the centers to obtain exactly K centers. Run this algorithm using the above algorithms as a subroutine. What is the degradation in performance? What is the gain in running time?

Input: Use the code here to generate point sets with around 1000 to 10,000 points, which are mixtures of K Gaussians in 2 dimensions. (Remove the #include "malloc.h" from the source code if you get a compilation error.) Choose K = 3 or 4. As a standard input format for the clustering procedure, the first line of input should have the number of points, and then, each new line should be the x and y coordinate (separated by spaces) of a new point.

Output: The output should be the same format as the input, but each line should be pre-pended with the cluster number (use numbers 1,2,3,...) to which that point belongs. You can visualize the data, as well as the clusters using either gnuplot, or any other software (Matlab, Mathematica) you are comfortable with. No points will be given for developing fancy visual interfaces.

Experiments: Compare the performance of K-medians and agglomerative clustering methods as a function of:
1. How elliptical the Gaussians are. This can be controlled by making the covariance term much larger than the variance terms. K-medians works best when the Gaussians are spherical, so that the covariance and variance terms are comparable.
2. The separation between the means of the Gaussians. If the means are close by, the Gaussians overlap and this confuses most clustering algorithms.
3. The difference between the weights of the clusters. If there is a cluster with very small weight, many algorithms have trouble discerning it as a separate cluster.
Things to Ponder: What is the trade-off between the running time and performance of the various implementations of K-medians? How does the faster implementation scale compared to the rest as the data set size increases? How can the different implementations be combined to improve performance? Read and understand the theoretical analyses/properties of K-means, local search, the online algorithm, the greedy algorithm, and the fast implementation, and summarize these in your own words. Why do K-Means and local search procedures eventually terminate? When is K-medians superior to agglomerative methods, and vice versa? How can we choose the "correct" value of K in K-median clustering methods?

As in part I, you will need to produce the code and a document detailing what you did. You will be graded on how you addressed efficiency issues, and on the quality of experiments and discussion.

## Part III: Mixture Models

Until now, we assumed no information about what the true underlying clustering was. Suppose instead someone told us that the data was generated by a mixture of Gaussians (as in the code used to generate the input above). Can this information be used to devise a better clustering algorithm? The answer is the EM algorithm, which generalizes the K-Means algorithm described above. Instead of starting with a crude guess of the centers and refining these iteratively, the EM algorithm works as follows (descriptions here, here, and here):
1. Start with any estimate of the mean and covariance matrices of each of the K Gaussians.
2. Computes (by Bayes rule) the probability of each data point being generated by each of the K Gaussians.
3. Use these probabilities to assign each data point probabilistically to one of K clusters.
4. Compute the sample mean and sample covariance matrices of each of these clusters, and use these new estimates to repeat Steps (2-4).
5. What termination criterion must one use here?
Note that K-Means is a special case of EM, where we assume the Gaussians are spherical with the same covariance matrix, and deterministically assign the data point to the most likely Gaussian in Step (3). Therefore, K-Means works best when the Gaussians are identical in shape and spherical, but not very well for elliptical non-isomorphic Gaussians. As with K-Means, this procedure eventually terminates. (Try to understand why!)

Input/Output: The input and output formats should be the same as Part II.

Experiments: Compare the performance of the EM algorithm on the same data sets as were used in Part II. How will you quantitatively measure the quality of the clustering? Note that you should expect the EM algorithm to work better, since it is assuming the data is drawn from a Gaussian mixture. Construct a data set (clearly one that is not generated by a Gaussian mixture model) where the EM algorithm does not work as well as agglomerative clustering methods.

The criteria for grading is the same as for Parts I and II.

## Part IV: MapReduce Implementations

This part is optional, more open-ended, and for the adventurous. Suppose you have a large amount of data that you need to cluster, and have a cloud computing system available. What would be a good clustering algorithm to implement; how would you implement it; and what would be the efficiency considerations? This paper has a model for MapReduce, and this paper has a possible algorithm for K-Medians that works with MapReduce. Can you implement EM or agglomerative clustering efficiently on a MapReduce system?  I do not need the implementation; just a description of the algorithm and an analysis of running time and solution quality will do.