In this project, we will explore different algorithms to cluster data items. Clustering is the process of automatically detecting items that are similar to one another and grouping them together. Since the notion of a group is fuzzy, there are various clustering algorithms that differ in their measure of the quality of a clustering and in their running time. The goal of this project is to implement some of these algorithms and compare their performance on different data sets, both in terms of running time and solution quality.

This algorithm closely resembles Kruskal's algorithm for Minimum Spanning Tree construction. In Kruskal's algorithm, at each step the smallest edge connecting two components is added, until only one component is left. Each new edge added merges two components. This can be seen as clustering where the similarity score between two components is the negative of the length of the shortest edge connecting a point in one component to a point in the other.
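To make the Kruskal analogy concrete, here is a minimal single-linkage sketch in C++ (one of the allowed languages) using a union-find structure, exactly as in Kruskal's algorithm. The function and type names are our own, and stopping at k components is one possible criterion; your implementation may instead record the full clustering tree:

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Disjoint-set (union-find), as used in Kruskal's algorithm.
struct DSU {
    std::vector<int> parent;
    explicit DSU(int n) : parent(n) { std::iota(parent.begin(), parent.end(), 0); }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    bool unite(int a, int b) {
        a = find(a); b = find(b);
        if (a == b) return false;
        parent[b] = a;
        return true;
    }
};

// Single-linkage clustering: repeatedly merge the two components joined by
// the most similar cross-component pair, until k components remain.
// sim[i][j] is the pairwise similarity between items i and j.
std::vector<int> singleLinkage(const std::vector<std::vector<double>>& sim, int k) {
    int n = sim.size();
    struct Edge { double s; int u, v; };
    std::vector<Edge> edges;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j)
            edges.push_back({sim[i][j], i, j});
    // Highest similarity first (Kruskal sorts by smallest edge length;
    // similarity is the negative of distance, so the order is reversed).
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.s > b.s; });
    DSU dsu(n);
    int components = n;
    for (const Edge& e : edges) {
        if (components == k) break;
        if (dsu.unite(e.u, e.v)) --components;
    }
    std::vector<int> label(n);
    for (int i = 0; i < n; ++i) label[i] = dsu.find(i);
    return label;
}
```

Points in the same cluster end up with the same label (the representative of their component).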

Many different clustering algorithms can be obtained by varying the similarity measures between clusters. Some examples of similarity measures are the following. Here, A and B are two clusters, and sim(A,B) is the similarity measure between them. The similarity measure sim(u,v) between input data items u and v is specified as input. Typically, the similarity is a number between 0 and 1, with larger values implying greater similarity.

Single-linkage (Least similarity): sim(A,B) = min over u in A, v in B of sim(u,v)

Complete-linkage (Maximum similarity): sim(A,B) = max over u in A, v in B of sim(u,v)

Average-linkage: sim(A,B) = (1 / (|A| |B|)) * sum over u in A, v in B of sim(u,v)

Centroid similarity: sim(A,B) = sim(c_A, c_B), where c_A and c_B are the centroids (coordinate-wise means) of A and B

Ward's similarity: the negative of the increase in total within-cluster variance that would result from merging A and B

Input: The input is a collection of 50 artists, specified in artists50.in. The similarity between each pair of artists is specified in the file dataAm50.in. Of these 50 artists, a collection of 25 are chosen, and the respective artists and similarity measures are in the files artists25.in and dataAm25.in. These data sets have been generated using the similar artists API here.

Output: Output one line per cluster, which contains the ids of the points belonging to that cluster. See below for an example.

Experiments: Code up agglomerative clustering with each of the above similarity measures as different subroutines. Run these algorithms (5 of them) on the two data sets and output the clustering tree. Rate the algorithms in terms of running time and quality of solutions. Compare the robustness of the different similarity measures by checking how similar the clusterings are for the 25 items across the two data sets. In the first data set, the 25 items are padded with 25 outliers, which can possibly change the clustering of the 25 items themselves. A similarity measure is robust if the clustering of these items does not change by much. How will you quantitatively measure the robustness of the clustering?
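One possible quantitative handle on robustness (a suggestion on our part, not something the assignment prescribes) is the Rand index: the fraction of item pairs on which two clusterings agree, i.e., both clusterings put the pair in the same cluster, or both put it in different clusters. A small C++ sketch:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Rand index between two clusterings of the same n items: the fraction of
// item pairs on which the two clusterings agree (both place the pair in the
// same cluster, or both place it in different clusters). 1.0 = identical.
double randIndex(const std::vector<int>& a, const std::vector<int>& b) {
    int n = a.size();
    long long agree = 0, pairs = 0;
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            ++pairs;
            bool sameA = (a[i] == a[j]);
            bool sameB = (b[i] == b[j]);
            if (sameA == sameB) ++agree;
        }
    return pairs ? static_cast<double>(agree) / pairs : 1.0;
}
```

Here one could compare, for each similarity measure, the clustering of the 25 items obtained from the 25-artist data set against their clustering within the 50-artist data set.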

For assessing quality, you can use any available information about the artists. What would be a good quantitative measure for quality of the clustering? For your reference, here is a possible clustering of the 25 artists that was obtained using an entirely different clustering algorithm:

Cluster 1: 2 3 4 10 11 12 13 14 17 18 22

Cluster 2: 15 19 20 21 23 24

Cluster 3: 5 6 7

Cluster 4: 0 1 8 9 16

This is the same clustering as above, but using the indices for the 50 artists:

Cluster 1: 18 25 5 8 20 23 44 10 31 19 34

Cluster 2: 49 26 37 38 48 36

Cluster 3: 12 13 14

Cluster 4: 16 4 3 15 29

In addition to the code, you need to produce a document explaining what you did. You will be graded on two dimensions.

- Efficiency: What is the main bottleneck in the running time? Can you devise data structures or heuristics to avoid these bottlenecks? How do the heuristics affect the quality of the solution produced?
- Experiments and Discussion: How do the different clustering measures compare in terms of quality of solutions and robustness? What kinds of data are desirable for each similarity measure? What are the drawbacks of each similarity measure?

Formally, the K-medians problem is defined as follows: let X denote the collection of data items, so that |X| = n. For data items u,v, let d(u,v) denote the distance between u and v. The goal of K-medians clustering is to find a subset C of X, so that |C| = K, and the following function is minimized: Cost(C) = sum over u in X of min over c in C of d(u,c).

Basically, for any collection C of centers, each data point u is assigned to the closest center. The objective sums this distance over all u, and the goal is to find that collection C for which this objective is minimized.

Since solving this problem requires searching over all possible choices of K centers, it is NP-hard. (This is not a formal proof; try a reduction from the set cover problem to obtain one.) In view of the difficulty of finding the "optimal" clustering, the goal is to design heuristics that run reasonably fast and output a clustering of reasonable quality. We list below several approaches to solving this problem. Your goal is to code these approaches up and compare their performance in terms of running time and solution quality on several data sets.

- Choose K centers at random from X. Call this set C.
- Map each point in X to its closest center in C. This yields a set of clusters, one for each center.
- For each cluster so obtained, compute the 1-median. This is that point in the cluster for which the sum of the distances to the remaining points in the cluster is as small as possible. This yields one 1-median for each of the K clusters. These are the new centers; call this set C.
- Repeat Step (2) using the new set C.
- In each iteration, keep track of the objective function for the current choice of C. Stop when the objective does not decrease by much from one iteration of Steps (2-4) to the next. (Why does K-Means eventually terminate?)
- Repeat for a few random choices in Step (1), and take the best solution so obtained. (Why is this step important?)
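The steps above can be sketched as follows in C++ (function names are ours; this naive version assumes a precomputed distance matrix and k <= n, and leaves the random restarts of Step (6) to the caller):

```cpp
#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <limits>
#include <vector>

// Objective: for each point, the distance to its closest center, summed.
double cost(const std::vector<std::vector<double>>& d, const std::vector<int>& centers) {
    double total = 0;
    for (size_t u = 0; u < d.size(); ++u) {
        double best = std::numeric_limits<double>::max();
        for (int c : centers) best = std::min(best, d[u][c]);
        total += best;
    }
    return total;
}

// Steps (1)-(5) above, on a precomputed distance matrix d.
std::vector<int> kMedians(const std::vector<std::vector<double>>& d, int k) {
    int n = d.size();
    std::vector<int> centers;                        // Step (1): k distinct random centers
    while ((int)centers.size() < k) {
        int c = rand() % n;
        if (std::find(centers.begin(), centers.end(), c) == centers.end())
            centers.push_back(c);
    }
    double prev = std::numeric_limits<double>::max();
    while (true) {
        std::vector<int> assign(n);                  // Step (2): map to closest center
        for (int u = 0; u < n; ++u) {
            double best = std::numeric_limits<double>::max();
            for (int i = 0; i < k; ++i)
                if (d[u][centers[i]] < best) { best = d[u][centers[i]]; assign[u] = i; }
        }
        for (int i = 0; i < k; ++i) {                // Step (3): 1-median of each cluster
            double best = std::numeric_limits<double>::max();
            for (int c = 0; c < n; ++c) {
                if (assign[c] != i) continue;
                double s = 0;
                for (int u = 0; u < n; ++u)
                    if (assign[u] == i) s += d[c][u];
                if (s < best) { best = s; centers[i] = c; }
            }
        }
        double cur = cost(d, centers);               // Step (5): stop when no real progress
        if (cur >= prev - 1e-9) break;
        prev = cur;
    }
    return centers;
}
```

The 1-median computation in Step (3) is the obvious running-time bottleneck; it is quadratic in the cluster size.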

This algorithm is somewhat different from K-means. An analysis of this algorithm is presented in this paper. Let Cost(C) denote the objective function when the set of centers is C.

- Initially, C is chosen as in Step (1) above.
- For each u in X and c in C, compute Cost(C - {c} + {u}). Find (u,c) for which this cost is smallest, and replace C with C - {c} + {u}.
- Stop when Cost(C) does not change significantly after one iteration of Step (2).

How can you speed up the swap operation by avoiding searching over all (c,u) pairs? Is this scheme superior in practice to simply repeating K-Means with different random starting points? What about in theory?
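As a baseline before attempting any such speedups, one brute-force version of Step (2), which evaluates every (u,c) pair and applies the single best swap, might look like this (names ours; a precomputed distance matrix is assumed):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// The K-medians objective for a given set of centers.
double cost(const std::vector<std::vector<double>>& d, const std::vector<int>& centers) {
    double total = 0;
    for (size_t u = 0; u < d.size(); ++u) {
        double best = std::numeric_limits<double>::max();
        for (int c : centers) best = std::min(best, d[u][c]);
        total += best;
    }
    return total;
}

// Step (2) of the local search: evaluate Cost(C - {c} + {u}) for every
// (center c, point u) pair and apply the single best swap in place.
// Returns true if the cost strictly improved, false at a local optimum.
bool bestSwap(const std::vector<std::vector<double>>& d, std::vector<int>& centers) {
    int n = d.size();
    double bestCost = cost(d, centers);
    int bestI = -1, bestU = -1;
    for (size_t i = 0; i < centers.size(); ++i) {
        int old = centers[i];
        for (int u = 0; u < n; ++u) {
            centers[i] = u;                      // tentatively swap c -> u
            double c = cost(d, centers);
            if (c < bestCost) { bestCost = c; bestI = i; bestU = u; }
        }
        centers[i] = old;                        // undo the tentative swap
    }
    if (bestI < 0) return false;
    centers[bestI] = bestU;
    return true;
}
```

Step (3) then amounts to calling bestSwap until it returns false (or until the improvement per call becomes negligible). Each call is O(n * k) cost evaluations, which is exactly the search one would try to prune.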

The following is based on a clever algorithm due to Adam Meyerson and is presented in Section 2 of this paper. The algorithm assigns a cost f to each center, and finds a set of centers C that optimizes Cost(C) + f |C|, where Cost(C) is the K-medians objective defined above. It then tries out different values of f until it finds one that opens around K centers.

- Start with f being very small (close to zero), and let Q = 1.1 (or any other number slightly bigger than 1).
- Order the points randomly.
- The set C is initially empty.
- Consider the points in the random order constructed in Step (2).
- If the distance from the current point to the closest center is D, then with probability min(D/f, 1) add this point to the set C.
- When all points have been considered, if |C| > K, then increase f by a factor of Q and repeat Steps (3-6); else output C and the associated clustering and STOP.
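A sketch of this scheme in C++ (names ours; a precomputed distance matrix is assumed, and the initial f, the factor Q, and the stopping rule follow Steps (1) and (6)):

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <random>
#include <vector>

// One pass of the facility-location subroutine: stream the points in a
// random order, opening each as a center with probability min(D/f, 1),
// where D is its distance to the nearest already-open center.
std::vector<int> openCenters(const std::vector<std::vector<double>>& d,
                             double f, std::mt19937& rng) {
    int n = d.size();
    std::vector<int> order(n);
    for (int i = 0; i < n; ++i) order[i] = i;
    std::shuffle(order.begin(), order.end(), rng);   // Step (2)
    std::vector<int> centers;
    std::uniform_real_distribution<double> coin(0.0, 1.0);
    for (int p : order) {                            // Steps (4)-(5)
        double D = std::numeric_limits<double>::max();
        for (int c : centers) D = std::min(D, d[p][c]);
        if (centers.empty() || coin(rng) < std::min(D / f, 1.0))
            centers.push_back(p);
    }
    return centers;
}

// Outer loop, Step (6): raise f by a factor Q until at most K centers open.
std::vector<int> meyerson(const std::vector<std::vector<double>>& d,
                          int k, std::mt19937& rng) {
    double f = 1e-6, q = 1.1;                        // Step (1)
    while (true) {
        std::vector<int> c = openCenters(d, f, rng);
        if ((int)c.size() <= k) return c;
        f *= q;
    }
}
```

Note that the very first point in the random order is always opened (its distance to the empty center set is effectively infinite), so the loop terminates with at least one center.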

Consider the following clustering algorithm:

- Start with an arbitrary point (maybe randomly chosen) and include it in C.
- Repeat K-1 times: Pick that data point for which the closest point in C is as far away as possible, and include this point in C.
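This farthest-first traversal is short enough to sketch directly (names ours; distances assumed precomputed):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Farthest-first traversal: start from an arbitrary point, then K-1 times
// add the point whose distance to the current set of centers is largest.
std::vector<int> farthestFirst(const std::vector<std::vector<double>>& d,
                               int k, int start) {
    int n = d.size();
    std::vector<int> centers = {start};
    std::vector<double> dist(n);                 // distance to nearest center
    for (int u = 0; u < n; ++u) dist[u] = d[u][start];
    for (int t = 1; t < k; ++t) {
        int far = 0;
        for (int u = 0; u < n; ++u)
            if (dist[u] > dist[far]) far = u;    // farthest point from C
        centers.push_back(far);
        for (int u = 0; u < n; ++u)              // update nearest-center distances
            dist[u] = std::min(dist[u], d[u][far]);
    }
    return centers;
}
```

By maintaining the nearest-center distance for every point, each of the K-1 rounds costs only O(n), for O(nK) total.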

One can obtain faster implementations of all the above algorithms by randomly sampling a smaller "representative" set of points and clustering just these. An algorithm based on sampling and clustering is presented in Section 4 of this paper. After Step (6), you can run a K-medians algorithm on the centers to obtain exactly K centers. Run this algorithm using the above algorithms as a subroutine. What is the degradation in performance? What is the gain in running time?

Input: Use the code here to generate point sets with around 1000 to 10,000 points, which are mixtures of K Gaussians in 2 dimensions. (Remove the #include "malloc.h" from the source code if you get a compilation error.) Choose K = 3 or 4. As a standard input format for the clustering procedure, the first line of input should have the number of points, and then, each new line should be the x and y coordinate (separated by spaces) of a new point.

Output: The output should be the same format as the input, but each line should be pre-pended with the cluster number (use numbers 1,2,3,...) to which that point belongs. You can visualize the data, as well as the clusters using either gnuplot, or any other software (Matlab, Mathematica) you are comfortable with. No points will be given for developing fancy visual interfaces.
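Assuming the input/output conventions above, a minimal reader and writer might look like this (names ours):

```cpp
#include <cstdio>
#include <vector>

struct Point { double x, y; };

// Read the input format above: the point count on the first line,
// then one "x y" pair per line.
std::vector<Point> readPoints(FILE* in) {
    int n = 0;
    if (fscanf(in, "%d", &n) != 1) return {};
    std::vector<Point> pts(n);
    for (int i = 0; i < n; ++i)
        if (fscanf(in, "%lf %lf", &pts[i].x, &pts[i].y) != 2) break;
    return pts;
}

// Write the output format: each input line pre-pended with the
// 1-based cluster number to which that point belongs.
void writeClusters(FILE* out, const std::vector<Point>& pts,
                   const std::vector<int>& cluster) {
    for (size_t i = 0; i < pts.size(); ++i)
        fprintf(out, "%d %g %g\n", cluster[i], pts[i].x, pts[i].y);
}
```

The output format is directly plottable; for example, gnuplot can color points by the first column.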

Experiments: Compare the performance of K-medians and agglomerative clustering methods as a function of:

- How elliptical the Gaussians are. This can be controlled by making the covariance term comparable to the variance terms. K-medians works best when the Gaussians are spherical, that is, when the covariance term is close to zero and the variance terms are roughly equal.
- The separation between the means of the Gaussians. If the means are close by, the Gaussians overlap and this confuses most clustering algorithms.
- The difference between the weights of the clusters. If there is a cluster with very small weight, many algorithms have trouble discerning it as a separate cluster.

As in part I, you will need to produce the code and a document detailing what you did. You will be graded on how you addressed efficiency issues, and on the quality of experiments and discussion.

- Start with any estimate of the mean and covariance matrices of each of the K Gaussians.
- Compute (by Bayes' rule) the probability of each data point being generated by each of the K Gaussians.
- Use these probabilities to assign each data point probabilistically to one of K clusters.
- Compute the sample mean and sample covariance matrices of each of these clusters, and use these new estimates to repeat Steps (2-4).
- What termination criterion must one use here?
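The EM loop above might be sketched as follows for 2-D points (the names, the fixed iteration count, and the crude deterministic initialization are our choices; a small floor on the covariance diagonal is added to keep the matrices non-singular):

```cpp
#include <cassert>
#include <cmath>
#include <utility>
#include <vector>

const double PI = 3.14159265358979323846;

struct Gauss { double mx, my, sxx, sxy, syy, w; };  // mean, covariance, mixing weight

// Density of a 2-D Gaussian at (x, y), via the closed-form 2x2 inverse.
double density(const Gauss& g, double x, double y) {
    double det = g.sxx * g.syy - g.sxy * g.sxy;
    double dx = x - g.mx, dy = y - g.my;
    double q = (g.syy * dx * dx - 2 * g.sxy * dx * dy + g.sxx * dy * dy) / det;
    return std::exp(-0.5 * q) / (2 * PI * std::sqrt(det));
}

// A fixed number of EM iterations on 2-D points with k components.
std::vector<Gauss> emFit(const std::vector<std::pair<double, double>>& pts,
                         int k, int iters) {
    int n = pts.size();
    std::vector<Gauss> g(k);
    for (int j = 0; j < k; ++j) {                    // Step (1): crude deterministic init
        int idx = (k > 1) ? j * (n - 1) / (k - 1) : 0;
        g[j] = {pts[idx].first, pts[idx].second, 1.0, 0.0, 1.0, 1.0 / k};
    }
    std::vector<std::vector<double>> r(n, std::vector<double>(k));
    for (int it = 0; it < iters; ++it) {
        for (int i = 0; i < n; ++i) {                // Steps (2)-(3): posteriors, Bayes' rule
            double z = 0;
            for (int j = 0; j < k; ++j) {
                r[i][j] = g[j].w * density(g[j], pts[i].first, pts[i].second);
                z += r[i][j];
            }
            for (int j = 0; j < k; ++j) r[i][j] /= z;
        }
        for (int j = 0; j < k; ++j) {                // Step (4): re-estimate the parameters
            double W = 0, mx = 0, my = 0;
            for (int i = 0; i < n; ++i) {
                W += r[i][j];
                mx += r[i][j] * pts[i].first;
                my += r[i][j] * pts[i].second;
            }
            mx /= W; my /= W;
            double sxx = 0, sxy = 0, syy = 0;
            for (int i = 0; i < n; ++i) {
                double dx = pts[i].first - mx, dy = pts[i].second - my;
                sxx += r[i][j] * dx * dx;
                sxy += r[i][j] * dx * dy;
                syy += r[i][j] * dy * dy;
            }
            // Small floor on the diagonal keeps the covariance non-singular.
            g[j] = {mx, my, sxx / W + 1e-6, sxy / W, syy / W + 1e-6, W / n};
        }
    }
    return g;
}
```

A real implementation would replace the fixed iteration count with the termination criterion asked for above (for instance, stopping when the log-likelihood stops improving).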

Input/Output: The input and output formats should be the same as Part II.

Experiments: Compare the performance of the EM algorithm on the same data sets as were used in Part II. How will you quantitatively measure the quality of the clustering? Note that you should expect the EM algorithm to work better, since it is assuming the data is drawn from a Gaussian mixture. Construct a data set (clearly one that is not generated by a Gaussian mixture model) where the EM algorithm does not work as well as agglomerative clustering methods.

The criteria for grading are the same as for Parts I and II.

- Correctness of code: 20%
- Efficiency of implementation: 30%
- Experiments (creativity and conclusions drawn): 30%
- Document writing: 20%

General Comments:

- You can work in groups of two. In the document, please list the contributions of each team member.
- You can use C, C++, or Java for coding. I will not give extra points for fancy visualization. **No programming in Matlab or using pre-packaged clustering tools!!!**
- We will use an automated code checker to detect copying. You can discuss ideas with each other, but code them up individually. The same holds for writing the document.
- This project is open-ended, so there is no single "right answer". It is mainly designed to give you an idea of what algorithm design is like in real life.