Abstract:
Clustering is a popular technique in biology to partition experimental data. For example, clustering is used to find similarities between genes, discern disease classes and build evolutionary trees. Often very simple clustering algorithms are used to great effect. The celebrated AML/ALL result used centroid-based self-organizing maps (SOMs) to perform class discovery. From gene expression patterns alone, Golub et al show the cancers forming two clusters that correspond to their AML or ALL type with few errors.Given the problem of unsupervised class discovery, I examine two alternative clustering techniques. The first is spectral clustering which uses the spectral properties of a neighbor graph to mimic a Markov random walk which in turn defines the clusters. Spectral clustering is known to be superior in many cases, for instance when the underlying data cannot be separated by convex regions. A second approach is divisive clustering. From a k-nearest neighbors graph, compute the betweenness centrality, i.e. the number of shortest paths traversing each edge in the graph. After successively removing edges in decreasing betweenness value, the partition with the highest modularity score determines the optimal clustering. This algorithm is attractive because it does not require, a priori, any knowledge of the number of clusters.
Using spectral and divisive clustering, I find improved performance (in terms of false positives and negatives) over previous results on the canonical AML/ALL dataset. I demonstrate that spectral clustering generalizes to other datasets with similar success. On recent microarray data from patients with two types of difficult-to-diagnose malignant gliomas, I achieve a single misclassification via spectral clustering compared to the three errors in the original research.
Finally, time permitting, I will attempt to prove the theoretical soundness of spectral methods in the context of PCC and the corrupted cliques problem.
Presentation:
beverly-18.417-clustering.pdfDatasets:
In this work, we analyze two microarray gene expression datasets:Results:
Code:
Newman's Divisive Clustering C++ implementation:
File Description Makefile Makefile newman.cpp Main common.h Include central.cpp Compute betweenness centrality dot.cpp Routines to process/produce GraphViz output Spectral Clustering MATLAB implementation (adapted from 6.867):
File Description preproc.m Preprocess microarray data weights.m Calculate graph edge weights spectral.m Spectral clustering specerr.m Calculate error from a clustering knn.m Generate k Nearest Neighbors swrap.m Wrapper Miscellaneous Supporting Perl Scripts:
File Description munge.pl Create gene expression matrix makedot.pl Create a GraphViz plot of a k-NN input matrix