18.417 Introduction to Computational Molecular Biology
Final Project: Robust Clustering Techniques in Bioinformatics
Rob Beverly

Abstract:

Clustering is a popular technique in biology to partition experimental data. For example, clustering is used to find similarities between genes, discern disease classes and build evolutionary trees. Often very simple clustering algorithms are used to great effect. The celebrated AML/ALL result used centroid-based self-organizing maps (SOMs) to perform class discovery. From gene expression patterns alone, Golub et al show the cancers forming two clusters that correspond to their AML or ALL type with few errors.

Given the problem of unsupervised class discovery, I examine two alternative clustering techniques. The first is spectral clustering which uses the spectral properties of a neighbor graph to mimic a Markov random walk which in turn defines the clusters. Spectral clustering is known to be superior in many cases, for instance when the underlying data cannot be separated by convex regions. A second approach is divisive clustering. From a k-nearest neighbors graph, compute the betweenness centrality, i.e. the number of shortest paths traversing each edge in the graph. After successively removing edges in decreasing betweenness value, the partition with the highest modularity score determines the optimal clustering. This algorithm is attractive because it does not require, a priori, any knowledge of the number of clusters.

Using spectral and divisive clustering, I find improved performance (in terms of false positives and negatives) over previous results on the canonical AML/ALL dataset. I demonstrate that spectral clustering generalizes to other datasets with similar success. On recent microarray data from patients with two types of difficult-to-diagnose malignant gliomas, I achieve a single misclassification via spectral clustering compared to the three errors in the original research.

Finally, time permitting, I will attempt to prove the theoretical soundness of spectral methods in the context of PCC and the corrupted cliques problem.

Presentation:

beverly-18.417-clustering.pdf
Datasets:
In this work, we analyze two microarray gene expression datasets:
  1. ALL/AML Leukemia
  2. Glioblastomas and Anaplastic Oligodendrogliomas
Results:
  1. ALL/AML
  2. Gliomas
Code:
Newman's Divisive Clustering C++ implementation:
FileDescription
MakefileMakefile
newman.cppMain
common.hInclude
central.cppCompute betweenness centrality
dot.cppRoutines to process/produce GraphViz output

Spectral Clustering MATLAB implementation (adapted from 6.867):

FileDescription
preproc.mPreprocess microarray data
weights.mCalculate graph edge weights
spectral.mSpectral clustering
specerr.mCalculate error from a clustering
knn.mGenerate k Nearest Neighbors
swrap.mWrapper

Miscellaneous Supporting Perl Scripts:

FileDescription
munge.plCreate gene expression matrix
makedot.plCreate a GraphViz plot of a k-NN input matrix