# Computational Genomics 9 – clustering

## Clustering motivation

Consider a gene-expression matrix with genes in the rows and samples in the columns. We can either

1. cluster samples – class discovery; the discovered classes are then tested for relevance to phenotype = clinical outcomes (e.g. Kaplan–Meier survival curves)
2. cluster genes – groups of co-regulated genes, which we then test for functional enrichment = guilt by association

Q: what does the above mean?

## Hierarchical clustering – Average linkage

Agglomerative rather than divisive: start with a singleton cluster for each item, and in each step merge the two closest clusters r and s. The distance from the merged cluster to every other cluster is a weighted average of the previous distances from r and s (weighted by cluster sizes). Using min or max instead of the average gives single or complete linkage, respectively.
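
The merge loop above can be sketched as follows – a minimal naive version (recomputing average distances from scratch each step rather than using the weighted-average update; the distance matrix is a hypothetical input, not from the notes):

```python
import itertools

def average_linkage(dist, k):
    """Agglomerative clustering with average linkage.

    dist: symmetric matrix of pairwise distances between items.
    k: target number of clusters.
    Returns a list of clusters, each a list of item indices.
    """
    clusters = [[i] for i in range(len(dist))]  # one singleton per item
    while len(clusters) > k:
        # find the pair of clusters with the smallest average pairwise distance
        i, j = min(itertools.combinations(range(len(clusters)), 2),
                   key=lambda p: sum(dist[a][b]
                                     for a in clusters[p[0]]
                                     for b in clusters[p[1]])
                                 / (len(clusters[p[0]]) * len(clusters[p[1]])))
        # merge the closest pair
        clusters[i] += clusters.pop(j)
    return clusters
```

On two well-separated pairs, e.g. `dist = [[0,1,10,10],[1,0,10,10],[10,10,0,1],[10,10,1,0]]`, `average_linkage(dist, 2)` recovers the two pairs.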

## Non-hierarchical clustering – K-means

An iterative, EM-like procedure: alternately assign each point to its nearest centroid (E-step) and recompute each centroid as the mean of its assigned points (M-step), until the assignments stop changing.
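
A minimal sketch of this alternation (Lloyd's algorithm; random initialization and the example data are assumptions, not from the notes):

```python
import random

def kmeans(points, k, iters=100, seed=0):
    """Alternate an E-like step (assign each point to its nearest centroid)
    and an M-like step (recompute each centroid as the mean of its cluster)
    until the centroids stop moving."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialize with k random points
    groups = [[] for _ in range(k)]
    for _ in range(iters):
        # E-step: assign each point to its nearest centroid (squared distance)
        groups = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            groups[j].append(p)
        # M-step: move each centroid to the mean of its assigned points
        new = [tuple(sum(col) / len(g) for col in zip(*g)) if g else centroids[j]
               for j, g in enumerate(groups)]
        if new == centroids:  # converged
            break
        centroids = new
    return centroids, groups
```

For two well-separated blobs, e.g. `[(0,0), (0,1), (10,10), (10,11)]` with `k=2`, the centroids converge to the two blob means.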

## CLICK

Mates are a pair of points from the same cluster; the similarity between two points can be, e.g., the dot product of their expression vectors. We expect the similarity between mates to be higher than the similarity between non-mates. The main idea is to look for a good cut into 2 clusters, which turns out to be the min-cut in the similarity graph! Note that sometimes we don’t use the whole graph – edges with very low weight are removed.

We sum the weights over all the edges in the cut and decide between H0 = all the edges in the cut connect non-mates, and H1 = all the edges connect mates. A kernel is a subgraph in which every possible cut favors H1. So: if even for the min-cut H1 is more probable, stop – we have reached a kernel. Else, split by the min-cut and recurse on the two sides.

The initial distribution parameters (of mate and non-mate similarities) are estimated by EM. A lemma we had to prove: when each edge is weighted by its log-likelihood ratio, the weight of any cut equals the log of the probability of the similarities in the cut under H1 divided by their probability under H0.
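
The lemma can be written out symbolically – here f_1 and f_0 denote the similarity densities for mates and non-mates respectively (the notation is assumed, not from the notes):

```latex
% Assumed edge weights: log-likelihood ratio of the observed similarity
w_{ij} = \log \frac{f_1(S_{ij})}{f_0(S_{ij})}

% Since the edge similarities are independent given the hypothesis,
% for any cut C the weight is the log-likelihood ratio of the whole cut:
W(C) = \sum_{(i,j) \in C} w_{ij}
     = \log \prod_{(i,j) \in C} \frac{f_1(S_{ij})}{f_0(S_{ij})}
     = \log \frac{\Pr\bigl(\{S_{ij}\}_{(i,j) \in C} \mid H_1\bigr)}
                 {\Pr\bigl(\{S_{ij}\}_{(i,j) \in C} \mid H_0\bigr)}
```

This is why a positive min-cut weight means H1 is more probable even for the worst cut, i.e. we have reached a kernel.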

## Recitation 9 – clustering

min-max intra-cluster distance – pay the price of the largest distance inside any cluster. Greedy farthest-first heuristic: repeatedly pick the point farthest from the current heads as the next head. The intuition: make sure that the farthest points end up in different clusters.

In every iteration, each point’s distance to its nearest head only gets better (it can only decrease as heads are added). Let delta = the worst such distance at the end; then the k heads together with the point realizing delta are k+1 points that are pairwise at distance at least delta. So in any partition into k clusters – including the optimal one – two of these k+1 points share a cluster, and the optimal cost is at least delta.

By the triangle inequality, our solution costs at most 2*delta – from any point to its head and from the head to another point in the same cluster. Hence the greedy solution is a 2-approximation.
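
The farthest-first head selection can be sketched as follows (the function name and matrix input are assumptions, not from the recitation):

```python
def farthest_first(dist, k):
    """Greedy 2-approximation for the min-max intra-cluster distance
    objective: repeatedly add as a new head the point farthest from
    its nearest current head.

    dist: symmetric matrix of pairwise distances (must satisfy the
    triangle inequality for the 2*delta guarantee to hold).
    Returns the indices of the k chosen cluster heads; each remaining
    point then joins its nearest head's cluster.
    """
    heads = [0]  # start from an arbitrary point
    while len(heads) < k:
        # the point whose distance to its nearest head is largest
        nxt = max(range(len(dist)),
                  key=lambda p: min(dist[h][p] for h in heads))
        heads.append(nxt)
    return heads
```

For the 1-D points 0, 1, 10, 11 with k=2, starting from point 0 the farthest point is 11, so the heads are indices 0 and 3, and each remaining point is within distance 1 of a head.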