Computational Genomics 5 – Phylogeny

Distance based methods

We are given a distance matrix, and would like to build a tree where the SSQ of the differences between (true distance – tree distance) is minimal (with possible weighting).


We repeatedly merge the closest clusters. We start where each string is its own 1-sized cluster. Naively, at each stage you need to recalculate the new formed cluster distance with all other clusters O(N), and there are O(N) stages -> not sure why this is O(n^3)???

A major setback of UPGMA is that it assumes ultrametry – all leafs are the same distance from the root = common uniform molecular clock.

Neighbor joining

There’s a method for detecting the neighbors in a tree. We join neighbors, and recalculate the new node distance to all other leafs using additivity.

Character based methods

Characters (sites) are independent. Small parsimony – given a tree, strings in leafs – find strings in inner nodes to minimize changes. Large parsimony – find the best tree.

Fitch alg for small parsimony – scan the tree post-order and prepare list of characters for the site. Whenever possible, do S_l \cap S_r – if not possible do S_l \cup S_r. Now for the solution generation do a pre-order and select characters from each node set S – whenever possible choose the same character as for the parent – if not, select a random character.

Weighted small parsimony – in the first step, prepare a cost for each state X node:

S_k(v) = min_m[S_m(l)+C_{mk}] + min_m[S_m(r)+C_{mk}]

In the exercise, we proved that assuming the cost is a metric (distance like properties) you can move the root and still get the same tree cost. TBD: why did we actually deleted the node and created a new one? Isn’t the question about just redefining another node as the root?

Large parsimony – NP-hard, but there are heuristics..

Probabilistic approaches

We want to find labeling for internal nodes + branch lengths. We assume that each site is independent + the branching is Markovian. For each assignment of letters per internal node – the likelihood is according to Bayes law a product of all  conditional probabilities. We can then use DP to post-order traverse and get likelihood of every node given likelihood of its children.

Because of additivity and symmetry we can move the root anywhere. So to find optimal edge length we move the root to one of its vertices, and do standard optimization on the log-likelihood of the tree. We can then rotate between all edges.

Time models – Jukes-Cantor

To figure out the time of changes in DNA, we assume that switching to another letter is 3 times as likely as staying in the same letter for an infinitesimal period of time…

Also, turns out that you can reverse the likelihood in a Bayes like fashion: p(ancestor)*p(bird|ancestor) = p(bird)*p(ancestor|bird)

And you can also add probabilities: p(dinosaur|ancestor,t)*p(ancestor|bird,t) = p(dinosaur|bird,2t)


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s