# Computational genomics 4 – Multiple sequence alignment

There are 3 scores used for measuring a multiple sequence (string) alignment:

1. SOP – sum of pairs distances
2. Consensus – the minimum distance from an external string
3. Tree – given an evolutionary tree, sum the distances on its paths

## SOP MA

Note that the distance between pairs is measured after the strings are padded with spaces according to the MSA. The DP solution for 2 strings can be extended to k strings, filling an $n^k$ cube with $2^k - 1$ candidate moves per cell = impractical.
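For reference, here is the 2-string DP that the cube generalizes – a minimal sketch with unit costs:

```python
def edit_distance(s, t):
    # Classic O(n*m) DP; the k-string SOP version fills an n^k cube
    # instead, with 2^k - 1 candidate moves per cell.
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i                      # delete all of s[:i]
    for j in range(m + 1):
        dp[0][j] = j                      # insert all of t[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,                       # gap in t
                           dp[i][j - 1] + 1,                       # gap in s
                           dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]))  # (mis)match
    return dp[n][m]
```

In the k-string version each cell looks at every non-empty subset of the strings (those that consume a character in that move), which is where the $2^k - 1$ comes from.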

There’s a nice pruning trick we can use in our DP – the pairwise distances induced by any MSA are always at least the optimal pairwise distances, so the sum of pairwise distances between the remaining suffixes is a lower bound on the cost still to pay. We can efficiently precompute the pairwise distance for every pair of suffixes, and if the cost of the current partial solution plus that lower bound is already worse than a known solution, stop.
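A sketch of that bound under unit edit costs; `suffix_costs` and `lower_bound` are names invented here, and `pos[p]` marks how far the partial MSA has consumed string `p`:

```python
from itertools import combinations

def suffix_costs(s, t):
    """suf[i][j] = edit distance between s[i:] and t[j:], filled by a
    backward DP over the pairwise table."""
    n, m = len(s), len(t)
    suf = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n, -1, -1):
        for j in range(m, -1, -1):
            if i == n and j == m:
                continue
            best = []
            if i < n:
                best.append(suf[i + 1][j] + 1)                    # gap in t
            if j < m:
                best.append(suf[i][j + 1] + 1)                    # gap in s
            if i < n and j < m:
                best.append(suf[i + 1][j + 1] + (s[i] != t[j]))   # (mis)match
            suf[i][j] = min(best)
    return suf

def lower_bound(strings, pos, tables):
    """tables[a, b] = suffix_costs(strings[a], strings[b]). The sum of
    pairwise suffix distances can only underestimate the SOP cost still
    to come, so it is a safe pruning bound."""
    k = len(strings)
    return sum(tables[a, b][pos[a]][pos[b]] for a, b in combinations(range(k), 2))
```

If `current_cost + lower_bound(...)` already exceeds the best known solution, the branch can be abandoned.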

## Center star alg

1. Find the string $S^*$ (the center) that minimizes the sum of distances to all the other strings, $\sum_i D(S^*, S_i)$.
2. Iteratively align each remaining string to $S^*$, and whenever this inserts a space into $S^*$, insert it into all the strings aligned so far (“once a gap, always a gap”).
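A minimal sketch of the whole algorithm, assuming unit edit costs; `align_pair`, `merge_into` and `center_star` are names invented here:

```python
def align_pair(s, t):
    """Optimal global alignment with unit costs; returns (cost, gapped s, gapped t)."""
    n, m = len(s), len(t)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1, dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    a, b, i, j = [], [], n, m          # traceback
    while i or j:
        if i and j and dp[i][j] == dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]):
            a.append(s[i - 1]); b.append(t[j - 1]); i -= 1; j -= 1
        elif i and dp[i][j] == dp[i - 1][j] + 1:
            a.append(s[i - 1]); b.append('-'); i -= 1
        else:
            a.append('-'); b.append(t[j - 1]); j -= 1
    return dp[n][m], ''.join(reversed(a)), ''.join(reversed(b))

def merge_into(msa, cs, ts):
    """Merge a pairwise alignment (cs = gapped center, ts = gapped new string)
    into an MSA whose row 0 is the gapped center: once a gap, always a gap."""
    k, L = len(msa), len(msa[0])
    out = [[] for _ in range(k + 1)]
    i = j = 0
    while i < L or j < len(cs):
        if j < len(cs) and (i >= L or (cs[j] == '-' and msa[0][i] != '-')):
            # the pairwise alignment opened a new gap in the center:
            # push a gap column through every existing row
            for r in range(k):
                out[r].append('-')
            out[k].append(ts[j]); j += 1
        elif i < L and (j >= len(cs) or (msa[0][i] == '-' and cs[j] != '-')):
            # the MSA has a gap column the pairwise alignment lacks
            for r in range(k):
                out[r].append(msa[r][i])
            out[k].append('-'); i += 1
        else:
            # same center symbol (or a shared gap) on both sides
            for r in range(k):
                out[r].append(msa[r][i])
            out[k].append(ts[j]); i += 1; j += 1
    return [''.join(row) for row in out]

def center_star(strings):
    """Returns (row order, rows of the multiple alignment)."""
    k = len(strings)
    cost = [[align_pair(a, b)[0] for b in strings] for a in strings]
    c = min(range(k), key=lambda i: sum(cost[i]))   # the center S*
    msa, order = [strings[c]], [c]
    for i in range(k):
        if i != c:
            _, cs, ts = align_pair(strings[c], strings[i])
            msa = merge_into(msa, cs, ts)
            order.append(i)
    return order, msa
```

For example, `center_star(["ACGT", "ACT", "AGT"])` picks `ACGT` as the center and produces the rows `ACGT`, `AC-T`, `A-GT`.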

We can show that this alg achieves an approximation ratio of 2 for the SOP of its aligned strings.

Upper bound: by the triangle inequality, each induced pairwise distance is at most the sum of the two distances through the center, and since $\delta(-,-) = 0$ the aligned distance of any string to the center equals the original $D(S_i, S^*)$ – the added spaces don’t matter. Summing over all pairs, the SOP of the star alignment is at most $(k-1)$ times the sum of distances to the center.

Lower bound: the SOP of OPT is at least the sum of the original pairwise distances, which is at least $\frac{k}{2}$ times the sum of distances to the center (that’s how we selected the center). The ratio is therefore $2(k-1)/k < 2$.
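Written out, with $d$ the pairwise distances induced by the star alignment $M$ and $S_c$ the center:

$$\mathrm{SOP}(M) = \sum_{i<j} d(i,j) \le \sum_{i<j} \bigl( D(S_i, S_c) + D(S_j, S_c) \bigr) = (k-1) \sum_{i \ne c} D(S_i, S_c)$$

$$\mathrm{SOP}(\mathrm{OPT}) \ge \sum_{i<j} D(S_i, S_j) = \frac{1}{2} \sum_{i} \sum_{j \ne i} D(S_i, S_j) \ge \frac{k}{2} \sum_{j \ne c} D(S_c, S_j)$$

Dividing the two bounds gives $\frac{k-1}{k/2} = 2 - \frac{2}{k} < 2$.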

## Consensus error – NO RELATION TO MA

The consensus error of a string $S$ external to our set is the sum of its distances to all of our strings. The string which minimizes that sum is called the optimal Steiner string = the minimizer of $\sum_i D(S^*, S_i)$.

Turns out that the Steiner string is 2-approximated by the center string! This basically has to do with the fact that if you build an MA from a star around the Steiner string, its consensus alignment error equals the sum of the global alignment errors to the center. So both the consensus error and the alignment error are minimized around that wonderful Steiner string.

## Opt Consensus alignment error – MA

Back to MA – once we have an alignment $S'_1, \dots, S'_k$, we can build an optimal consensus string which minimizes $\sum_i d(S_{con}, S'_i)$. Surprisingly, it turns out that the optimal Steiner string = the optimal consensus string!
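Under a unit mismatch cost (counting the space as just another symbol), the optimal consensus for a fixed alignment is simply the column-wise majority, with winning spaces dropped at the end – a minimal sketch:

```python
from collections import Counter

def consensus(msa):
    """Column-wise majority vote over equal-length alignment rows;
    ties broken by first appearance in the column."""
    cols = zip(*msa)
    picked = [Counter(col).most_common(1)[0][0] for col in cols]
    # spaces that win a column are dropped from the final string
    return ''.join(ch for ch in picked if ch != '-')
```

Each column is independent under a unit cost, so the per-column majority minimizes the total consensus alignment error.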

If we examine the optimal consensus alignment error for the MA of the center star alg, it is at most the consensus alignment error measured with the center string in the consensus role; and since every string was aligned optimally against the center and $\delta(-,-) = 0$, that error equals the plain consensus error of the center string – which is a 2-approximation of the Steiner string’s consensus error.

=> The center star MA is also a 2-approximation for the optimal consensus alignment error!

## Phylogenetic = Tree

The tree is given, with a string at each leaf. We would like to assign a string to every internal node s.t. the total cost along the edges is minimal. In a lifted alignment, every internal node is assigned the string of one of its children; the lifted alignment alg can achieve a 2-approximation:

For the analysis, we assume we know the optimal tree alignment $T^*$. We traverse post-order, and at each vertex whose optimal string is $S^*$ we lift the child string $S_j$ that is closest to $S^*$.

Now consider another child $S_i$ in the lifted tree: $D(S_j, S_i) \le D(S_j, S^*) + D(S_i, S^*) \le 2D(S_i, S^*)$, since $S_j$ was the closest child. We don’t know which string the optimal tree assigned to that child, but we do know, by the triangle inequality, that the total cost of the path in $T^*$ from $S_i$ up to $S^*$ is at least $D(S_i, S^*)$. So each lifted edge costs at most twice an entire path in the optimal tree. These paths are disjoint for the different non-zero edges of the lifted tree, and summing everything up gives the proof.

#### The DP itself

For the DP, each node remembers, for every candidate string (a leaf string from its subtree), the optimal cost of its subtree under that assignment. At a new node (post-order), we consider each candidate string $S$, and for each child we add the minimum over the child’s candidates $T$ of the child’s cost plus $D(S, T)$.
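A sketch of that DP, with trees as nested lists, leaf strings as the candidate labels, and a unit-cost edit distance standing in for $D$:

```python
def edit_distance(s, t):
    # unit-cost edit distance, standing in for D (rolling-array DP)
    dp = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        prev, dp[0] = dp[0], i
        for j, ct in enumerate(t, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (cs != ct))
    return dp[-1]

def lifted_alignment(node):
    """node is a leaf string or a list of child nodes. Returns a dict
    mapping each candidate label (a leaf string below the node) to the
    optimal cost of the subtree if the node is assigned that label."""
    if isinstance(node, str):
        return {node: 0}
    tables = [lifted_alignment(child) for child in node]
    candidates = set().union(*tables)
    return {s: sum(min(cost + edit_distance(s, t) for t, cost in table.items())
                   for table in tables)
            for s in candidates}
```

The optimal lifted cost is the minimum over the root’s table, e.g. `min(lifted_alignment(tree).values())`.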

## MA – greedy heuristics and progressive alignments

We extend the DP to work not only with letters, but also with profiles = distributions over the letters (and the space). The scoring function between 2 profile columns can be as simple as an expectation over the original scoring function. Now we can build a guide tree and progressively add more and more strings to the MA.
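For example, with profile columns as symbol → probability dicts, the expected score is just a double sum; `expected_score` is a name invented here:

```python
def expected_score(p, q, delta):
    """Expected value of delta(a, b) when a ~ p and b ~ q; p and q are
    profile columns given as dicts mapping symbol -> probability."""
    return sum(pa * qb * delta(a, b) for a, pa in p.items() for b, qb in q.items())
```

With a unit mismatch cost, a column that is half A / half C scores 0.5 against a pure-A column.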