# Standard Markov chains

We would like to detect CpG islands in a DNA strand. Inside an island, the probability of a G following a C is higher than in the background. We assume the letters form a Markov process (memory size = 1). To determine whether a given short section is from an island or not, we calculate its likelihood under each of the two models, $P(X) = P(x_1) \prod_{i=2}^{L} a_{x_{i-1} \to x_i}$, and compare the two (e.g. via the log-odds ratio).
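A minimal sketch of this comparison, using made-up two-letter transition probabilities (not real CpG statistics):

```python
import math

def markov_log_likelihood(seq, init, trans):
    # log P(X) = log P(x_1) + sum over i of log a_{x_{i-1} -> x_i}
    ll = math.log(init[seq[0]])
    for prev, cur in zip(seq, seq[1:]):
        ll += math.log(trans[prev][cur])
    return ll

# Hypothetical models for illustration: inside the island, C -> G is more likely.
init = {"C": 0.5, "G": 0.5}
island = {"C": {"C": 0.3, "G": 0.7}, "G": {"C": 0.5, "G": 0.5}}
background = {"C": {"C": 0.7, "G": 0.3}, "G": {"C": 0.5, "G": 0.5}}

def log_odds(seq):
    # Positive score: the island model fits better; negative: background.
    return (markov_log_likelihood(seq, init, island)
            - markov_log_likelihood(seq, init, background))
```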

# General case – hidden Markov models

When does a CpG island start and stop in the strand? The analogy used is an occasionally dishonest casino that sometimes switches to a loaded die and then returns to using a fair die. We model a hidden layer of states $\pi_i$ with transition matrix $a$, and $X$ is the emitted sequence, with emission probability $e_{\pi_i}(x_i)$. The most probable path can be found by Viterbi = DP with back pointers. Note that if the string length is $m$ and there are $n$ states, the complexity is $O(m \cdot n^2)$, since for each state at each position we consider all states at the previous position — and it DOES NOT depend on the alphabet size, as the emissions are known!
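A sketch of Viterbi in log space for the casino example (the transition and emission numbers below are hypothetical, chosen for illustration):

```python
import math

def viterbi(obs, states, start, trans, emit):
    # DP in log space; O(m * n^2) for m observations and n states.
    V = [{s: math.log(start[s]) + math.log(emit[s][obs[0]]) for s in states}]
    back = []
    for x in obs[1:]:
        row, ptrs = {}, {}
        for s in states:
            prev = max(states, key=lambda p: V[-1][p] + math.log(trans[p][s]))
            row[s] = V[-1][prev] + math.log(trans[prev][s]) + math.log(emit[s][x])
            ptrs[s] = prev
        V.append(row)
        back.append(ptrs)
    # Follow the back pointers to reconstruct the most probable path.
    state = max(states, key=lambda s: V[-1][s])
    path = [state]
    for ptrs in reversed(back):
        state = ptrs[state]
        path.append(state)
    return path[::-1]

# Hypothetical casino parameters (for illustration only):
states = ["F", "L"]                                     # fair / loaded
start = {"F": 0.9, "L": 0.1}
trans = {"F": {"F": 0.9, "L": 0.1},
         "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {r: 1 / 6 for r in "123456"},              # fair die: uniform
        "L": {**{r: 0.08 for r in "12345"}, "6": 0.6}}  # loaded die favors 6

print(viterbi("123666", states, start, trans, emit))
# -> ['F', 'F', 'F', 'L', 'L', 'L']
```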

## Posterior state probabilities

We want to calculate $P(\pi_i = k \mid X)$. We first compute the joint probability $P(\pi_i = k, X)$. This factors into the probability of the prefix jointly with $\pi_i = k$, times the probability of the suffix conditioned on $\pi_i = k$. We find these with the forward and backward algorithms.

• Q: why do we need the backward algorithm — can’t we just continue forward from $\pi_i$?
• A: because the two recursions compute different quantities: forward computes the joint probability of the prefix with the state, $f_k(i) = P(x_1 \ldots x_i, \pi_i = k)$, while backward computes the conditional probability of the suffix given the state, $b_k(i) = P(x_{i+1} \ldots x_L \mid \pi_i = k)$ — and the suffix term must be conditioned on the state at position $i$, which the forward recursion does not give us.

We then calculate $P(X)$ over the entire sequence and divide by it: $P(\pi_i = k \mid X) = \frac{P(\pi_i = k, X)}{P(X)} = \frac{P(\mathrm{pref}(X), \pi_i = k) \cdot P(\mathrm{suf}(X) \mid \pi_i = k)}{P(X)}$.
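A sketch of the forward/backward computation on a toy two-state HMM (all parameters below are hypothetical); the posteriors at each position must sum to 1:

```python
def posterior(obs, states, start, trans, emit):
    n = len(obs)
    # Forward: f[i][k] = P(x_1 .. x_i, pi_i = k)
    f = [{k: start[k] * emit[k][obs[0]] for k in states}]
    for x in obs[1:]:
        f.append({k: emit[k][x] * sum(f[-1][j] * trans[j][k] for j in states)
                  for k in states})
    # Backward: b[i][k] = P(x_{i+1} .. x_n | pi_i = k)
    b = [{k: 1.0 for k in states}]
    for x in reversed(obs[1:]):
        b.insert(0, {k: sum(trans[k][j] * emit[j][x] * b[0][j] for j in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)  # P(X)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(n)]

# Hypothetical two-state parameters for illustration:
states = ["F", "L"]
start = {"F": 0.5, "L": 0.5}
trans = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.1, "L": 0.9}}
emit = {"F": {"H": 0.5, "T": 0.5}, "L": {"H": 0.9, "T": 0.1}}

post = posterior("HHT", states, start, trans, emit)
```

In practice these products underflow for long sequences, so the recursions are run in log space or with per-position scaling.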

## Posterior decoding

Now to decode, we can’t go position by position and take the max-probability state, as the result might contradict the other positions (e.g. produce a path containing a zero-probability transition)… We define a function of interest which is 1 at interesting states and 0 at all other states, and the posterior probability $G(i \mid X)$ is the sum of the posterior state probabilities over the interesting states.
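For example, if the “interesting” states are the loaded (or island) states, $G(i \mid X)$ is just a sum over posteriors that were already computed (the numbers below are hypothetical):

```python
def G(i, post, interesting):
    # g(k) = 1 for interesting states, 0 otherwise, so
    # G(i|X) = sum over interesting k of P(pi_i = k | X).
    return sum(post[i][k] for k in interesting)

# Hypothetical posterior state probabilities for a 3-position sequence:
post = [{"F": 0.8, "L": 0.2},
        {"F": 0.4, "L": 0.6},
        {"F": 0.7, "L": 0.3}]

print([G(i, post, {"L"}) for i in range(3)])
# -> [0.2, 0.6, 0.3]
```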

## HMM parameter estimation – known state sequence

We get a multiple sequence alignment and output a profile HMM. We determine the match-state positions by some heuristic (e.g. more than 50% non-gap characters in the column). Then we estimate the parameters by counting transitions:
* In a match column: letter = match, gap = deletion
* In a non-match column: letter = insertion, gap = nothing

Then we estimate each transition probability $a(\text{state} \to \text{next}) = \frac{cnt(\text{state} \to \text{next})}{\sum_{s'} cnt(\text{state} \to s')}$. But when the training data is small, this leaves a lot of zero-probability transitions. It is better to use the Laplace rule: $a(\text{state} \to \text{next}) = \frac{cnt(\text{state} \to \text{next}) + 1}{\sum_{s'} \left(cnt(\text{state} \to s') + 1\right)}$ (the simplest case of Dirichlet priors).
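A small sketch of both estimators (the state names and transition counts below are made up for illustration):

```python
def estimate(counts, states, pseudocount=0):
    # a(s -> t) = (cnt(s->t) + pseudocount) / sum_t' (cnt(s->t') + pseudocount)
    a = {}
    for s in states:
        total = sum(counts.get((s, t), 0) + pseudocount for t in states)
        a[s] = {t: (counts.get((s, t), 0) + pseudocount) / total for t in states}
    return a

states = ["M", "I", "D"]                  # match / insert / delete
counts = {("M", "M"): 8, ("M", "I"): 1,   # (M, D) was never observed
          ("I", "M"): 2, ("D", "M"): 1}

ml = estimate(counts, states)             # maximum likelihood: a(M->D) = 0
laplace = estimate(counts, states, 1)     # Laplace rule: a(M->D) = 1/12 > 0
```

With the pseudocount, unseen transitions keep a small nonzero probability instead of being ruled out forever by a small training set.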

• Q: in slide 17, what does training a profile HMM from unaligned sequences mean? “obtain MA with the resulting profile as before”??

## HMM parameter estimation – unknown state sequence

If the states are unknown, we use a version of the EM algorithm (Baum–Welch). We are given multiple sequences. In the E-step, we compute expected counts:

• $E_k(b)$ – the expected number of times state $k$ emits the character $b$, summed (via the posterior state probabilities) over every position where $b$ appears, in every one of the sequences
• $A_{kl}$ – the expected number of $k \to l$ transitions, summed over every position in every sequence
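Concretely, using the forward and backward values (these are the standard Baum–Welch expected-count formulas, with sequences indexed by $j$):

$$E_k(b) = \sum_j \frac{1}{P(X^j)} \sum_{i:\, x_i^j = b} f_k^j(i)\, b_k^j(i), \qquad A_{kl} = \sum_j \frac{1}{P(X^j)} \sum_i f_k^j(i)\, a_{kl}\, e_l(x_{i+1}^j)\, b_l^j(i+1)$$

In the M-step these expected counts replace the observed counts in the known-state estimation formulas, e.g. $a_{kl} = A_{kl} / \sum_{l'} A_{kl'}$, and we iterate until the likelihood converges.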

## Recitation insights

Viterbi for a string against a profile – the original Viterbi shown was for a string of length $m$ against a single layer of $n$ fully connected states = $O(m \cdot n^2)$. In a profile HMM there are $L$ layers (an M/I/D triplet per layer), but each state has only a constant number of incoming edges, so the complexity is $O(m \cdot L)$ – for every character and every layer we compute the best probability of being in the I/D/M state, based on the incoming edges.
The same goes for computing the forward/backward probabilities (only with a sum instead of a max…)

## Questions

1. What’s the relation between HMM estimation when the states are known, HMM estimation when the states are unknown, profile HMMs, and EM?
2. What is training a profile HMM from unaligned sequences? Were they aligned before? The only difference I see is the length of the model…
3. What’s the connection to the search techniques we learned previously (FASTA et al.)?
4. In the intro to HMMs (hmm-2nd-lec, slide 7) they talk about the price of gaps! What do they mean? How is that used in the following slides illustrating HMM estimation?