Computational genomics 6 – HMM

Standard Markov chains

We would like to detect CpG islands in a DNA strand. Inside an island, the probability of a G following a C is higher than in the background. We assume the letters form a Markovian process (memory size = 1). To decide whether a given short window comes from an island or not, we compute its likelihood under each of the two models, P(X) = P(x_1) \prod_{i=2}^{m} a_{x_{i-1} \to x_i}, once with island transition probabilities and once with background ones, and compare them (e.g. by the log-odds ratio).
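
As a toy illustration, a minimal Python sketch of that comparison; the transition matrices below are made-up illustrative numbers (not estimated from real data), and the P(x_1) term is ignored for simplicity:

import math

# Made-up transition probabilities a[prev][next] for the two models; real values
# would be estimated from annotated CpG-island and background training sequences.
a_island = {
    'A': {'A': 0.18, 'C': 0.27, 'G': 0.43, 'T': 0.12},
    'C': {'A': 0.17, 'C': 0.37, 'G': 0.27, 'T': 0.19},
    'G': {'A': 0.16, 'C': 0.34, 'G': 0.38, 'T': 0.12},
    'T': {'A': 0.08, 'C': 0.36, 'G': 0.38, 'T': 0.18},
}
a_background = {
    'A': {'A': 0.30, 'C': 0.21, 'G': 0.28, 'T': 0.21},
    'C': {'A': 0.32, 'C': 0.30, 'G': 0.08, 'T': 0.30},  # note the low C->G
    'G': {'A': 0.25, 'C': 0.24, 'G': 0.30, 'T': 0.21},
    'T': {'A': 0.18, 'C': 0.24, 'G': 0.29, 'T': 0.29},
}

def log_likelihood(x, a):
    """log P(X) under a first-order Markov chain (the P(x_1) term is ignored)."""
    return sum(math.log(a[x[i - 1]][x[i]]) for i in range(1, len(x)))

def log_odds(x):
    """Positive => the window looks more like a CpG island than like background."""
    return log_likelihood(x, a_island) - log_likelihood(x, a_background)

print(log_odds("CGCGGCGCCG"))   # positive: many C->G transitions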

General case – hidden Markov models

Where does a CpG island start and stop in the strand? The analogy used is an occasionally unfair casino that sometimes switches to a loaded die and later returns to the fair one. We model a hidden layer of states \pi_i with transition matrix a, and X is the emitted sequence, with emission probabilities e_{\pi_i}(x_i). The most probable state path can be found by Viterbi = DP with back pointers. Note that if the string length is m and there are n states, the complexity is O(m \cdot n^2): for each state at each position we consider all n states at the previous position. It does NOT depend on the alphabet size, as the emitted characters are known!
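
A minimal log-space Viterbi sketch for the casino example; the transition and emission probabilities below are illustrative values I picked, not the lecture's:

import math

states = ['F', 'L']                      # fair / loaded die (the hidden layer)
start = {'F': 0.5, 'L': 0.5}
trans = {'F': {'F': 0.95, 'L': 0.05},    # illustrative transition matrix a
         'L': {'F': 0.10, 'L': 0.90}}
emit = {'F': {r: 1 / 6 for r in '123456'},               # e_F(x): fair die
        'L': {**{r: 0.1 for r in '12345'}, '6': 0.5}}    # loaded die favours 6

def viterbi(x):
    """Most probable state path: DP over (position, state) with back pointers."""
    V = [{k: math.log(start[k]) + math.log(emit[k][x[0]]) for k in states}]
    ptr = [{}]
    for i in range(1, len(x)):                   # O(m * n^2) overall
        V.append({})
        ptr.append({})
        for k in states:                         # state at position i
            best = max(states, key=lambda j: V[i - 1][j] + math.log(trans[j][k]))
            ptr[i][k] = best
            V[i][k] = V[i - 1][best] + math.log(trans[best][k]) + math.log(emit[k][x[i]])
    last = max(states, key=lambda k: V[-1][k])   # best final state
    path = [last]
    for i in range(len(x) - 1, 0, -1):           # follow the back pointers
        path.append(ptr[i][path[-1]])
    return ''.join(reversed(path))

print(viterbi("315662666612"))

Working in log space avoids numerical underflow; the two nested loops over states inside the position loop are exactly the n^2 factor.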

Posterior state probabilities

We want to calculate P(\pi_i = k | X). First we compute the joint probability P(\pi_i = k, X). It factors into the joint probability of the prefix with \pi_i = k, times the probability of the suffix given \pi_i = k. These two terms are computed by the forward and backward algorithms.

  • Q: why do we need the backward alg, and can’t just keep moving forward from \pi_i?
  • A: because the forward algorithm gives the joint probability of the prefix with \pi_i = k, while the backward algorithm gives the conditional probability of the suffix given \pi_i = k; we need both factors to assemble P(\pi_i = k, X).

We then compute P(X) (which the forward algorithm also gives us, as the sum of the final forward values) and divide by it: P(\pi_i = k | X) = \frac{P(\pi_i = k, X)}{P(X)} = \frac{P(x_1..x_i, \pi_i = k) \cdot P(x_{i+1}..x_m | \pi_i = k)}{P(X)} = \frac{f_k(i)\, b_k(i)}{P(X)}.
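
A sketch of the forward/backward computation in plain probability space (adequate for short toy sequences; real implementations rescale or work in log space), reusing the states, start, trans and emit parameters from the Viterbi sketch above:

def forward(x):
    """f[i][k] = P(x_1..x_i, pi_i = k); also returns P(X) = sum_k f[m][k]."""
    f = [{k: start[k] * emit[k][x[0]] for k in states}]
    for i in range(1, len(x)):
        f.append({k: emit[k][x[i]] * sum(f[i - 1][j] * trans[j][k] for j in states)
                  for k in states})
    return f, sum(f[-1][k] for k in states)

def backward(x):
    """b[i][k] = P(x_{i+1}..x_m | pi_i = k)."""
    m = len(x)
    b = [dict() for _ in range(m)]
    b[m - 1] = {k: 1.0 for k in states}
    for i in range(m - 2, -1, -1):
        b[i] = {k: sum(trans[k][j] * emit[j][x[i + 1]] * b[i + 1][j] for j in states)
                for k in states}
    return b

def posterior(x):
    """P(pi_i = k | X) = f_k(i) * b_k(i) / P(X) for every position i."""
    f, px = forward(x)
    b = backward(x)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(len(x))]

for i, p in enumerate(posterior("315662666612")):
    print(i, {k: round(v, 3) for k, v in p.items()})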

Posterior decoding

Now, to decode, we can't just go position by position and take the max-posterior state, as the resulting path might contradict itself (e.g. contain transitions of probability zero). Instead we define a function of interest g(k) which is 1 for the interesting states and 0 for all others, and the posterior probability G(i | X) = \sum_k P(\pi_i = k | X) \cdot g(k) is the sum of the posterior state probabilities over the interesting states.
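
Continuing the same sketch, with the set of interesting states taken to be {L} (the loaded die):

def G(x, interesting=('L',)):
    """G(i|X): posterior probability that position i is in an interesting state."""
    return [sum(p[k] for k in interesting) for p in posterior(x)]

# positions where the loaded die is the more probable explanation
print([i for i, g in enumerate(G("315662666612")) if g > 0.5])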

HMM parameter estimation – known state sequence

We are given a multiple sequence alignment and output a profile HMM. We determine the match-state positions by some heuristic (e.g. more than 50% non-gap characters in the column). Then we estimate the parameters by counting transitions:
* In a match position: letter = match, gap = deletion
* In a non-match position: letter = insert, gap = nothing
then we estimate each transition probability as a(k \to l) = \frac{cnt(k \to l)}{\sum_{l'} cnt(k \to l')}. But when the training data is small, this leaves many transitions with probability zero. Better to use the Laplace rule: a(k \to l) = \frac{cnt(k \to l) + 1}{\sum_{l'} (cnt(k \to l') + 1)} (a special case of Dirichlet priors / pseudocounts).
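
A small sketch of the transition counting with the Laplace rule; the two labelled state paths below are made up just to exercise the counting (the same idea applies to the match/insert/delete transitions of a profile HMM, and to the emission probabilities):

from collections import Counter, defaultdict

def estimate_transitions(paths, states, pseudocount=1):
    """a(k -> l) = (cnt(k->l) + pseudocount) / sum over l' of (cnt(k->l') + pseudocount)."""
    cnt = defaultdict(Counter)
    for path in paths:
        for k, l in zip(path, path[1:]):      # count observed transitions
            cnt[k][l] += 1
    a = {}
    for k in states:
        total = sum(cnt[k][l] + pseudocount for l in states)
        a[k] = {l: (cnt[k][l] + pseudocount) / total for l in states}
    return a

paths = ["FFFFLLLFF", "FFLLLLFFF"]          # toy labelled training data
print(estimate_transitions(paths, states=['F', 'L']))                 # Laplace rule
print(estimate_transitions(paths, states=['F', 'L'], pseudocount=0))  # raw counts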

  • Q: in slide 17, what does training a profile HMM from unaligned sequences mean? “obtain MA with the resulting profile as before”??

HMM parameter estimation – unknown state sequence

If the states are unknown, we use a version of the EM algorithm (Baum-Welch). We are given multiple sequences. In the E-step, we compute expectations for:

  • E_k(b) – the expected number of times state k emits the character b: the sum of the posterior probabilities P(\pi_i = k | X) over every position i where b appears, over all of the sequences
  • A_{kl} – the expected number of k \to l transitions: the sum of the posterior probabilities of that transition, over every position of every sequence

In the M-step these expected counts are normalized (optionally with pseudocounts) into new estimates of e_k(b) and a_{kl}, and the E/M steps are iterated until the likelihood converges.
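
A rough single-sequence sketch of these expected counts, reusing the forward/backward helpers and the casino parameters from the sketches above, again without rescaling:

from collections import defaultdict

def expected_counts(x):
    """E-step for one sequence: expected emission counts E_k(b) and expected
    transition counts A_kl, accumulated from posterior probabilities."""
    f, px = forward(x)
    b = backward(x)
    E = {k: defaultdict(float) for k in states}
    A = {k: {l: 0.0 for l in states} for k in states}
    for i in range(len(x)):
        for k in states:
            E[k][x[i]] += f[i][k] * b[i][k] / px          # P(pi_i = k | X)
    for i in range(len(x) - 1):
        for k in states:
            for l in states:
                # P(pi_i = k, pi_{i+1} = l | X)
                A[k][l] += f[i][k] * trans[k][l] * emit[l][x[i + 1]] * b[i + 1][l] / px
    return E, A

E, A = expected_counts("315662666612")

Over a training set, these counts are summed over all sequences before the M-step normalization.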

Recitation insights

Viterbi for a string against a profile – The original Viterbi was for a string of length m and a single layer of n states, i.e. O(m \cdot n^2). Now there are L layers (profile columns), each with match/insert/delete states, and the complexity is O(m \cdot L): every character of the string can be matched or inserted at every layer, but each of the 3L states has only a constant number of incoming edges, so each DP cell costs O(1). For every position and layer we compute the best score of ending in M/I/D, according to the incoming edges.
The same goes for computing the forward/backward probabilities (only with sum instead of max…); see the recurrences below.
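
In recurrence form (a sketch in the usual M/I/D notation, in log space; a and e are the profile's transition and emission parameters):

V^M_j(i) = \log e_{M_j}(x_i) + \max\{ V^M_{j-1}(i-1) + \log a_{M_{j-1} \to M_j},\; V^I_{j-1}(i-1) + \log a_{I_{j-1} \to M_j},\; V^D_{j-1}(i-1) + \log a_{D_{j-1} \to M_j} \}
V^I_j(i) = \log e_{I_j}(x_i) + \max\{ V^M_j(i-1) + \log a_{M_j \to I_j},\; V^I_j(i-1) + \log a_{I_j \to I_j},\; V^D_j(i-1) + \log a_{D_j \to I_j} \}
V^D_j(i) = \max\{ V^M_{j-1}(i) + \log a_{M_{j-1} \to D_j},\; V^I_{j-1}(i) + \log a_{I_{j-1} \to D_j},\; V^D_{j-1}(i) + \log a_{D_{j-1} \to D_j} \}   (delete states are silent, so no emission term)

Each of the 3L states has at most three incoming edges, so filling the m \times 3L table is O(m \cdot L); replacing max with a sum (and the log-scores with probabilities) gives the forward recurrence.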

Questions

  1. What's the relation between HMM estimation when the states are known, HMM estimation when the states are unknown, profile HMMs, and EM?
  2. What is training a profile HMM from unaligned sequences? Were they aligned before – the only difference I see is the length of the model…
  3. What's the connection to the previous search techniques we learned (FASTA et al.)?
  4. In the intro to HMM (hmm-2nd-lec, slide 7) they talk about the price of gaps! What do they mean? How does that come into use in the next slides illustrating HMM estimation?

 
