Standard Markov chains
We would like to detect CpG islands in a DNA strand. Inside an island, the probability of a G following a C is higher. We assume the letters form a Markov chain (memory size = 1), with one transition matrix a+ estimated from islands and another a- from non-islands. To decide whether a given short section x comes from an island, we compare the two likelihoods, e.g. via the log-odds score log(P(x | island) / P(x | non-island)) = sum over i of log(a+_{x_{i-1} x_i} / a-_{x_{i-1} x_i}); a positive score favors the island model.
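A minimal sketch of this log-odds scoring; the transition probabilities below are illustrative stand-ins, not values estimated from real annotated data:

```python
from math import log

# Toy transition probabilities for the two first-order Markov chains
# (a_plus = inside islands, a_minus = outside). Illustrative numbers only.
a_plus  = {"AA": 0.18, "AC": 0.27, "AG": 0.43, "AT": 0.12,
           "CA": 0.17, "CC": 0.37, "CG": 0.27, "CT": 0.19,
           "GA": 0.16, "GC": 0.34, "GG": 0.38, "GT": 0.12,
           "TA": 0.08, "TC": 0.36, "TG": 0.38, "TT": 0.18}
a_minus = {"AA": 0.30, "AC": 0.21, "AG": 0.28, "AT": 0.21,
           "CA": 0.32, "CC": 0.30, "CG": 0.08, "CT": 0.30,
           "GA": 0.25, "GC": 0.24, "GG": 0.30, "GT": 0.21,
           "TA": 0.18, "TC": 0.24, "TG": 0.29, "TT": 0.29}

def log_odds(x):
    """Sum of log(a+/a-) over consecutive pairs; positive => island-like."""
    return sum(log(a_plus[x[i-1] + x[i]] / a_minus[x[i-1] + x[i]])
               for i in range(1, len(x)))

print(log_odds("CGCG"))  # CG transitions are enriched inside islands
```

Note the decision depends only on transition ratios, so the score decomposes into a per-pair sum that can be computed in one pass.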
General case – hidden Markov models
Where does a CpG island start and stop in the strand? The standard analogy is an occasionally dishonest casino that sometimes switches to a loaded die and later returns to a fair one. We model a hidden layer of states h with transition matrix a, and X is the emission sequence, where state k emits character b with probability e_k(b). The most probable state path can be found by Viterbi = DP with back pointers. Note that if the string size is m and there are n states, the complexity is O(m*n^2): for each position and each state we consider all possible states at the previous position. It does NOT depend on the alphabet size, since the emissions are known!
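A self-contained Viterbi sketch for the dishonest-casino example; the fair/loaded parameter values are illustrative assumptions:

```python
from math import log

# Occasionally dishonest casino: hidden states F (fair) and L (loaded).
# Transition/emission values are illustrative, not from the lecture slides.
states = ["F", "L"]
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}  # transitions
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.10 for r in "12345"}, "6": 0.50}}            # emissions
start = {"F": 0.5, "L": 0.5}

def viterbi(x):
    """Most probable state path; O(m * n^2) with m = len(x), n = len(states)."""
    V = [{k: log(start[k]) + log(e[k][x[0]]) for k in states}]
    ptr = []
    for c in x[1:]:
        row, back = {}, {}
        for k in states:
            # consider every state at the previous position (the n^2 factor)
            prev, score = max(((j, V[-1][j] + log(a[j][k])) for j in states),
                              key=lambda t: t[1])
            row[k] = score + log(e[k][c])
            back[k] = prev
        V.append(row)
        ptr.append(back)
    # trace the back pointers from the best final state
    k = max(states, key=lambda s: V[-1][s])
    path = [k]
    for back in reversed(ptr):
        k = back[k]
        path.append(k)
    return "".join(reversed(path))

print(viterbi("6661666166"))  # long runs of sixes tend to decode as loaded
```

Working in log space avoids underflow; the alphabet size never appears in the loop, matching the O(m·n²) bound above.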
Posterior state probabilities
We want to calculate P(h_i = k | X). First we compute the joint probability P(h_i = k, X). This factors into the joint probability of the prefix with the state, f_k(i) = P(x_1..x_i, h_i = k), times the conditional probability of the suffix given the state, b_k(i) = P(x_{i+1}..x_m | h_i = k). We find these by the forward and backward algorithms.
- Q: why do we need the backward algorithm — can't we just run the forward algorithm from the other end?
- A: because they compute different quantities: forward computes the joint probability of a prefix together with the state, while for the suffix we need its conditional probability given the state, which is exactly what the backward recursion computes.
We then calculate the probability of the entire X (e.g. by summing the forward values at the last position) and divide by it: P(h_i = k | X) = f_k(i)·b_k(i) / P(X).
Now to decode, we can't simply go position by position and take the max-probability state, as the resulting path might contradict the other positions (e.g. use a transition of probability 0)… Instead we define a function of interest g(k), which is 1 at interesting states and 0 at all other states, and the posterior probability G(i | X) = sum over k of g(k)·P(h_i = k | X), i.e. the sum of posterior state probabilities over the interesting states.
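The forward/backward/posterior computation can be sketched as follows (no scaling or log-space, so only suitable for short sequences; same illustrative casino parameters as above, which are assumptions, not slide values):

```python
# Posterior state probabilities for a toy two-state HMM.
states = ["F", "L"]
a = {"F": {"F": 0.95, "L": 0.05}, "L": {"F": 0.10, "L": 0.90}}
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.10 for r in "12345"}, "6": 0.50}}
start = {"F": 0.5, "L": 0.5}

def posteriors(x):
    m = len(x)
    # forward: f[i][k] = P(x_1..x_i, h_i = k)
    f = [{k: start[k] * e[k][x[0]] for k in states}]
    for c in x[1:]:
        f.append({k: e[k][c] * sum(f[-1][j] * a[j][k] for j in states)
                  for k in states})
    # backward: b[i][k] = P(x_{i+1}..x_m | h_i = k), with b[m-1][k] = 1
    b = [dict.fromkeys(states, 1.0)]
    for c in reversed(x[1:]):
        b.insert(0, {k: sum(a[k][j] * e[j][c] * b[0][j] for j in states)
                     for k in states})
    px = sum(f[-1][k] for k in states)  # P(X)
    return [{k: f[i][k] * b[i][k] / px for k in states} for i in range(m)]

post = posteriors("6615")
print([round(p["L"], 3) for p in post])  # P(loaded | X) at each position
```

At every position the posteriors sum to 1, since f_k(i)·b_k(i) summed over k equals P(X) for any i.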
HMM parameter estimation – known state sequence
We get a multiple sequence alignment as input and output a profile HMM. We determine which columns are match positions by some heuristic (e.g. more than 50% non-gap characters in the column). Then we estimate the parameters by counting transitions:
* In a match position: letter = match, gap = deletion
* In a non-match position: letter = insertion, gap = nothing
Then we estimate each transition probability as a(state->nextState) = A_kl / sum over l' of A_kl', where A_kl is the observed count of k->l transitions. But when the training data is small, this leaves a lot of zero-probability transitions. Better to use Laplace's rule: a(state->nextState) = (A_kl + 1) / (sum over l' of A_kl' + number of states), i.e. add-one pseudocounts (a special case of Dirichlet priors).
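A small sketch of the two estimators on a known state path (the toy path and state names are illustrative):

```python
from collections import Counter

# Maximum-likelihood vs. Laplace-smoothed transition estimates from a
# known state path. Toy example; "F"/"L" are illustrative state names.
path = "FFFLLLLFF"
counts = Counter(zip(path, path[1:]))  # A_kl: observed k->l counts
states = sorted(set(path))

def a_ml(k, l):
    """A_kl / sum_l' A_kl' -- zero for any unseen transition."""
    total = sum(counts[(k, j)] for j in states)
    return counts[(k, l)] / total

def a_laplace(k, l):
    """Add-one pseudocounts (uniform Dirichlet prior): no transition gets 0."""
    total = sum(counts[(k, j)] + 1 for j in states)
    return (counts[(k, l)] + 1) / total

print(a_ml("L", "L"), a_laplace("L", "L"))
```

With 4 transitions out of L (three L->L, one L->F), ML gives 3/4 while Laplace gives (3+1)/(4+2), pulling the estimate toward uniform — the effect shrinks as the counts grow.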
- Q: in slide 17, what does training a profile HMM from unaligned sequences mean? “obtain MA with the resulting profile as before”??
HMM parameter estimation – unknown state sequence
If the state sequence is unknown, we use a version of the EM algorithm (Baum-Welch). We are given multiple sequences. In the E-step, we compute expected counts:
- E_k(b) – the expected number of times state k emits character b: the sum of the posterior probabilities P(h_i = k | X) over every position i where the character b appears, in every one of the sequences
- A_kl – the expected number of k->l transitions: the sum of the posterior probabilities P(h_i = k, h_{i+1} = l | X) over every position in every sequence. The M-step then re-estimates the parameters from these expected counts exactly as in the known-state case.
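One E-step of this procedure can be sketched for a tiny two-state HMM (a minimal sketch with no scaling; all parameter values are illustrative assumptions):

```python
# One Baum-Welch E-step: accumulate expected emission counts E_k(b) and
# expected transition counts A_kl from the forward/backward posteriors.
states = ["F", "L"]
a = {"F": {"F": 0.9, "L": 0.1}, "L": {"F": 0.2, "L": 0.8}}
e = {"F": {r: 1 / 6 for r in "123456"},
     "L": {**{r: 0.10 for r in "12345"}, "6": 0.50}}
start = {"F": 0.5, "L": 0.5}

def e_step(seqs):
    E = {k: {c: 0.0 for c in "123456"} for k in states}
    A = {k: {l: 0.0 for l in states} for k in states}
    for x in seqs:
        # forward table f and backward table b, as in posterior decoding
        f = [{k: start[k] * e[k][x[0]] for k in states}]
        for c in x[1:]:
            f.append({k: e[k][c] * sum(f[-1][j] * a[j][k] for j in states)
                      for k in states})
        b = [dict.fromkeys(states, 1.0)]
        for c in reversed(x[1:]):
            b.insert(0, {k: sum(a[k][j] * e[j][c] * b[0][j] for j in states)
                         for k in states})
        px = sum(f[-1][k] for k in states)
        for i, c in enumerate(x):
            for k in states:              # E_k(b) += P(h_i = k | X)
                E[k][c] += f[i][k] * b[i][k] / px
        for i in range(len(x) - 1):
            for k in states:              # A_kl += P(h_i = k, h_{i+1} = l | X)
                for l in states:
                    A[k][l] += (f[i][k] * a[k][l] * e[l][x[i + 1]]
                                * b[i + 1][l] / px)
    return E, A

E, A = e_step(["6664", "162"])
# M-step would renormalize: a[k][l] = A[k][l] / sum over l' of A[k][l']
```

Sanity check: the expected counts total exactly what hard counts would — one transition per adjacent pair and one emission per position, just spread fractionally across states.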
Viterbi for a string against a profile – the original Viterbi shown was for a string of size m against a single layer of n states = O(m*n^2). In a profile HMM there are L layers (a match/insert/delete triple per profile position), and the complexity is O(m*L): every character can be aligned against every layer, but each cell has only a constant number of incoming edges (the M/I/D states of the adjacent layer), not all states. We compute, for every character and every layer, the best probability of being in I/D/M there, according to the incoming edges.
The same recurrences give the forward/backward probabilities (only with sums instead of max…)
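In the standard textbook formulation of profile HMMs (notation may differ from the course slides), the Viterbi recurrences for layer j and string position i are:

```latex
V^{M}_{j}(i) = \log e_{M_j}(x_i) + \max
\begin{cases}
  V^{M}_{j-1}(i-1) + \log a_{M_{j-1} M_j} \\
  V^{I}_{j-1}(i-1) + \log a_{I_{j-1} M_j} \\
  V^{D}_{j-1}(i-1) + \log a_{D_{j-1} M_j}
\end{cases}
\qquad
V^{I}_{j}(i) = \log e_{I_j}(x_i) + \max
\begin{cases}
  V^{M}_{j}(i-1) + \log a_{M_j I_j} \\
  V^{I}_{j}(i-1) + \log a_{I_j I_j} \\
  V^{D}_{j}(i-1) + \log a_{D_j I_j}
\end{cases}
\qquad
V^{D}_{j}(i) = \max
\begin{cases}
  V^{M}_{j-1}(i) + \log a_{M_{j-1} D_j} \\
  V^{I}_{j-1}(i) + \log a_{I_{j-1} D_j} \\
  V^{D}_{j-1}(i) + \log a_{D_{j-1} D_j}
\end{cases}
```

Insert states stay in the same layer j (they consume a character but don't advance the profile), and delete states stay at the same character index i (they advance the profile without consuming a character). Each cell has at most three predecessors, which is why the total work is O(m·L) rather than O(m·n²).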
- what's the relation between HMM estimation when the states are known, HMM estimation when the states are unknown, profile HMMs, and EM?
- What does training a profile HMM from unaligned sequences mean? Were they aligned before? The only difference I see is the length of the model…
- what's the connection to the search techniques we learned previously? (FASTA et al.)
- In the intro to HMMs (hmm-2nd-lec slide 7) they talk about the price of gaps! What do they mean, and how is that used in the following slides illustrating HMM estimation?