# Computational Genomics 7 – EM

## Intro

We get multiple sets of coin tosses, originating from 2 coins with unknown parameters. X = number of heads in each set, Y = which coin was used? (hidden). If we knew the Y assignments, we could simply find the ML solution maximizing $P(x,y|theta)$ which is the fraction of H per coin. But we don’t. So we can guess, find the most likely coin $P(y|x,\theta)$, update our guess and continue till convergence. EM is working with probabilities instead of the most likely coin…

## Probabilistic setting

We have a mixture of two Gaussians with known equal variance, and would like to find the maximum likelihood parameters: $\theta = p, \mu_1, \mu_2$, and color each dot in blue/red accordingly.

KL divergence $D(P || Q) = \sum{P(x)log\frac{P(x)}{Q(x)}}$. This measures how unlikely it was to get P if our null hypotheses was Q. It is always >=0.

The EM algorithm for heights tries to maximize $log P(X=180|\theta) = log \sum{P(X=180,Y=y|\theta)}$. We now try to express the probability for X=180 with relation to the different options for Y (M,F). From Bayes rule we have: $P(X=180|\theta) = P(X=180, Y=M | \theta) / P(Y=M|X=180, \theta)$.

We can show that maximizing the difference from the current parameters is equal to maximizing this: $\Delta = Q(\theta | \theta^t) - Q(\theta^t | \theta^t) + KL(P(y|x,\theta^t) || P(y|x,\theta))$

Where the KL divergence part is almost non-negative for everything we choose. So the important part is: $Q(\theta | \theta^t) = \sum_y P(y|x, \theta^t)log P(x,y|\theta)$

• Q: in slide 14, why do we need the math gymnastics with y_ij?

This is an expectation over the distribution of y (conditioned on current parameters and x) of the complete log-likelihood. We sum this over all possible x_i and y_j, find the roots of the derivatives and that’s it.

## EM for HMM – Baum-Welch

The Q function for HMM is an expectation over the distribution of $\pi$ of the complete log-likelihood. This is a multiplication of

1. the emission probability of “b” in state “k”, to the power of the number of times we saw that
2. the transition probability from “k” to “l”, to the power of the number of time we saw that

so we need the maximum expectations of those 2 things. The forward-backward algorithm gives us the joint probability of a transition in a specific position “i”. If we sum over all possible positions and divide by P(X) we get the conditional expectation

• Q: in slide 21, why do we get the conditional expectation by summing over all positions? and why is the specified weights the ones maximizing the expectations?
• Q: what is slide 22 showing us? is this the answer to the previous question?