## Intro

We get multiple sets of coin tosses, generated by 2 coins with unknown head probabilities. X = number of heads in each set, Y = which coin was used (hidden). If we knew the Y assignments, we could simply find the ML solution by maximizing the likelihood per coin, which gives the fraction of H for each coin. But we don't. So we can guess parameters, assign each set to its **most likely** coin, update our guess, and continue till convergence. EM does the same, but works with probabilities (soft assignments) instead of committing to the most likely coin…
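The loop above can be sketched in code. This is a minimal soft-EM version for the two-coin setting, assuming a uniform prior over which coin was used and the illustrative starting guess `p = (0.6, 0.5)` (both are assumptions, not from the notes):

```python
import numpy as np

# Hedged sketch: EM for two coins with unknown head probabilities.
# Each entry of `heads`/`tosses` is one set of flips; the coin identity is hidden.
def em_two_coins(heads, tosses, p=(0.6, 0.5), iters=50):
    p = np.array(p, dtype=float)
    for _ in range(iters):
        # E-step: posterior responsibility of each coin for each set
        # (uniform prior over coins; the binomial coefficient cancels)
        like = (p[None, :] ** heads[:, None]) * \
               ((1 - p[None, :]) ** (tosses - heads)[:, None])
        resp = like / like.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted fraction of heads per coin
        p = (resp * heads[:, None]).sum(axis=0) / \
            (resp * tosses[:, None]).sum(axis=0)
    return p
```

Replacing the soft `resp` with a 0/1 assignment to the most likely coin would give the "hard" guess-and-update variant described first.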

## Probabilistic setting

We have a mixture of two Gaussians with known, equal variance, and would like to find the maximum-likelihood parameters (the two means), and color each dot in blue/red accordingly.

**KL divergence** – $D_{KL}(P \,\|\, Q) = \sum_x P(x) \log \frac{P(x)}{Q(x)}$. This measures how unlikely it was to get P if our null hypothesis was Q. It is always $\ge 0$.
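As a quick numeric check of the definition and its properties (non-negative, zero only when P = Q, and asymmetric), a small sketch:

```python
import numpy as np

# D(P || Q) = sum_x P(x) * log(P(x) / Q(x)) over a discrete distribution.
def kl_divergence(p, q):
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))
```

Note that `kl_divergence(p, q)` and `kl_divergence(q, p)` generally differ, so KL is not a distance metric.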

The EM algorithm for heights tries to maximize the log-likelihood $\log P(X \mid \theta)$. We now try to express the probability for X=180 with relation to the different options for Y (M, F). From Bayes' rule we have:

$$P(Y = M \mid X = 180) = \frac{P(X = 180 \mid Y = M)\, P(Y = M)}{P(X = 180)}.$$
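This posterior is easy to compute directly. A sketch with made-up illustrative parameters (male heights $\sim N(178, 7^2)$, female $\sim N(165, 7^2)$, equal priors — all of these numbers are assumptions, not from the notes):

```python
import math

def normal_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

# Hedged sketch: Bayes rule for P(Y=M | X=x) under illustrative parameters.
def posterior_male(x, mu_m=178.0, mu_f=165.0, sigma=7.0, prior_m=0.5):
    pm = normal_pdf(x, mu_m, sigma) * prior_m        # P(x | M) P(M)
    pf = normal_pdf(x, mu_f, sigma) * (1 - prior_m)  # P(x | F) P(F)
    return pm / (pm + pf)                            # denominator is P(x)
```

With equal priors and equal variances the posterior crosses 0.5 exactly at the midpoint between the two means.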

We can show that maximizing the improvement over the current parameters $\theta^t$ is equal to maximizing this:

$$\big[Q(\theta \mid \theta^t) - Q(\theta^t \mid \theta^t)\big] + D_{KL}\big(P(Y \mid X, \theta^t) \,\|\, P(Y \mid X, \theta)\big)$$

where the KL-divergence part is non-negative for anything we choose. So the important part is:

$$Q(\theta \mid \theta^t) = \mathbb{E}_{Y \sim P(Y \mid X,\, \theta^t)}\big[\log P(X, Y \mid \theta)\big]$$

- Q: in slide 14, why do we need the math gymnastics with y_ij?

This is an expectation over the distribution of y (conditioned on the current parameters and x) of the complete log-likelihood. We sum this over all data points x_i and all possible values y_j, find the roots of the derivatives, and that's it.
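For the two-Gaussian height mixture, setting the derivative of this expected complete log-likelihood to zero gives responsibility-weighted means. A minimal sketch of one EM iteration, assuming known shared variance and fixed equal priors (illustrative values):

```python
import numpy as np

# Hedged sketch: one EM iteration for a two-Gaussian mixture with known,
# shared variance. The M-step solves dQ/dmu_j = sum_i q_ij (x_i - mu_j) / sigma^2 = 0.
def em_step(x, mu, sigma=7.0, priors=(0.5, 0.5)):
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    # E-step: q_ij = P(Y=j | x_i, current mu); shared normalizers cancel
    like = np.exp(-0.5 * ((x[:, None] - mu[None, :]) / sigma) ** 2) * np.asarray(priors)
    q = like / like.sum(axis=1, keepdims=True)
    # M-step: responsibility-weighted mean per component
    return (q * x[:, None]).sum(axis=0) / q.sum(axis=0)
```

Iterating `em_step` to convergence recovers means close to the per-cluster averages when the clusters are well separated.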

## EM for HMM – Baum-Welch

The Q function for HMM is an expectation, over the distribution of hidden state paths (conditioned on X and the current parameters), of the complete log-likelihood. The complete likelihood is a multiplication of

- the emission probability of "b" in state "k", to the power of the number of times we saw that
- the transition probability from "k" to "l", to the power of the number of times we saw that

so we need the expected counts of those 2 things. The forward-backward algorithm gives us the joint probability of a specific transition at position "i" (together with X). If we sum over all possible positions and divide by P(X), we get the conditional expectation.
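The expected-transition-count computation can be sketched directly from forward-backward. This is a minimal, non-log-space version (so it assumes short sequences where underflow is not an issue); `A` are transition probabilities, `E` emission probabilities, `pi0` the initial distribution:

```python
import numpy as np

# Hedged sketch: expected k->l transition counts for Baum-Welch.
# For each position i: P(pi_i=k, pi_{i+1}=l, X) = f[i,k] * A[k,l] * E[l, x[i+1]] * b[i+1,l].
# Summing over i and dividing by P(X) gives the conditional expected counts.
def expected_transitions(x, A, E, pi0):
    n, K = len(x), A.shape[0]
    f = np.zeros((n, K))   # forward:  f[i,k] = P(x[0..i], pi_i = k)
    b = np.zeros((n, K))   # backward: b[i,k] = P(x[i+1..] | pi_i = k)
    f[0] = pi0 * E[:, x[0]]
    for i in range(1, n):
        f[i] = (f[i - 1] @ A) * E[:, x[i]]
    b[-1] = 1.0
    for i in range(n - 2, -1, -1):
        b[i] = A @ (E[:, x[i + 1]] * b[i + 1])
    px = f[-1].sum()       # P(X)
    counts = np.zeros((K, K))
    for i in range(n - 1):
        # outer(...)[k,l] = f[i,k] * E[l, x[i+1]] * b[i+1,l]; multiply by A elementwise
        counts += np.outer(f[i], E[:, x[i + 1]] * b[i + 1]) * A
    return counts / px
```

A useful sanity check: the expected counts over all state pairs must sum to n-1, the total number of transitions in the sequence.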

- Q: in slide 21, why do we get the conditional expectation by summing over all positions? and why are the specified weights the ones maximizing the expectations?
- Q: what is slide 22 showing us? is this the answer to the previous question?