RL 2

Bannach fixed point

operator T is a contraction map if when we apply it to x and y then the distance between T(x) and T(y) is smaller than the original distance between x and y.

The theorem states that when we repeatedly apply such an operator T we converge to a fixed point x*.

Value based methods

we examine a policy pi, and we define the value of each state V function: expected reward of starting from that state and following pi. V satisfies a recursive formula.

Q is the state-action function: the expected reward of starting in s, taking action a and then following pi.

We can calculate V by continuously applying the recursive V formula on the previous estimate of V. The formula is contracting, so we will finally get to a V that satisfies the recursive formula, so it must be V*.

The same can be done for Q

Optimal V and Q

V* and Q* are related. as V* is the maximum of all policies we can express each policy as a policy that takes a at its first step and then follow an arbitrary policy, which shows that V* is max(a, Q*) (2.3)

Bellman optimality: V* and Q* satisfy the following equations –

Q* of starting from s and following any action a is r(s,a) + expectation of discounted V* by probability of the state s’ we’ll get to. (2.5)

Together, (2.3) and (2.5) gives us that V*(s) is the max action of reward + expected V*(s’)

(2.4) the optimal policy is following the maximal Q*. This is shown as for solving the recursive formula of every V, V* with its companion Q* must be from a policy that selects the maximal Q*…





RL 1


we are trying to learn a good policy without worrying about estimating all the parameters of the model. We want to maximize L = the expected reward for a finite horizon of states-actions path of length T.

The gradient of L is hard to calculate, but we use a log-derivative-trick which allows us to express the gradient as well as  an expectation over over samples of state-actions paths with length T! We can now empirically estimate this expectation.


But there will be variance and we must minimize it. We can shift each component of the gradient \theta_i by a constant. It won’t change the expectancy, since it’s own expectancy is 0, but it can minimize the variance which is a good thing.


Another reduction in variance can be achieved. The policy gradient can be written as a double sum over the reward step k and the action time t. We can remove the terms where k < t intuitively as the reward was already given so it does not affect future errors.

Computational Genomics 12 -NGS

NGS – the race towards cheap and quick sequencing.


We have a reference genome, and some short reads. Instead of using a suffix tree which costs a lot of space, use hash for small chunks, and align each read from there. We also estimate the mapping quality by dividing the probability of fitting at “u” to the sum of probabilities fitting anywhere else.

Burrows-wheeler transform

Take a text of length L. do all cyclic transformations, and then sort it alphabetically, then read the last column. It maps to a new text of length L.

The value_counts stay the same. We can recover the first column as the sorting is lexicographic. Now because its cyclic, we can recover all the pairs of 2! and now we we can recover the second column, as its also lexicographically sorted.

Last-first mapping

The internal order of the “a” characters in the last column is the same order as in the first column. why? because they are ordered by their next character (which is in the first), which is in the second column as well – determining their order in the first column…

Using this lemma we can reconstruct the text! we take the character from F with a corresponding $ in L, and from there we can start cycling thru all characters.

Searching the BWT index

We search from the end of the pattern backwards, similar to reconstructing, only with uncertainty in which of the rows of the first column are we.

Relation with suffix array and finding indexes of match

The BWT is ordered just as the suffix array. We save the index only each 100 entries. We continue to go backward thru the pointers until we reach an indexed location.

Sequence assembly – no reference genome

de-Bruijn – build vertices for all possible k-mers (if the sequence length is ~100 then k would be ~30). Build edge from each vertex to other vertices who’s prefix match the source suffix. A sequence matches a path in this graph.

How to merge 2 short paths with overlaps to a long single paths?


oy lo.


Computational Genomics 10 – Suffix trees

Suffix trees allow for efficient multi-query. We pre-process the sequence, and build a suffix tree, which is a compressed trie. After we have that, a pattern P matches S iff it is a prefix of some suffix of S. The naive way to build a suffix tree is adding the suffixes one at a time – O(m^2).

UA (Ukkonen’s algorithm)

UA (Ukkonen’s algorithm) builds successive implicit suffix trees. At each stage we extend and add s[i+1] to s[j..i]. This is done by the extension rules:

  1. S[i] is leaf – just add a leaf under it
  2. Following S[i] is some other u, need to create a new internal node and a new leaf
  3. S[j..i+1] is already at the trie – do nothing

The naive way to do this would be O(m^3), but the using the following tricks we can improve to O(m):

  1. suffix links – as in the i phase we add the suffixes from the longest [1..i] to the shortest [i-1,i], we can build suffix links from xa to a. Then, we can use them by going up with them, them coming down for the size of the prefix!
  2. skip and count – after jumping with the suffix link, we have to walk down. We can count the number of letters on each edge, and thus keep the walking down proportional to the number of nodes, not letters. It can be shown that during a phase, we go to some node, then decrement a maximum of 2m with the up walks and suffix links, so we go down a maximum of 3m – as the deepest node is at depth m. -> overall we’re at O(m^2)
  3. rule 3 is a stopper – if a string is already in the tree, then all of its suffixes are also in the tree, we can move on to the next phase
  4. once a leaf, always a leaf – a leaf remains a leaf, it just gets added new characters. We can store indexes instead of characters, and mark the last index with the special symbol “e”. That way we can extend all the leafs from the previous phase in O(1).

Implementation issues

for each node, there’s a choice of how to represent the outgoing edges: an array, linked list , balanced tree, hash or mixture of the correct choice per node.


  1. find exact match – every node in the subtree are an occurrence
  2. generalized suffix trees – for multiple strings
    1. longest common substring – node with both type of descendants
    2. LCA of two suffixes (leaves) represent their LCP – we can preproces and answer such queries in constant time
    3. maximal palindrome – insert S and Sr, for every i find the LCP of the corresponding suffixes of S and Sr

Suffix array

All suffixes sorted lexicographic order. We build a ST and DFS traverse it. We search for a pattern with a binary search.

Computational Genomics 9 – clustering

Clustering motivation

there’s a matrix with genes in the rows, samples in the columns. We can either

  1. cluster samples – which leads to class discovery, by measuring relevance to phenotype = clinical outcomes (Kaplan-Meier)
  2. cluster genes – groups of co-regulated genes, then we test functional enrichment = guilty by association

Q: what does the above mean?

Hierarchical clustering – Average linkage

agglomerative instead of divisive. start with a cluster for each item, in each step merge the closest clusters r and s. The distance to all other clusters is a weighted average of the previous distances from r and s. We can also use min/max instead of average.

Non-hierarchical clustering – K-means

iterative use of EM.


mates are a pair of points from the same cluster. Similarity can be a dot product between those 2 points. We expect the similarity between mates to be higher

than the non-mates similarity. The main idea is to look for a good cut to 2 clusters, which turns out to be the min-cut in the similarity graph! Note that sometimes we

don’t use the whole graph, we remove edges which have very low weight.

We sum over all the edges in the cut and try to decide between H0 = all the edges in the cut are non-mates, and H1 – all edges are mates. Kernel is when all possible

cuts are H1. If even for the min-cut H1 is more probable, stop – we have reached a Kernel. Else, let’s cut by the min-cut!

The initial parameters are retrieved by EM. A lemma we had to prove is that the weight of any cut is the probability of the similarities of the weights in the cut

under H1 divided by the probability under H0.

Recitation 9 – clustering

min-max intra-cluster dist – pay the price of the largest distance inside any cluster. The intuition is – make sure that the furthest points are in different clusters.
In every iteration the distance to head only gets better. Delta = the worst distance at the end, so there were k+1 points worse than delta. So if we partition to k clusters, the optimal solution will be worse than delta, as it will have 2 points in the same cluster.

By the triangle inequality, our solution would be 2*delta – point to head and then to point.


Computational Genomics 8 – Gene finding and motif analysis

Modeling ORFs

Find ORF = open reading frames in a long DNA sequence. We try to build a Markov model where the hidden states are nucleotides to discern Gene mode from NORF mode, but the variance is too big. When we use 3-letter combinations of nucleotides (codons), it’s more significant. This makes for a 64-state HMM. We choose the reading frame (indentation) using the frequencies of all codons that appear in it.

  • Q: slide 14, what do they mean by prefwindow?
  • A: this slides shows us the cumulative probability on a window of 25 codons for the decoding. Note that outside ORF the codons do not translate to proteins, so we don’t measure their probability. Also note that only in reading frame1, inside an ORF the probability rises notably! So this must be the correct reading frame


A promoter is a sequence that encourages transcription of following gene – it calls the RNA polymerase.

A PWM (positional weight matrix) is a matrix with the frequency of each base per position for the promoter, in the non-coding segment. We can score each sequence assuming its a promoter and assuming its not a promoter, and divide the 2 results for a log-likelihood ratio score.

Splicing – removal of introns, happens at the RNA level.

  • Q: slide 22, what do they mean by RNA-RNA base pairing?
  • The splicing (removal of introns) is done after the RNA is composed, by translating it into a new RNA by the spliceosome complex.

Intron/Exon length distribution

we can not use HMM as its memory-less, so can only model geometric distribution of length, and exons do not behave like that. So, we use a generalized HMM = GenScan. Output of each state can have a different distribution. We also generalize PWM to Weighted Array Model, where the distribution in each position can be dependent on other positions.

Spliced alignment

if we have a spliced mRNA, align it to the DNA.

  • Q: What are we seeing in slides 39-41?
  • A: These are short segments, spread all over the RNA,

Regulatory sequence analysis

The promoter is several BS (binding sites) in the DNA that are bound by several proteins called TF (transcription factor) that is regulating the gene’s expression. If several genes are co-expressed we assume they have common BSs!

The length we assume for the promoter is typically ~2Kbp, and we look in both strands upstream the transcription start site (TSS).

Models for BS

  1. exact string
  2. 1 mismatch
  3. degenerate string – several options per position
  4. PWM

Techniques for detecting BS

  1. In vitro – protein binding microarrys – generate a DNA with all possible k-mers. detect the TF binding to each of them
  2. In vivo – CHIP – use antibodies as fishing rods for the wanted BS
  • Q: in slide 53 – I didn’t understand the method… are we looking for BS or are we looking for TF?
  • A: both…



suppose we know the motif. We want to see if its over-represented in co-regulated genes. We set the threshold for our PWM s.t. it will fish 5% hits of random sequences generated with a HMM of the background probabilities.

For each promoter, test how much it got and how much from the target set and then calculate of getting such or more extreme result by chance (With the hyper-geometric distribution). We can also find co-occurrence using the same technique.

Motif discovery de-novo – MEME

We assume ZOOPS – Zero or One Per Sequence. Also we assume uniform probability of \lambda = \delta / m in each position in each sequence. The hidden data is the PWM matrix. Then we run EM inference.

Recitation 8 – spliced alignment

We do DP. S(i,j,k) defined when i is inside exon k. Everything goes as usual, only that in positions which are first positions of an exon, we go backward to any of the last positions of any previous exon. A trick to save more time – maintain P(i,j) the best chain that ends before i. This costs mn + mN, as we go over all the m positions for j, and either we go over exons that end in i (N), or simply over all i (n).

Computational Genomics 7 – EM


We get multiple sets of coin tosses, originating from 2 coins with unknown parameters. X = number of heads in each set, Y = which coin was used? (hidden). If we knew the Y assignments, we could simply find the ML solution maximizing P(x,y|theta) which is the fraction of H per coin. But we don’t. So we can guess, find the most likely coin P(y|x,\theta), update our guess and continue till convergence. EM is working with probabilities instead of the most likely coin…

Probabilistic setting

We have a mixture of two Gaussians with known equal variance, and would like to find the maximum likelihood parameters: \theta = p, \mu_1, \mu_2, and color each dot in blue/red accordingly.

KL divergence D(P || Q) = \sum{P(x)log\frac{P(x)}{Q(x)}}. This measures how unlikely it was to get P if our null hypotheses was Q. It is always >=0.

The EM algorithm for heights tries to maximize log P(X=180|\theta) = log \sum{P(X=180,Y=y|\theta)}. We now try to express the probability for X=180 with relation to the different options for Y (M,F). From Bayes rule we have:

P(X=180|\theta) = P(X=180, Y=M | \theta) / P(Y=M|X=180, \theta).

We can show that maximizing the difference from the current parameters is equal to maximizing this:

\Delta = Q(\theta | \theta^t) - Q(\theta^t | \theta^t) + KL(P(y|x,\theta^t) || P(y|x,\theta))

Where the KL divergence part is almost non-negative for everything we choose. So the important part is:

Q(\theta | \theta^t) = \sum_y P(y|x, \theta^t)log P(x,y|\theta)

  • Q: in slide 14, why do we need the math gymnastics with y_ij?

This is an expectation over the distribution of y (conditioned on current parameters and x) of the complete log-likelihood. We sum this over all possible x_i and y_j, find the roots of the derivatives and that’s it.

EM for HMM – Baum-Welch

The Q function for HMM is an expectation over the distribution of \pi of the complete log-likelihood. This is a multiplication of

  1. the emission probability of “b” in state “k”, to the power of the number of times we saw that
  2. the transition probability from “k” to “l”, to the power of the number of time we saw that

so we need the maximum expectations of those 2 things. The forward-backward algorithm gives us the joint probability of a transition in a specific position “i”. If we sum over all possible positions and divide by P(X) we get the conditional expectation

  • Q: in slide 21, why do we get the conditional expectation by summing over all positions? and why is the specified weights the ones maximizing the expectations?
  • Q: what is slide 22 showing us? is this the answer to the previous question?



Computational genomics 6 – HMM

Standard Markov chains

We would like to detect CpG islands in a DNA strand. Inside the island, the probability for a G following a C is higher. We assume the letters are a Markovian process (memory size = 1). For determining the likelihood of a given small section to be from an island or not, we calculate the two likelihoods P(X) = P(x_1)  \prod{a_{x(i-1)->x(i)}}.

General case – hidden Markov models

When does a CpG island starts and stops in the strand? The analogy used is an occasionally unfair casino that switches to using a loaded die sometimes and then returns to use a normal die. We model a hidden layer of states \pi_i with transition matrix a, and X is the emission with probability e_{\pi_i}(x_i). The most probable path can be found by Viterbi = DP with back pointers. Note that if the string size is m, and there are n states, the complexity is O(m*n^2), as for each state we consider all states from the previous state, and DOES NOT relate to the alphabet size, as the emissions are known!

Posterior state probabilities

We want to calculate P(\pi_i = k | X). We want to compute the joint probability of X, \pi_i = k. This is equal to probability of the prefix joint with \pi_i = k and the suffix joint with that again. We find these by forward and backward algorithms

  • Q: why do we need the backward alg, and can’t move forward from \pi_i?
  • A: because backward is calculating the joint probability and forward is calculating a conditional probability.

We then calculate the entire X and divide by that: P(\pi_i = k | X) = P(\pi_i = k, X) / P(X) = P(pref(X), \pi_i = k) * P(suf(X) | \pi_i = k)/ P(X) .

Posterior decoding

Now to decode, we cant go position by position and take the max prob, as it might contradict the other positions… We define a function of interest which is 1 at interesting states and 0 at all other states, and the posterior probability G(i|X) is the sum of posterior state probabilities for the interesting states.

HMM parameter estimation – known state sequence

We get a multiple sequence alignment and output an HMM model. We determine the match state positions by some heuristic (e.g. more than 50% non-gaps in the position). Then we estimate the parameters by counting transitions:
* In a match position: letter = match, gap = deletion
* In a no-match position: letter = insert, gap = nothing
then we estimate for each transition a(state->nextState) = \frac{cnt(state->nextState)}{\sum{cnt(state->otherState)}} . But when the training data is small, this leaves a lot of empty transitions. Better to use Laplace rule: a(state->nextState) = \frac{cnt(state->nextState) + 1}{\sum{cnt(state->otherState) + 1}} (aka Dirichlet priors).

  • Q: in slide 17, what does training a profile HMM from unaligned sequences mean? “obtain MA with the resulting profile as before”??

HMM parameter estimation – unknown state sequence

If the states are unknown, we use a version of the EM algorithm (Baum-Welch). We are given multiple sequences. In the E-step, we compute expectations for:

  • Ek(b) – the conditional probability of h(i) = k in every position the character b appears, in every one of the sequences
  • Akl – the conditional probability for the transition to appear in every position on every sequence

Recitation insights

Viterbi for string against a profile – The original Viterbi shown was good for when there’s a string of size m and a single layer of n states = O(m*n^2). Now there are L layers, and it’s a O(m*L) – because now every character can be from every layer. We estimate the probability of every character, to be I/D/M for every layer depending on the ingoing edges.
The same goes for estimating forward/backward probabilities (only with sum instead of max…)


  1. what’s the relation between HMM estimation when state is known, HMM estimation when states are unknown, profile HMM, EM?
  2. What is training a profile HMM from unaligned sequences? were they aligned before – the only difference I see is the length of the model…
  3. whats the connection to previous search techniques we learned? (FASTA et al)
  4. In the intro to HMM (hmm-2nd-lec-slide 7) they talk about the price of gaps! what do they mean? how does that come to use in the next slides illustrating hmm estimation?


Computational Genomics 5 – Phylogeny

Distance based methods

We are given a distance matrix, and would like to build a tree where the SSQ of the differences between (true distance – tree distance) is minimal (with possible weighting).


We repeatedly merge the closest clusters. We start where each string is its own 1-sized cluster. Naively, at each stage you need to recalculate the new formed cluster distance with all other clusters O(N), and there are O(N) stages -> not sure why this is O(n^3)???

A major setback of UPGMA is that it assumes ultrametry – all leafs are the same distance from the root = common uniform molecular clock.

Neighbor joining

There’s a method for detecting the neighbors in a tree. We join neighbors, and recalculate the new node distance to all other leafs using additivity.

Character based methods

Characters (sites) are independent. Small parsimony – given a tree, strings in leafs – find strings in inner nodes to minimize changes. Large parsimony – find the best tree.

Fitch alg for small parsimony – scan the tree post-order and prepare list of characters for the site. Whenever possible, do S_l \cap S_r – if not possible do S_l \cup S_r. Now for the solution generation do a pre-order and select characters from each node set S – whenever possible choose the same character as for the parent – if not, select a random character.

Weighted small parsimony – in the first step, prepare a cost for each state X node:

S_k(v) = min_m[S_m(l)+C_{mk}] + min_m[S_m(r)+C_{mk}]

In the exercise, we proved that assuming the cost is a metric (distance like properties) you can move the root and still get the same tree cost. TBD: why did we actually deleted the node and created a new one? Isn’t the question about just redefining another node as the root?

Large parsimony – NP-hard, but there are heuristics..

Probabilistic approaches

We want to find labeling for internal nodes + branch lengths. We assume that each site is independent + the branching is Markovian. For each assignment of letters per internal node – the likelihood is according to Bayes law a product of all  conditional probabilities. We can then use DP to post-order traverse and get likelihood of every node given likelihood of its children.

Because of additivity and symmetry we can move the root anywhere. So to find optimal edge length we move the root to one of its vertices, and do standard optimization on the log-likelihood of the tree. We can then rotate between all edges.

Time models – Jukes-Cantor

To figure out the time of changes in DNA, we assume that switching to another letter is 3 times as likely as staying in the same letter for an infinitesimal period of time…

Also, turns out that you can reverse the likelihood in a Bayes like fashion: p(ancestor)*p(bird|ancestor) = p(bird)*p(ancestor|bird)

And you can also add probabilities: p(dinosaur|ancestor,t)*p(ancestor|bird,t) = p(dinosaur|bird,2t)


Computational genomics 4 -multiple sequence alignment

There are 3 scores used for measuring multiple sequence (strings) alignment:

  1. SOP – sum of pairs distances
  2. Consensus – the minimum distance from an external string
  3. Tree – given an evolutionary tree, sum the distances on its paths


note that the distance between pairs is after the strings are indented according to the MSA. The DP solution for 2 strings can be extended to k strings, building a n^k cube, with 2^k calculations in each cell = impractical.

There’s a nice pruning trick we can use in our DP – the MSA solution will always be at least as bad as the sum of pairs. We can efficiently compute the pairwise distance for every suffix of every pair, and if the current solution + the sum of all pairwise suffixes is already worse than a known solution, stop.

Center star alg

  1. Find the string S^* that minimizes SOP.
  2. Iteratively match other strings to it, and add spaces to all strings where spaces were added to S^*

We can show that this alg achieves approximation ratio of 2 for SOP of its aligned strings:

due to triangle inequality and symmetry: the SOP for the star alignment is less than 2K sum of all pairs with the aligned center, which is 2K sum of all pairs with the original center as \delta(-,-) = 0, which means spaces don’t matter.

And the SOP for OPT is greater than original strings SOP, which is bigger than K original sum of pairs with center (that’s how we selected the center).

Consensus error – NO RELATION TO MA

Consensus error is the sum of pairs to S which is a string external to our set. The string which minimizes that sum is called the optimal steiner string = minimizes \sum{D(S^*, S_i)}.

Turns out that the steiner string is 2-approximated by the center string! This basically has to do with the fact that if you build an MA from a star around a steiner string, the consensus alignment error for it would be equal to the sum of global alignment errors – see this so question . So both the consensus error and the alignment error would be minimal around that wonderful steiner string.

Opt Consensus alignment error – MA

Back to MA – once we have an alignment, we can build an optimal consensus string which minimizes \sum{d(S_{con}, S'i)}. Surprisingly, it turns out that opt steiner string = optimal consensus string!

If we examine the optimal consensus alignment error for the MA of the center star alg, it is less than the alignment with the center string, which is the same as the consensus error with the center string (why?!) – which is a 2 approximation to the steiner string.

=> The center star MA is also 2 approximation for the optimal consensus alignment error!

Phylogenetic = Tree

The tree is given, with a string in each leaf. We would like to assign a string for every internal node s.t. the cost along the edges would be minimal. The lifted alignment alg can achieve 2-approximation:

We assume we know the optimal tree alignment T^*. We traverse post-order and in each vertex with string S^* we select the child S_j who is closest to S^*.

Now consider another child S_i in the lifted tree. D(S_j, S_i) < D(S_j, S^*) + D(S_i, S^*) < 2D(S_i, S^*). We don’t know what was the string assigned to that child in the optimal tree, but we do know that the entire path from S_i to S^* was at least D(S_i, S^*) – due to triangle inequality. So this edge cost is less than 2 times the entire path in the optimal tree. We can get such disjoint paths for every non-zero edge in the lifted tree, and if we sum up everything we get the proof.

The DP itself

for the DP, each node will remember the optimal distance of all its children if it is assigned any of its child strings. For a new node (post-order), we consider each string S and for each child we choose the T that minimizes the cost.

MA – greedy heuristics and progressive alignments

We develop DP to work not only with letters, but also with profiles = distribution over the letters. The scoring function between 2 profiles can be as simple as an expectation over the original scoring function. Now we can build a guide tree and add more and more strings to the MA.