RL 2

Bannach fixed point

operator T is a contraction map if when we apply it to x and y then the distance between T(x) and T(y) is smaller than the original distance between x and y.

The theorem states that when we repeatedly apply such an operator T we converge to a fixed point x*.

Value based methods

we examine a policy pi, and we define the value of each state V function: expected reward of starting from that state and following pi. V satisfies a recursive formula.

Q is the state-action function: the expected reward of starting in s, taking action a and then following pi.

We can calculate V by continuously applying the recursive V formula on the previous estimate of V. The formula is contracting, so we will finally get to a V that satisfies the recursive formula, so it must be V*.

The same can be done for Q

Optimal V and Q

V* and Q* are related. as V* is the maximum of all policies we can express each policy as a policy that takes a at its first step and then follow an arbitrary policy, which shows that V* is max(a, Q*) (2.3)

Bellman optimality: V* and Q* satisfy the following equations –

Q* of starting from s and following any action a is r(s,a) + expectation of discounted V* by probability of the state s’ we’ll get to. (2.5)

Together, (2.3) and (2.5) gives us that V*(s) is the max action of reward + expected V*(s’)

(2.4) the optimal policy is following the maximal Q*. This is shown as for solving the recursive formula of every V, V* with its companion Q* must be from a policy that selects the maximal Q*…





RL 1


we are trying to learn a good policy without worrying about estimating all the parameters of the model. We want to maximize L = the expected reward for a finite horizon of states-actions path of length T.

The gradient of L is hard to calculate, but we use a log-derivative-trick which allows us to express the gradient as well asĀ  an expectation over over samples of state-actions paths with length T! We can now empirically estimate this expectation.


But there will be variance and we must minimize it. We can shift each component of the gradient \theta_i by a constant. It won’t change the expectancy, since it’s own expectancy is 0, but it can minimize the variance which is a good thing.


Another reduction in variance can be achieved. The policy gradient can be written as a double sum over the reward step k and the action time t. We can remove the terms where k < t intuitively as the reward was already given so it does not affect future errors.

Computational Genomics 12 -NGS

NGS – the race towards cheap and quick sequencing.


We have a reference genome, and some short reads. Instead of using a suffix tree which costs a lot of space, use hash for small chunks, and align each read from there. We also estimate the mapping quality by dividing the probability of fitting at “u” to the sum of probabilities fitting anywhere else.

Burrows-wheeler transform

Take a text of length L. do all cyclic transformations, and then sort it alphabetically, then read the last column. It maps to a new text of length L.

The value_counts stay the same. We can recover the first column as the sorting is lexicographic. Now because its cyclic, we can recover all the pairs of 2! and now we we can recover the second column, as its also lexicographically sorted.

Last-first mapping

The internal order of the “a” characters in the last column is the same order as in the first column. why? because they are ordered by their next character (which is in the first), which is in the second column as well – determining their order in the first column…

Using this lemma we can reconstruct the text! we take the character from F with a corresponding $ in L, and from there we can start cycling thru all characters.

Searching the BWT index

We search from the end of the pattern backwards, similar to reconstructing, only with uncertainty in which of the rows of the first column are we.

Relation with suffix array and finding indexes of match

The BWT is ordered just as the suffix array. We save the index only each 100 entries. We continue to go backward thru the pointers until we reach an indexed location.

Sequence assembly – no reference genome

de-Bruijn – build vertices for all possible k-mers (if the sequence length is ~100 then k would be ~30). Build edge from each vertex to other vertices who’s prefix match the source suffix. A sequence matches a path in this graph.

How to merge 2 short paths with overlaps to a long single paths?


oy lo.


Computational Genomics 10 – Suffix trees

Suffix trees allow for efficient multi-query. We pre-process the sequence, and build a suffix tree, which is a compressed trie. After we have that, a pattern P matches S iff it is a prefix of some suffix of S. The naive way to build a suffix tree is adding the suffixes one at a time – O(m^2).

UA (Ukkonen’s algorithm)

UA (Ukkonen’s algorithm) builds successive implicit suffix trees. At each stage we extend and add s[i+1] to s[j..i]. This is done by the extension rules:

  1. S[i] is leaf – just add a leaf under it
  2. Following S[i] is some other u, need to create a new internal node and a new leaf
  3. S[j..i+1] is already at the trie – do nothing

The naive way to do this would be O(m^3), but the using the following tricks we can improve to O(m):

  1. suffix links – as in the i phase we add the suffixes from the longest [1..i] to the shortest [i-1,i], we can build suffix links from xa to a. Then, we can use them by going up with them, them coming down for the size of the prefix!
  2. skip and count – after jumping with the suffix link, we have to walk down. We can count the number of letters on each edge, and thus keep the walking down proportional to the number of nodes, not letters. It can be shown that during a phase, we go to some node, then decrement a maximum of 2m with the up walks and suffix links, so we go down a maximum of 3m – as the deepest node is at depth m. -> overall we’re at O(m^2)
  3. rule 3 is a stopper – if a string is already in the tree, then all of its suffixes are also in the tree, we can move on to the next phase
  4. once a leaf, always a leaf – a leaf remains a leaf, it just gets added new characters. We can store indexes instead of characters, and mark the last index with the special symbol “e”. That way we can extend all the leafs from the previous phase in O(1).

Implementation issues

for each node, there’s a choice of how to represent the outgoing edges: an array, linked list , balanced tree, hash or mixture of the correct choice per node.


  1. find exact match – every node in the subtree are an occurrence
  2. generalized suffix trees – for multiple strings
    1. longest common substring – node with both type of descendants
    2. LCA of two suffixes (leaves) represent their LCP – we can preproces and answer such queries in constant time
    3. maximal palindrome – insert S and Sr, for every i find the LCP of the corresponding suffixes of S and Sr

Suffix array

All suffixes sorted lexicographic order. We build a ST and DFS traverse it. We search for a pattern with a binary search.

Computational Genomics 9 – clustering

Clustering motivation

there’s a matrix with genes in the rows, samples in the columns. We can either

  1. cluster samples – which leads to class discovery, by measuring relevance to phenotype = clinical outcomes (Kaplan-Meier)
  2. cluster genes – groups of co-regulated genes, then we test functional enrichment = guilty by association

Q: what does the above mean?

Hierarchical clustering – Average linkage

agglomerative instead of divisive. start with a cluster for each item, in each step merge the closest clusters r and s. The distance to all other clusters is a weighted average of the previous distances from r and s. We can also use min/max instead of average.

Non-hierarchical clustering – K-means

iterative use of EM.


mates are a pair of points from the same cluster. Similarity can be a dot product between those 2 points. We expect the similarity between mates to be higher

than the non-mates similarity. The main idea is to look for a good cut to 2 clusters, which turns out to be the min-cut in the similarity graph! Note that sometimes we

don’t use the whole graph, we remove edges which have very low weight.

We sum over all the edges in the cut and try to decide between H0 = all the edges in the cut are non-mates, and H1 – all edges are mates. Kernel is when all possible

cuts are H1. If even for the min-cut H1 is more probable, stop – we have reached a Kernel. Else, let’s cut by the min-cut!

The initial parameters are retrieved by EM. A lemma we had to prove is that the weight of any cut is the probability of the similarities of the weights in the cut

under H1 divided by the probability under H0.

Recitation 9 – clustering

min-max intra-cluster dist – pay the price of the largest distance inside any cluster. The intuition is – make sure that the furthest points are in different clusters.
In every iteration the distance to head only gets better. Delta = the worst distance at the end, so there were k+1 points worse than delta. So if we partition to k clusters, the optimal solution will be worse than delta, as it will have 2 points in the same cluster.

By the triangle inequality, our solution would be 2*delta – point to head and then to point.


Computational Genomics 8 – Gene finding and motif analysis

Modeling ORFs

Find ORF = open reading frames in a long DNA sequence. We try to build a Markov model where the hidden states are nucleotides to discern Gene mode from NORF mode, but the variance is too big. When we use 3-letter combinations of nucleotides (codons), it’s more significant. This makes for a 64-state HMM. We choose the reading frame (indentation) using the frequencies of all codons that appear in it.

  • Q: slide 14, what do they mean by prefwindow?
  • A: this slides shows us the cumulative probability on a window of 25 codons for the decoding. Note that outside ORF the codons do not translate to proteins, so we don’t measure their probability. Also note that only in reading frame1, inside an ORF the probability rises notably! So this must be the correct reading frame


A promoter is a sequence that encourages transcription of following gene – it calls the RNA polymerase.

A PWM (positional weight matrix) is a matrix with the frequency of each base per position for the promoter, in the non-coding segment. We can score each sequence assuming its a promoter and assuming its not a promoter, and divide the 2 results for a log-likelihood ratio score.

Splicing – removal of introns, happens at the RNA level.

  • Q: slide 22, what do they mean by RNA-RNA base pairing?
  • The splicing (removal of introns) is done after the RNA is composed, by translating it into a new RNA by the spliceosome complex.

Intron/Exon length distribution

we can not use HMM as its memory-less, so can only model geometric distribution of length, and exons do not behave like that. So, we use a generalized HMM = GenScan. Output of each state can have a different distribution. We also generalize PWM to Weighted Array Model, where the distribution in each position can be dependent on other positions.

Spliced alignment

if we have a spliced mRNA, align it to the DNA.

  • Q: What are we seeing in slides 39-41?
  • A: These are short segments, spread all over the RNA,

Regulatory sequence analysis

The promoter is several BS (binding sites) in the DNA that are bound by several proteins called TF (transcription factor) that is regulating the gene’s expression. If several genes are co-expressed we assume they have common BSs!

The length we assume for the promoter is typically ~2Kbp, and we look in both strands upstream the transcription start site (TSS).

Models for BS

  1. exact string
  2. 1 mismatch
  3. degenerate string – several options per position
  4. PWM

Techniques for detecting BS

  1. In vitro – protein binding microarrys – generate a DNA with all possible k-mers. detect the TF binding to each of them
  2. In vivo – CHIP – use antibodies as fishing rods for the wanted BS
  • Q: in slide 53 – I didn’t understand the method… are we looking for BS or are we looking for TF?
  • A: both…



suppose we know the motif. We want to see if its over-represented in co-regulated genes. We set the threshold for our PWM s.t. it will fish 5% hits of random sequences generated with a HMM of the background probabilities.

For each promoter, test how much it got and how much from the target set and then calculate of getting such or more extreme result by chance (With the hyper-geometric distribution). We can also find co-occurrence using the same technique.

Motif discovery de-novo – MEME

We assume ZOOPS – Zero or One Per Sequence. Also we assume uniform probability of \lambda = \delta / m in each position in each sequence. The hidden data is the PWM matrix. Then we run EM inference.

Recitation 8 – spliced alignment

We do DP. S(i,j,k) defined when i is inside exon k. Everything goes as usual, only that in positions which are first positions of an exon, we go backward to any of the last positions of any previous exon. A trick to save more time – maintain P(i,j) the best chain that ends before i. This costs mn + mN, as we go over all the m positions for j, and either we go over exons that end in i (N), or simply over all i (n).

Computational Genomics 7 – EM


We get multiple sets of coin tosses, originating from 2 coins with unknown parameters. X = number of heads in each set, Y = which coin was used? (hidden). If we knew the Y assignments, we could simply find the ML solution maximizing P(x,y|theta) which is the fraction of H per coin. But we don’t. So we can guess, find the most likely coin P(y|x,\theta), update our guess and continue till convergence. EM is working with probabilities instead of the most likely coin…

Probabilistic setting

We have a mixture of two Gaussians with known equal variance, and would like to find the maximum likelihood parameters: \theta = p, \mu_1, \mu_2, and color each dot in blue/red accordingly.

KL divergence D(P || Q) = \sum{P(x)log\frac{P(x)}{Q(x)}}. This measures how unlikely it was to get P if our null hypotheses was Q. It is always >=0.

The EM algorithm for heights tries to maximize log P(X=180|\theta) = log \sum{P(X=180,Y=y|\theta)}. We now try to express the probability for X=180 with relation to the different options for Y (M,F). From Bayes rule we have:

P(X=180|\theta) = P(X=180, Y=M | \theta) / P(Y=M|X=180, \theta).

We can show that maximizing the difference from the current parameters is equal to maximizing this:

\Delta = Q(\theta | \theta^t) - Q(\theta^t | \theta^t) + KL(P(y|x,\theta^t) || P(y|x,\theta))

Where the KL divergence part is almost non-negative for everything we choose. So the important part is:

Q(\theta | \theta^t) = \sum_y P(y|x, \theta^t)log P(x,y|\theta)

  • Q: in slide 14, why do we need the math gymnastics with y_ij?

This is an expectation over the distribution of y (conditioned on current parameters and x) of the complete log-likelihood. We sum this over all possible x_i and y_j, find the roots of the derivatives and that’s it.

EM for HMM – Baum-Welch

The Q function for HMM is an expectation over the distribution of \pi of the complete log-likelihood. This is a multiplication of

  1. the emission probability of “b” in state “k”, to the power of the number of times we saw that
  2. the transition probability from “k” to “l”, to the power of the number of time we saw that

so we need the maximum expectations of those 2 things. The forward-backward algorithm gives us the joint probability of a transition in a specific position “i”. If we sum over all possible positions and divide by P(X) we get the conditional expectation

  • Q: in slide 21, why do we get the conditional expectation by summing over all positions? and why is the specified weights the ones maximizing the expectations?
  • Q: what is slide 22 showing us? is this the answer to the previous question?