Computational Genomics 3 – Sequence alignment heuristics

Heuristic algorithms that speed up the global alignment solution: FASTA, BLAST.

FASTA detects diagonals of perfect match with len > ktup. We prepare an index for each ktup-sequence all the places it appears in. We then find all matches. Then run banded DP on regions with high scoring diagonals. Combine these sub-alignments (SA) into a maximum weight path.

FASTA global alignment output analysis – once we found hits, we want to know if its a false positive. The random distribution of scores is unknown. So all we can say is that if only 3/100 random sequences had a score larger than we got, then the p-value is likely <0.03, but we can’t say that the score acts normal. So we can express it in mean and std terms, but we can’t translate it to P with the normal distribution table.

BLAST – divide the query to words of size W (overlapping). Blast each word to a list of close words using a substitution matrix. Detect exact matches and then try to extend using ungapped extension. Prune when the score drops below the best score yet. Another variant – do a gapped extension using DP backward and forward from sequence. Stop at cells where value drops well below the best value found so far.

local alignment analysis – well understood distribution. The model – each amino-acid residue (there are ~20 of these) is selected independently, from a background distribution defining the probability per residue. The expected score for a random pair of residues should be negative, otherwise a longer random pairing would have a larger score. Turns out that the score distribution follows extreme value distribution. Just as the sum of i.i.d variables tends to normal, the max tends to that distribution.

E(s) = Kmne^{-\lambda s}

And a = the number of sequences with higher scores >= s would follow a Poisson distribution:

p(a) = e^{-E(s)}\frac{E(s)^a}{a!}

so the probability of finding 0 would e^{-E(s)} so our p value = the chance of finding 1 or more random sequences with larger score would finally be: 1 - e^{-E(s)}

Substitution scores – we use the background probabilities p_i, p_j and the empirical target frequencies q_{ij} for this substitution between these 2 residues. The substitution score is given as ln(\frac{q_{ij}}{p_i p_j})/\lambda

PAM – estimating q_{ij} by starting with a protein that has 1% changes per generation (1 PAM), collect statistics on exchanges in the matrix M, then estimate M^k for k generations. Note that as residues will return to what they were, after 100 PAMs we will never be at 100% difference!

BLOSUM – matrices developed from local alignment on sequences that are not entirely identical. In BLOSUM-62, all sequences are less than 62% identical.

Computational Genomics 2 – pairwise alignment

Sankoff – a family of algorithm for matching 2 strings with indels and substitution. We use dynamic programming for completing the matrix with the best cost, and then trace back for the actual alignment.

Efficient space consumption = Hirschberg. Divide and conquer and each time compute path only for the middle row and its 2 neighbor rows.

End-space free alignment – we’re allowed to have leading and trailing spaces. We use a variant of the global alg – the first row and column have 0 cost, and we can finish everywhere in the last row/column instead of just the bottom-right corner.

Local alignment – we modify global alignment again – we dump all negative prefixes by putting at least 0 in each cell, and we back-track from all cells with the maximum value.

Gap penalties – we use 3 matrices: one for 2 characters for both i and j, and the other 2 represent a space in either i or j. That way we can maintain which gap is open and penalize accordingly (for a new gap or for a prolonged gap)

ML course 1+2

I’m following

Lecture 1

The first lecture included mainly vague descriptions of popular ML problems. They explained our need to generalize and avoid overfitting. They introduced K-means and came with a naive upper bound for the number of iterations k^n (the number of all possible divisions). Turns out that in order to get a much better bound you have to consider Voronoi diagrams, and you have to define the criterion for good clustering as variance-based clustering: |S|^{\alpha - 1} \sum{||x_i - \bar{x}||^2}. Also turns out that sum-of-squared-distances-from-mean-equals-all-pairs-sum-of-square-distances. I didn’t go into the world of weighted Voronoi diagrams…

Lecture 2 – Bayesian inference

The axiom of machine learning – historical data and new data come from the same distribution. So we would like to learn that distribution from the historical data. The likelihood function is presented. MLE estimates for \mu, \sigma are derived for a sample with Binomial and Normal Distributions.

Then we do the same for uniform distribution U(0,\theta). Turns out that the MLE estimator for \theta is the largest observation. Another estimator called the MOM estimator is twice the average of observations. While the MLE estimator is biased and the MOM is unbiased, the MLE has smaller variance. For comparing these 2 variances, see SO question for comparing MLE and MOM estimator variances

The derivation for the MOM estimator variance was a bit involved for me – see so question about expectation of sum squared observations

Likelihood estimator

Multinomial distribution – several words from a language, we want to estimate the unknown multinomial distribution that generated that language = maximum likelihood that p_1^{c1}... p_n^{ct} under the constraint the p_i is indeed a distribution = adds up to 1. We use the powerful Lagrange multipliers to find a point where the gradient for the optimization function and the gradient for the constraint are parallel!

Bayesian estimators

The parameters are now random variables and have a prior distribution. We try to estimate the MAP estimator, which will maximize the posterior distribution. The prior can sometimes be seen as an extra observation that makes the MAP estimator give different results than the MLE estimator. We can also explicitly compute the posterior distribution of the parameter, which is way more informative than the MAP estimate of the parameter. If the prior and posterior are from the same family, we say that they are conjugate.

Beta distribution

The Beta distribution as a prior is very useful, as if our random variable distributes Binomially, then the posterior would also be in the Beta family. The uniform prior can also be expressed as a Beta prior.

Naive Bayes classification

Given a prior on 2 classes, decide which is more likely to fit the data assuming the attributes are independent. How can we extend NB to be aware of dependent attributes?

Recitation 2 – EM

The general concept is Minorization Maximization. It’s an iterative process to find a maximum of a function. In each step we find a surrogate function which is below the function in every point, and is touching it in the current estimate. We maximize the surrogate and continue.

In EM, the function we wish to maximize is the log-likelihood. We are presented with a observations of Hemoglobin values, and we assume that there is a latent Bernoulli variable \theta which determines if the patient is in the sick/healthy population, and each population has different parameters for its Hemoglobin distribution.




Computational genomics – 1

Here are some cool insights which were new to me:

Transcription and translation

Transcription = DNA->mRNA. Apparently in Eukaryotes there’s only a tiny fraction that gets transcripted called exons. All the other parts are called introns and stay inside the nucelus.

As DNA/RNA are interpreted by triplets called codons, there are 3 possible indentations. Not sure why, but there are 3 in the other direction as well!


Terribly simple and beautiful! We take a single-stranded DNA. We synthesize a complementary fragment called a prime. We put in the 4 nucleotides. We add to the mixture a variant of one of them (E.g. C) called de-deoxy, which lacks the 3′ terminus which is required to continue the chain. We get chains which terminate with all possible locations of C. We then use a electrophoresis analysis to get the length of each chain (thus getting all indexes of C).

Polymerase chain reaction – PCR

amplify a selected section X 1M times. We design 2 synthetic DNA oligo-nucleotides for the start and end of the section. We use a heat-resistant polymerase, which attaches to the oligo-nucleotide of eight start or end, and the product serves as a template for the next cycle. After a couple of the start/end trims, we’re left with huge amounts of the required section.

Restriction enzyme digestions

not sure why the double-digestion problem arises! The partial digestion is just reconstructing the sequence given all the distances between each pair.

Human genome – who’s genome was it?

confidential. there were 5 volunteers and their identity is hushed (probably X-men).


Following Princeton’s GLM course

NYU – OLS estimator without likelihood motivation

These notes develop the OLS estimator with minimum assumptions – They assume the real world true model is really linear (systematic and stochastic) and build an estimate for this real population from the sample. Then, building on the Gauss-Markov theorem they show that the estimate is pretty good in many senses (BLUE).

See this so-question about why is it different than using likelihood theory motivation and how come the 2 framework coincide so nicely in the linear regression model. The hypothesis testing for the estimator relies (in addition to the Gauss-Markov assumptions) on assuming the errors are normally distributed – see so question on why do we need this assumption

2.1 – linear models

Very gentle intro to linear models. The random structure is shown (normal distribution around \mu_i and same \sigma for all observations = homoskedasticity) and the systematic structure. There’s discussion of the 2 far end models – the null model and the saturated model. Note that this intro assumes heavily on the distributions and builds on that for a maximum likelihood interpretation. while nyu shows the same OLS estimator without any assumption on distributions…

2.2 – estimation of parameters

We use maximum likelihood, and get the famous normal equations. In addition, this estimator has few interesting properties – it’s unbiased and we can calculate its variance.

2.3 – tests of hypotheses

We’re introduced to 2 types of tests: Wald tests for the MLE and likelihood ratio tests. The first kind tests one/several variable for their significance (the probability that they are not 0). The second kind of test – likelihood ratio tests the reduction in RSS we got by adding more variables. For linear models the 2 kinds are equal (for some reason)…

2.4 – Simple linear regression

very basic – just a parameter and a constant. We do an ANOVA – how much of the RSS is explained by the variable, R-squared which is the pct of RSS explained (perfect is 1, but can go below 0 in pathological scenarios), and pearson-r is the proportion of variance explained by the model (ranges between -1 and 1).

2.5 – Multiple linear regression

Multiple independent variables. The model assumes additivity, and it may represent combinations which are impossible in the real world, but it is still useful. Note that the coefficients in a multiple regression are considered as net effect, as they are associated with fixed values for the other predictors! in a simple regression they are gross effects (contain variability of other predictors).

ANOVA and hierarchical ANOVA – After we explain some of the RSS with the first predictor, the sqrt of the proportion explained by the second predictor out of the remaining RSS is called partial correlation coefficient. It can also be derived by regressing y on x1, x2 on x1 – taking the residuals from both and then measuring simple Pearson correlation between them.

2.6+7+8 – mixing categorical variables (factors)

The linear model supports discretizing one or more variables into categories, and we can build ANOVA tables for them (and the numeric regular variables), and calculate the square root of the proportion of variance explained by each discrete factor – called the correlation ratio (the same thing we called correlation coefficient in the continuous case)

2.9 – Regression diagnostics (obsolete)

When linear models are built on a small number of samples, there’s a point in diagnosing the contribution and possibility of being an outlier per data-point. I don’t think that’s relevant for modern models.

2.10 – transformations








The concrete guide to likelihood and fisher information

Let’s follow Maximum likelihood estimation in the most concrete way I can think of.

Suppose we have a sample x of 10 coin draws. Suppose each coin draw is independent and is a random variable from a Bernoulli distribution with parameter p_0 = 0.3. The likelihood of each coin draw can be written as a function of p:

L(x_i;p) = p^{x_i}(1-p)^{1-x_i}

and for the entire sample of 10 draws:

L(x;p) = \prod{p^{x_i}(1-p)^{1-x_i}}

and the log-likelihood is:

LL(x;p) = \sum{x_i}log(p) + (n-\sum{x_i})log(1-p)


The score is the first derivative with respect to each of the parameters. Luckily, in our case there’s only one.

S(p) = \frac{dLL(p)}{dp} = \frac{\sum{x_i}}{p} - \frac{n-\sum{x_i}}{1-p}

Since we’re after the maximum of the log-likelihood = the mle, under regularity conditions we can find it by setting the score to 0. And the mle for our specific 10 draws sample turns out to be:

{p}_{mle} = \frac{\sum{x_i}}{n}

Now let’s learn about the expected score (currently for no reason, just a fact to keep in your head). Imagine we do another 1 million X 10 coin draws samples, and average their score function for every value of p. Let’s see what function do we expect:

E(S(p)) = E(\frac{dLL(p)}{dp}) = \frac{E(\sum{x_i})}{p} - \frac{n-E(\sum{x_i})}{1-p} = \frac{np_0}{p} - \frac{n(1-p_0)}{1-p}

And its obvious that it zeros out if we evaluate it in the true parameter p_0. Remember that for later.


Now that we have our mle, we would like to estimate how certain are we that its the correct number? As steeper our log-likelihood function is, we are more certain that we’re right. Let’s start to formalize this intuition with the observed fisher information – which is always evaluated at the mle, and defined as:

J(p_{mle}) = - \frac{d^2}{dp^2}LL(p)|_{p_{mle}}

and in our case:

J(p) =  \frac{\sum{x_i}}{p^2} + \frac{n-\sum{x_i}}{(1-p)^2}

and when evaluated at the mle it simplifies to:

J(p_{mle}) =  \frac{n}{p_{mle}(1-p_{mle})} 

Why do we care about this observed information? WE DON’T. We wish to evaluate the steepness of the expected log-likelihood, across a million samples of 10 coin draws. That expected fisher information is the real deal – as its related to the variance of MLEs we’ll get for each of those million samples (as we’ll soon show).

I(p) = - E\frac{d^2}{dp^2}LL(p)

and here it is for our case:

I(p) = E(\frac{\sum{x_i}}{p^2} + \frac{n-\sum{x_i}}{(1-p)^2}) = \frac{np_0}{p^2} + \frac{n-np_0}{(1-p)^2}

when evaluated at the true parameter it simplifies to:

I(p_0) = \frac{n}{p_0(1-p_0)}

Now, there are 2 interesting properties we can prove for the expected fisher information

  1. It is equal to the expectation of the squared score at the true parameter E(S(p)^2) (proof below)
  2. And its also equal to the variance of the score at the true parameter, since VAR(S(p)) = E(S(p)^2) -E(S(p)^2) and we already know that the second term is zero as the expectation of the score at the true parameter is 0.

The intuition: as large as the information gets, the curve (second derivative) of the log-likelihood gets higher – which means that the true parameter really shines above the other options we have for p. This also means that the variance of other score functions evaluated at the true parameter would be larger – they won’t come close to 0! And exactly because of this (and a bit counter-intuitively) the mle would be closer to the true parameter. see this so question


Likelihood theory

My end goal is to understand approaches in modeling survival. Turns out that this usually goes into the framework of GLMs, and that in order to understand GLMs you need to understand the theory of likelihood.

pennstate intro to stats shows how to use likelihoods to build estimators for 2 types of distributions. First they show a random sample we assume was sampled from a Bernoulli distrib, so we need to build an estimator for the one parameter p. Next they show how to do the same for Gauss distrib, which means 2 parameters – mean and variance. This is interesting because it shows that sometimes you need 2 estimators. However, they jump right ahead from the derivative to setting to 0 and finding the 2 estimators, instead of showing that there is a stage in which after you derive w.r.t each parameter, you have 2 functions. Note that each of these estimators is still a statistic = a function of random variables (and of the parameters).

Next there’s princeton’s review of likelihood theory – it’s a bit more rigorous, and gives examples on the Geometric distrib, which also has one parameter. But they do show the fisher score function – vector of derivatives w.r.t each parameter. Once you get that it is a function with random variables,  you can compute its expectancy = its mean. It turns out (see my so question) that once you compute E with the true parameters of the distribution in your density function, and then you plug in these true parameters to the score function, you get 0.

Another interesting thing they do is calculate the variance of the score function, which is simple as its expectancy is 0. It is still a function of the parameters, and is called the expected information matrix. We can also calculate it using the expected value of the second derivatives of the log likelihood and then it is called the observed information matrix, and is the expectancy of the Hessian.

Then we can use either Newton-Raphson to find the mle, or instead of the Hessian use the information matrix, and then it is called Fisher’s scoring.

As to why there are 2 equivalent ways of calculating the information matrix, it’s basically because we take the expectancy w.r.t the random variable and the derivative w.r.t the parameters, so they’re interchangeable.

Some intuition behind these stat results: the score vector shows the sensitivity of the likelihood for changes in Theta1, Theta2 – it’s a 3d surface. As the likelihood is convex, the maximum is at the point where both those sensitivities are 0. The thing is, there is such a surface per sample, and the expectancy is the average surface – and it zeros out when evaluated at the true parameters.

The information matrix… is a tougher cookie. I can only grasp it when expressed as the expectancy of the second derivative – the extent of the bends in the likelihood surface.



Nelson-Aalen conservation of events

Credit goes to Yaron for explaining me this.

Nelson-Aalen has a cool property – if you sum it over all points in time, you’ll get the total number of events (deaths).

Key point to understand the reason – let’s assume that for each individual there’s a unique point-in-time t(i) in which he and he alone either died d(i) = 1, or was censored out d(i) = 0. The estimator in t(j) is H(j) = sum[d(j)/j].

Now to sum over all points in time:

S = sum sum [d(j)/j] = sum[i * d(i) / i] = sum[d(i)]



The code-chef travtree editorial proved pretty hard for me to understand. It contained many details and statements that weren’t obvious to me (understatement..). Here is the solution as simplified as I could. (credit goes to Yoav that explained it to me)

Every path between U and V can be thought of as a triangle with a single LCA W (with respect to a root R). When we think of how can 2 paths intersect eachother at a point X, turns out there are not so many configurations where this could happen without creating a cycle (which could never happen in tree). First let’s consider the most obvious intersection between 2 ‘brother’ triangles:


Fig1 can’t happen, as it will create a cycle RWXW’. So W and W’ can’t be brothers. So one of them must be higher than the other, let’s assume the higher one is U’W’V’ (the black path).


Fig2 also can’t happen – note the cycle W’XWY.


So, the only way 2 paths can intersect is:



The higher (U’W’V’) path passes thru the lower path’s LCA W on its way to V’.

When we add a new path we could be the higher path, trying to detect the lower or it could be the other way around:

  1. Imagine we are adding the higher black U’W’V’ path, to detect all the old red paths below us – we should count how many LCAs like W are occurring along our path?
  2. Now lets pretend we are adding the lower red UWV path, to detect all the old paths above us we should count the cases where only V’ (or U’) is below our LCA W, but not these cases:
  • U’ and V’ are both below us, W’ isn’t. This is the impossible Fig2 – so let’s not worry about it.
  • if W’ is below us this immediately means that both U’ and V’ and the entire path is below us, so we want to ignore this situation as this is Fig5.

Now we need a way to linearize the problem, so we can use a smart data structure that enables updating and querying to answer both (1) and (2) super quickly…

Build a scikit-learn virtualenv with no internet connection

The previous post assumed you have an internet connection. If that’s not the situation on your prod server, you’ll need the following tweaks:

build the virtualenv using a local

Download from scp it to the prod server, and then do this:

[ihadanny@prod ~]$ tar -xzvf virtualenv-1.11.6.tar.gz
[ihadanny@prod ~]$ ~/virtualenv-1.11.6/ svm_poc/test_scikit
New python executable in svm_poc/test_scikit/bin/python
Installing setuptools, pip...done.
[ihadanny@prod ~]$ source svm_poc/test_scikit/bin/activate

Download the python modules to local zips

On another machine that has internet connection, do this:
[ihadanny@dev ~]$ pip freeze > req.txt
[ihadanny@dev ~] $ pip install --download zips -r req.txt
[ihadanny@dev ~] $ tar -czvf zips.tar.gz zips

Now copy the tar.gz to the prod machine and unpack it.

Install python modules from the downloaded zips

back in the prod server, make sure you’ve installed BLAS, and then:

(test_scikit)[ihadanny@prod ~]$ which pip
(test_scikit)[ihadanny@prod ~]$ pip install --no-index --find-links=file:///x/home/ihadanny/webcat_svm_poc/zips numpy scipy scikit-learn pandas