Heuristic algorithms that speed up the global alignment solution: FASTA, BLAST.
FASTA detects diagonals of perfect match with len > ktup. We prepare an index for each ktup-sequence all the places it appears in. We then find all matches. Then run banded DP on regions with high scoring diagonals. Combine these sub-alignments (SA) into a maximum weight path.
FASTA global alignment output analysis – once we found hits, we want to know if its a false positive. The random distribution of scores is unknown. So all we can say is that if only 3/100 random sequences had a score larger than we got, then the p-value is likely <0.03, but we can’t say that the score acts normal. So we can express it in mean and std terms, but we can’t translate it to P with the normal distribution table.
BLAST – divide the query to words of size W (overlapping). Blast each word to a list of close words using a substitution matrix. Detect exact matches and then try to extend using ungapped extension. Prune when the score drops below the best score yet. Another variant – do a gapped extension using DP backward and forward from sequence. Stop at cells where value drops well below the best value found so far.
local alignment analysis – well understood distribution. The model – each amino-acid residue (there are ~20 of these) is selected independently, from a background distribution defining the probability per residue. The expected score for a random pair of residues should be negative, otherwise a longer random pairing would have a larger score. Turns out that the score distribution follows extreme value distribution. Just as the sum of i.i.d variables tends to normal, the max tends to that distribution.
And a = the number of sequences with higher scores >= s would follow a Poisson distribution:
so the probability of finding 0 would so our p value = the chance of finding 1 or more random sequences with larger score would finally be:
Substitution scores – we use the background probabilities , and the empirical target frequencies for this substitution between these 2 residues. The substitution score is given as
PAM – estimating by starting with a protein that has 1% changes per generation (1 PAM), collect statistics on exchanges in the matrix M, then estimate for k generations. Note that as residues will return to what they were, after 100 PAMs we will never be at 100% difference!
BLOSUM – matrices developed from local alignment on sequences that are not entirely identical. In BLOSUM-62, all sequences are less than 62% identical.