Heuristic algorithms that speed up the global alignment solution: FASTA, BLAST.

**FASTA** detects diagonals of perfect match with len > **ktup**. We prepare an index for each ktup-sequence all the places it appears in. We then find all matches. Then run banded DP on regions with high scoring diagonals. Combine these sub-alignments (SA) into a maximum weight path.

FASTA **global alignment** output analysis – once we found hits, we want to know if its a false positive. The random distribution of scores is **unknown**. So all we can say is that if only 3/100 random sequences had a score larger than we got, then the p-value is likely <0.03, but we can’t say that the score acts **normal**. So we can express it in mean and std terms, but we can’t translate it to P with the normal distribution table.

**BLAST** – divide the query to words of size W (overlapping). Blast each word to a list of close words using a substitution matrix. Detect exact matches and then try to extend using ungapped extension. Prune when the score drops below the best score yet. Another variant – do a gapped extension using DP backward and forward from sequence. Stop at cells where value drops well below the best value found so far.

**local alignment **analysis – well understood distribution. The model – each amino-acid **residue** (there are ~20 of these) is selected independently, from a background distribution defining the probability per residue. The expected score for a random pair of residues should be negative, otherwise a longer random pairing would have a larger score. Turns out that the score distribution follows **extreme value distribution**. Just as the **sum** of i.i.d variables tends to normal, the **max **tends to that distribution.

And **a = **the number of sequences with higher scores >= s would follow a **Poisson distribution:**

so the probability of finding 0 would so our p value = the chance of finding 1 or more random sequences with larger score would finally be:

**Substitution scores – **we use the background probabilities , and the empirical target frequencies for this substitution between these 2 residues. The substitution score is given as

**PAM **– estimating by starting with a protein that has 1% changes per generation (1 PAM), collect statistics on exchanges in the matrix M, then estimate for k generations. Note that as residues will return to what they were, after 100 PAMs we will never be at 100% difference!

**BLOSUM – **matrices developed from local alignment on sequences that are not entirely identical. In BLOSUM-62, all sequences are less than 62% identical.