# Computational Genomics 8 – Gene finding and motif analysis

## Modeling ORFs

Goal: find ORFs (open reading frames) in a long DNA sequence. We first try to build a Markov model where the states are single nucleotides to discern Gene mode from NORF mode, but the variance is too big. When we use 3-letter combinations of nucleotides (codons), the difference is more significant. This makes for a 64-state HMM. We choose the reading frame (the offset at which codons start) using the frequencies of all codons that appear in it.

• Q: slide 14, what do they mean by prefwindow?
• A: this slide shows the cumulative probability over a window of 25 codons for the decoding. Note that outside an ORF the codons are not translated into protein, so we don’t measure their probability. Also note that only in reading frame 1 does the probability rise notably inside an ORF, so this must be the correct reading frame.
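As a rough illustration of choosing the reading frame from codon frequencies, the sketch below scores each of the three forward frames by summing per-codon log-odds. The function names, the uniform 1/64 background, and the pseudocount are assumptions made for this example, not taken from the slides.

```python
import math
from collections import Counter
from itertools import product

CODONS = ["".join(c) for c in product("ACGT", repeat=3)]

def codon_log_odds(coding_examples, pseudocount=1.0):
    """Log-odds of each codon: smoothed frequency in known coding
    sequences (read in frame 0) versus a uniform 1/64 background."""
    counts = Counter()
    for seq in coding_examples:
        for i in range(0, len(seq) - 2, 3):
            counts[seq[i:i + 3]] += 1
    total = sum(counts[c] for c in CODONS) + pseudocount * len(CODONS)
    return {c: math.log((counts[c] + pseudocount) / total * len(CODONS))
            for c in CODONS}

def frame_scores(dna, log_odds):
    """Sum of codon log-odds for each forward reading frame (offsets 0, 1, 2)."""
    scores = []
    for frame in range(3):
        codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
        # Codons with unexpected letters contribute 0 (an assumption).
        scores.append(sum(log_odds.get(c, 0.0) for c in codons))
    return scores

# Toy usage: training data rich in GCT, query shifted by one base.
log_odds = codon_log_odds(["GCTGCTGCTGCTGCT"])
scores = frame_scores("CGCTGCTGCTGCT", log_odds)
best_frame = scores.index(max(scores))
```

In this toy case the best frame should be offset 1, because only there does the query read as repeated GCT codons. A windowed version (for example 25 codons, as on the slide) would apply the same sum to a sliding window.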

## Promoters

A promoter is a sequence that encourages transcription of the following gene – it recruits the RNA polymerase.

A PWM (position weight matrix) is a matrix with the frequency of each base at each position of the promoter, which lies in the non-coding segment. We can score each sequence once assuming it’s a promoter and once assuming it’s not a promoter, then take the log of the ratio of the two likelihoods to get a log-likelihood ratio score.
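A minimal sketch of the PWM log-likelihood ratio score, assuming the PWM is estimated from a handful of aligned example sites with a pseudocount and that the "not a promoter" model is a fixed background base distribution. The example sites, function names and pseudocount are illustrative assumptions.

```python
import math

def build_pwm(sites, pseudocount=1.0):
    """Per-position base frequencies from aligned, equal-length sites."""
    pwm = []
    for pos in range(len(sites[0])):
        counts = {b: pseudocount for b in "ACGT"}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pwm.append({b: counts[b] / total for b in "ACGT"})
    return pwm

def log_likelihood_ratio(seq, pwm, background):
    """log P(seq | PWM) - log P(seq | background), summed over positions."""
    return sum(math.log(pwm[i][b] / background[b]) for i, b in enumerate(seq))

# Toy usage with made-up sites and a uniform background.
sites = ["TATAAT", "TATAAT", "TATGAT", "TACAAT"]
uniform = {b: 0.25 for b in "ACGT"}
pwm = build_pwm(sites)
site_like = log_likelihood_ratio("TATAAT", pwm, uniform)
unrelated = log_likelihood_ratio("GCGCGC", pwm, uniform)
```

A positive score means the sequence is better explained by the promoter model than by the background, and a negative score means the opposite.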

Splicing – removal of introns, happens at the RNA level.

• Q: slide 22, what do they mean by RNA-RNA base pairing?
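• A: The splicing (removal of introns) is done after the RNA is transcribed: the spliceosome complex processes the pre-mRNA into a new, mature RNA. The RNA-RNA base pairing presumably refers to the small RNAs inside the spliceosome, which pair with the splice-site sequences of the pre-mRNA.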

## Intron/Exon length distribution

We cannot use a plain HMM because it is memoryless, so it can only model a geometric distribution of segment lengths, and exons do not behave like that. So we use a generalized HMM, as in GenScan, where the output of each state can have a different distribution, including its own length distribution. We also generalize the PWM to a Weight Array Model (WAM), where the distribution at each position can depend on other positions.
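To see where the geometric distribution comes from, consider a single HMM state with self-transition probability $p$ (a symbol introduced here only for illustration). The probability of staying in that state for exactly $\ell$ positions is

$$
P(L = \ell) = p^{\ell - 1}(1 - p), \qquad \ell = 1, 2, \dots, \qquad E[L] = \frac{1}{1 - p}.
$$

For any $0 < p < 1$ this is largest at $\ell = 1$ and decays from there, so a single state cannot represent a length distribution that peaks at some typical length.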

## Spliced alignment

If we have a spliced mRNA, align it to the DNA.

• Q: What are we seeing in slides 39-41?
• A: These are short segments, spread all over the RNA,

## Regulatory sequence analysis

The promoter contains several BSs (binding sites) in the DNA, which are bound by proteins called TFs (transcription factors) that regulate the gene’s expression. If several genes are co-expressed, we assume they have common BSs!

The length we assume for the promoter is typically ~2 kbp, and we look on both strands upstream of the transcription start site (TSS).

## Models for BS

1. exact string
2. 1 mismatch
3. degenerate string – several options per position
4. PWM
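The first three models can be sketched with plain string matching. The IUPAC degenerate-base codes used below are standard, but the function names and the example sequence are made up for this illustration.

```python
import re

# IUPAC codes: each letter stands for a set of allowed bases.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def find_with_mismatches(motif, dna, max_mismatches=0):
    """Start positions where dna differs from motif in at most
    max_mismatches positions (0 gives the exact-string model)."""
    k = len(motif)
    return [i for i in range(len(dna) - k + 1)
            if sum(a != b for a, b in zip(motif, dna[i:i + k])) <= max_mismatches]

def find_degenerate(motif, dna):
    """Start positions matching a degenerate (IUPAC) motif.
    The lookahead lets overlapping occurrences be reported too."""
    pattern = re.compile("(?=(" + "".join(IUPAC[c] for c in motif) + "))")
    return [m.start() for m in pattern.finditer(dna)]

dna = "GGTATAATCCTATATTAA"
exact = find_with_mismatches("TATAAT", dna)
one_off = find_with_mismatches("TATAAT", dna, max_mismatches=1)
degenerate = find_degenerate("TATAWT", dna)
```

Each model is more permissive than the one before it: the exact string finds one site here, while the 1-mismatch and degenerate versions also accept the second, slightly different site.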

## Techniques for detecting BS

1. In vitro – protein binding microarrays – generate DNA with all possible k-mers and detect the TF binding to each of them
2. In vivo – ChIP – use antibodies against the TF as fishing rods for the DNA it is bound to (the wanted BS)
• Q: in slide 53 – I didn’t understand the method… are we looking for BS or are we looking for TF?
• A: both…

## PRIMA

Suppose we know the motif. We want to see if it is over-represented in co-regulated genes. We set the threshold for our PWM s.t. it fishes out hits in 5% of random sequences generated with an HMM of the background probabilities.
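One way to picture the threshold calibration is the sketch below. It is an assumption-laden simplification: random sequences are drawn from independent background base frequencies rather than from the background model used in the lecture, the PWM is given as per-position log-odds, and all function names are hypothetical.

```python
import random

def random_sequence(length, background, rng):
    """Draw a sequence with independent bases (a simplifying assumption)."""
    bases = list(background)
    weights = [background[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))

def best_hit_score(seq, log_odds_pwm):
    """Highest PWM score over all windows of the sequence."""
    w = len(log_odds_pwm)
    return max(sum(col[seq[s + i]] for i, col in enumerate(log_odds_pwm))
               for s in range(len(seq) - w + 1))

def score_threshold(null_scores, hit_rate=0.05):
    """Score reached by roughly the top hit_rate fraction of random
    sequences (tied scores can push the fraction slightly higher)."""
    ranked = sorted(null_scores, reverse=True)
    n_hits = max(1, int(hit_rate * len(ranked)))
    return ranked[n_hits - 1]

# Toy usage: a 3-column log-odds PWM that prefers "AAA".
rng = random.Random(0)
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_pwm = [{"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0}] * 3
null_scores = [best_hit_score(random_sequence(50, background, rng), toy_pwm)
               for _ in range(200)]
threshold = score_threshold(null_scores)
```

A promoter would then count as a hit if its best window scores at least `threshold`.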

For each promoter, test whether it got a hit. Count how many hits there are overall and how many of them fall in the target set, and then calculate the probability of getting such a result or a more extreme one by chance (with the hypergeometric distribution). We can also find co-occurrence of motifs using the same technique.
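The hypergeometric tail probability can be computed directly, as in this sketch. The function and parameter names are illustrative, and the numbers in the usage line are made up.

```python
from math import comb

def hypergeom_tail(total, total_hits, target_size, target_hits):
    """P(X >= target_hits), where X is the number of promoters with a hit
    among target_size promoters drawn without replacement from `total`
    promoters, of which total_hits contain a hit."""
    upper = min(total_hits, target_size)
    favourable = sum(
        comb(total_hits, k) * comb(total - total_hits, target_size - k)
        for k in range(target_hits, upper + 1)
    )
    return favourable / comb(total, target_size)

# Toy usage: 10 promoters, 5 with a hit; the target set has 2, both with hits.
p_value = hypergeom_tail(total=10, total_hits=5, target_size=2, target_hits=2)
```

A small value suggests the motif is enriched in the target set beyond what random sampling of promoters would give.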

## Motif discovery de-novo – MEME

We assume ZOOPS – Zero or One (motif occurrence) Per Sequence. We also assume a uniform prior probability of $\lambda = \delta / m$ for each position in each sequence. The hidden data are the motif positions, and the PWM is the parameter we estimate. Then we run EM inference.
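For reference, the E-step under ZOOPS can be written as follows. This assumes $\delta$ is the prior probability that a sequence contains an occurrence, $m$ is the number of possible start positions per sequence, $Z_{ij} = 1$ means the motif in sequence $X_i$ starts at position $j$, and $\theta$ collects the PWM and background frequencies. The symbols $Z$, $X$ and $\theta$ are notation introduced here, not taken from the slides.

$$
P(Z_{ij} = 1 \mid X_i, \theta) =
\frac{\lambda \, P(X_i \mid Z_{ij} = 1, \theta)}
     {(1 - \delta) \, P(X_i \mid \text{no motif}, \theta) + \lambda \sum_{k=1}^{m} P(X_i \mid Z_{ik} = 1, \theta)}
$$

The M-step then re-estimates the PWM from the sequence windows weighted by these posteriors, and the two steps alternate until convergence.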

## Recitation 8 – spliced alignment

We do DP. S(i,j,k) is defined when i is inside exon k. Everything goes as usual, except that at positions which are the first position of an exon, we go backward to the last position of any previous exon. A trick to save more time: maintain P(i,j), the best chain that ends before i. This costs mn + mN, as we go over all the m positions for j, and either we go over the exons that end in i (N), or simply over all i (n).
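The recurrence and the P(i,j) trick can be sketched as below. This is an illustration under stated assumptions, not the recitation's exact formulation: candidate exons are given as inclusive 0-based intervals, scoring is a simple match/mismatch/gap scheme, the whole target must be aligned, and only the best score is returned (no traceback).

```python
def spliced_alignment_score(genome, target, exons, match=1, mismatch=-1, gap=-1):
    """Best score of aligning `target` to a chain of non-overlapping
    candidate exons of `genome`. exons: list of (start, end), inclusive."""
    m = len(target)
    starts, ends = {}, {}
    for k, (s, e) in enumerate(exons):
        starts.setdefault(s, []).append(k)
        ends.setdefault(e, []).append(k)

    # P[j]: best score of any chain ending strictly before position i,
    # aligned to target[:j]. The empty chain costs j gaps.
    P = [gap * j for j in range(m + 1)]
    rows = {}  # open exon k -> its DP row S(i-1, ., k)

    for i in range(len(genome)):
        # First position of an exon: jump back to the best earlier chain.
        for k in starts.get(i, []):
            rows[k] = P
        # Ordinary alignment step inside every exon that contains i.
        for k in list(rows):
            prev = rows[k]
            cur = [prev[0] + gap] + [0] * m
            for j in range(1, m + 1):
                sub = match if genome[i] == target[j - 1] else mismatch
                cur[j] = max(prev[j - 1] + sub, prev[j] + gap, cur[j - 1] + gap)
            rows[k] = cur
        # Exons ending at i become available to chains starting after i.
        new_P = P
        for k in ends.get(i, []):
            row = rows.pop(k)
            new_P = [max(a, b) for a, b in zip(new_P, row)]
        P = new_P
    return P[m]

# Toy usage: two true exons and one decoy block in between.
genome = "CCACGTCCCCTTGACC"
score = spliced_alignment_score(genome, "ACGTTTGA", [(2, 5), (6, 9), (10, 13)])
```

Keeping one running row P avoids scanning all earlier exons every time a new exon starts, which is the saving the trick is meant to provide.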