NGS – the race towards cheap and quick sequencing.
We have a reference genome and a set of short reads. Instead of using a suffix tree, which costs a lot of space, hash small chunks (seeds) of the reference and extend each read's alignment from its seed hits. We also estimate the mapping quality at position u by dividing the probability of the read fitting at u by the sum of the probabilities of it fitting anywhere else.
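A minimal sketch of the seeding step, assuming exact k-mer hashing (k=4 here only because the toy reference is tiny; real mappers use much longer seeds and then verify each candidate with a full alignment):

```python
from collections import defaultdict

def build_index(reference, k):
    """Map every k-mer of the reference to the positions where it occurs."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def candidate_positions(read, index, k):
    """Seed: a hit at reference position p for the k-mer starting at read
    offset j suggests the whole read aligns at position p - j."""
    candidates = set()
    for j in range(len(read) - k + 1):
        for p in index.get(read[j:j + k], []):
            if p - j >= 0:
                candidates.add(p - j)
    return candidates

ref = "ACGTACGTGACCTTGA"  # toy reference
idx = build_index(ref, 4)
print(candidate_positions("ACGTGACC", idx, 4))  # {0, 4}: 4 is the true hit, 0 is a spurious seed
```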
Take a text of length L. Form all cyclic rotations, sort them lexicographically, and read off the last column. This maps the text to a new text of length L (the Burrows-Wheeler transform).
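The transform above, written out directly (the O(L^2) rotation-sorting version for clarity; a `$` sentinel smaller than every character is appended, which the reconstruction later relies on):

```python
def bwt(text):
    """Burrows-Wheeler transform: sort all cyclic rotations of text
    (terminated by the sentinel '$') and read off the last column."""
    text += "$"
    rotations = sorted(text[i:] + text[:i] for i in range(len(text)))
    return "".join(rot[-1] for rot in rotations)

print(bwt("banana"))  # "annb$aa"
```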
The character counts stay the same, and since the first column is just the characters in sorted order, we can recover it. Now, because the transform is cyclic, each character in the last column is immediately followed by the first-column character of the same row, so we can recover all pairs of 2! Sorting those pairs recovers the second column, which is also lexicographically sorted.
The internal order of the "a" characters in the last column is the same as their order in the first column. Why? Because in both columns the "a"s are sorted by the text that follows them: in the first column they are ordered by their next characters (the second column onward, since the rows are sorted), and in the last column they are ordered by their row, which begins with exactly the text that follows that "a".
Using this lemma (the LF mapping) we can reconstruct the text! We take the character from F whose row has the $ in L, and from there we can cycle through all the characters.
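One way to implement the reconstruction, using the lemma directly: tag each last-column character with its occurrence rank, sort to get the first column, and jump between matching ranked pairs. This sketch walks backward from the row that starts with `$` (the last character of that row ends the text):

```python
def inverse_bwt(last):
    """Invert the BWT via the LF mapping: the i-th occurrence of a character
    in the last column is the i-th occurrence of it in the first column."""
    ranked_last, seen = [], {}
    for c in last:
        seen[c] = seen.get(c, 0) + 1
        ranked_last.append((c, seen[c]))
    first = sorted(ranked_last)                     # ranked first column
    lf = {pair: i for i, pair in enumerate(first)}  # pair -> its row in F
    row = 0          # '$' is smallest, so row 0 is the rotation "$text..."
    chars = []       # collected from the end of the text backwards
    for _ in range(len(last) - 1):
        chars.append(ranked_last[row][0])
        row = lf[ranked_last[row]]
    return "".join(reversed(chars))

print(inverse_bwt("annb$aa"))  # "banana"
```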
Searching the BWT index
We search the pattern from its end backwards, similar to reconstruction, except that we maintain a range of candidate rows of the first column rather than a single row, since we are uncertain which row we are in.
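A sketch of the backward search, counting occurrences of a pattern. It keeps the half-open row range [lo, hi) whose rotations start with the pattern suffix processed so far; this version scans the range linearly for clarity, while a real FM-index answers the same rank queries in O(1) with precomputed occurrence tables:

```python
def backward_search(bwt_last, pattern):
    """Count pattern occurrences with FM-style backward search over a BWT string."""
    ranked, seen = [], {}
    for c in bwt_last:
        seen[c] = seen.get(c, 0) + 1
        ranked.append((c, seen[c]))
    first = sorted(ranked)
    lf = {pair: i for i, pair in enumerate(first)}
    lo, hi = 0, len(bwt_last)                  # all rows match the empty suffix
    for c in reversed(pattern):
        # rows in [lo, hi) whose last column is c map via LF to a
        # contiguous run of rows whose rotations start with c
        matches = [lf[ranked[i]] for i in range(lo, hi) if ranked[i][0] == c]
        if not matches:
            return 0
        lo, hi = min(matches), max(matches) + 1
    return hi - lo

print(backward_search("annb$aa", "ana"))  # 2  ("annb$aa" is the BWT of "banana$")
```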
Relation with suffix array and finding indexes of match
The rows of the BWT matrix are ordered just like the suffix array. To save space we store the suffix-array value only every ~100 entries; to locate a match we keep going backward through the LF pointers until we reach a sampled (indexed) row, then add the number of steps we took.
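A toy version of the sampled-suffix-array trick (sampling every 3rd row here instead of every 100th, just to fit the small example). Each LF step moves one position backward in the text, so if we reach a sampled row after `steps` jumps, the original position is the sampled value plus `steps`:

```python
def bwt_and_sa(text):
    """Build the BWT last column together with the suffix array (rotation starts)."""
    text += "$"
    rotations = sorted((text[i:] + text[:i], i) for i in range(len(text)))
    last = "".join(rot[-1] for rot, _ in rotations)
    sa = [i for _, i in rotations]
    return last, sa

def make_lf(last):
    """LF mapping as an array: row i of L -> the row of the same character in F."""
    ranked, seen = [], {}
    for c in last:
        seen[c] = seen.get(c, 0) + 1
        ranked.append((c, seen[c]))
    order = {pair: i for i, pair in enumerate(sorted(ranked))}
    return [order[p] for p in ranked]

def locate(row, lf, sampled, n):
    """Walk LF pointers (one text position back per step) until a sampled row."""
    steps = 0
    while row not in sampled:
        row = lf[row]
        steps += 1
    return (sampled[row] + steps) % n

last, sa = bwt_and_sa("banana")
lf = make_lf(last)
sampled = {row: sa[row] for row in range(0, len(sa), 3)}  # keep every 3rd entry
print(locate(2, lf, sampled, len(last)))  # 3, which equals sa[2]
```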
Sequence assembly – no reference genome
de-Bruijn – build a vertex for each k-mer that appears in the reads (if the read length is ~100 then k would be ~30). Build an edge from each vertex to every vertex whose prefix matches the source's suffix (an overlap of k-1 characters). A sequenced read then matches a path in this graph.
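A sketch of the construction, following the formulation above (vertices are k-mers; an edge is added for every consecutive k-mer pair observed in a read, so the (k-1)-suffix of the source equals the (k-1)-prefix of the target). k=3 is illustrative only:

```python
from collections import defaultdict

def de_bruijn(reads, k):
    """de Bruijn graph over the k-mers of the reads: edge u -> v whenever
    u and v occur at consecutive offsets in some read (overlap of k-1)."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k):
            u = read[i:i + k]
            v = read[i + 1:i + 1 + k]
            graph[u].add(v)
    return graph

g = de_bruijn(["ACGTAC", "GTACGT"], 3)
print(dict(g))  # {'ACG': {'CGT'}, 'CGT': {'GTA'}, 'GTA': {'TAC'}, 'TAC': {'ACG'}}
```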
How do we merge two short overlapping paths into a single long path?
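The basic operation behind any answer (a hedged sketch, not from the notes): consecutive k-mers on a path overlap by k-1 characters, so spelling a path out adds exactly one new character per vertex, and two paths sharing an endpoint can be spelled as one longer sequence:

```python
def spell(path):
    """Merge a path of k-mers into one sequence: each k-mer after the
    first overlaps the previous by k-1 characters, contributing 1 new one."""
    return path[0] + "".join(node[-1] for node in path[1:])

print(spell(["ACG", "CGT", "GTA", "TAC"]))  # "ACGTAC"
```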