Computational Genomics 10 – Suffix trees

Suffix trees allow for efficient multi-query. We pre-process the sequence, and build a suffix tree, which is a compressed trie. After we have that, a pattern P matches S iff it is a prefix of some suffix of S. The naive way to build a suffix tree is adding the suffixes one at a time – O(m^2).

UA (Ukkonen’s algorithm)

UA (Ukkonen’s algorithm) builds successive implicit suffix trees. At each stage we extend and add s[i+1] to s[j..i]. This is done by the extension rules:

  1. S[i] is leaf – just add a leaf under it
  2. Following S[i] is some other u, need to create a new internal node and a new leaf
  3. S[j..i+1] is already at the trie – do nothing

The naive way to do this would be O(m^3), but the using the following tricks we can improve to O(m):

  1. suffix links – as in the i phase we add the suffixes from the longest [1..i] to the shortest [i-1,i], we can build suffix links from xa to a. Then, we can use them by going up with them, them coming down for the size of the prefix!
  2. skip and count – after jumping with the suffix link, we have to walk down. We can count the number of letters on each edge, and thus keep the walking down proportional to the number of nodes, not letters. It can be shown that during a phase, we go to some node, then decrement a maximum of 2m with the up walks and suffix links, so we go down a maximum of 3m – as the deepest node is at depth m. -> overall we’re at O(m^2)
  3. rule 3 is a stopper – if a string is already in the tree, then all of its suffixes are also in the tree, we can move on to the next phase
  4. once a leaf, always a leaf – a leaf remains a leaf, it just gets added new characters. We can store indexes instead of characters, and mark the last index with the special symbol “e”. That way we can extend all the leafs from the previous phase in O(1).

Implementation issues

for each node, there’s a choice of how to represent the outgoing edges: an array, linked list , balanced tree, hash or mixture of the correct choice per node.

Applications

  1. find exact match – every node in the subtree are an occurrence
  2. generalized suffix trees – for multiple strings
    1. longest common substring – node with both type of descendants
    2. LCA of two suffixes (leaves) represent their LCP – we can preproces and answer such queries in constant time
    3. maximal palindrome – insert S and Sr, for every i find the LCP of the corresponding suffixes of S and Sr

Suffix array

All suffixes sorted lexicographic order. We build a ST and DFS traverse it. We search for a pattern with a binary search.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s