Some BioInformatics: Suffix Tree Construction and the Longest Repeated Substring Problem in Python

The following problem appeared as an assignment in the coursera course BioInformatics Algorithms II (by UCSD). The following description of the problem is taken from the course assignment, Rosalind (http://rosalind.info/) and Stepik (https://stepik.org/).

The following problems define a couple of data structures for fast pattern matching: a collection of (smaller) strings Patterns that we wish to match against a (larger) genome string Text.

Creating a Trie from Patterns

we would like to organize Patterns into a data structure to prevent multiple passes down Text and to reduce the runtime. To this end, we will consolidate Patterns into a directed tree called a trie (pronounced “try”), which is written Trie(Patterns) and has the following properties (example shown in the next figure).

  1. The trie has a single node with indegree 0, called the root.
  2. Each edge of Trie(Patterns) is labeled with a letter of the alphabet.
  3. Edges leading out of a given node have distinct labels.
  4. Every string in Patterns is spelled out by concatenating the letters along some path from
  5. the root downward.
  6. Every path from the root to a leaf, or node with outdegree 0, spells a string from Patterns.

im00.png

im01.png

Creating a Suffix Trie

A suffix trie, denoted SuffixTrie(Text), is the trie formed from all suffixes of Text (see figure below). A dollar sign (“$”) is appended to Text in order to mark the end of Text. We will also label each leaf of the resulting trie by the starting position of the suffix whose path through the trie ends at this leaf (using 0 based indexing). This way, when we arrive at a leaf, we will immediately know where this suffix came from in Text.

The following pseudocode constructs the suffix trie of a string Text by traversing the suffixes of Text from longest to shortest. Given a suffix, it attempts to spell the suffix downward by following edge labels as far as possible until it can go no further. At that point, it adds the rest of the suffix to the trie in the form of a path to a leaf, along with the position of each symbol in the suffix.

There are |Text| suffixes of Text, ranging in length from 1 to |Text| and having total length O(|Text|2)O(|Text|2). Thus, we need to reduce both the construction time and memory requirements of suffix tries to make them practical. We can reduce the number of edges in SuffixTrie(Text) by combining the edges on any nonbranching path into a single edge. We then label this edge with the concatenation of symbols on the consolidated edges, as shown in the figure below. The resulting data structure is called a suffix tree, written SuffixTree(Text).

Prove: A SuffixTree(Text) has exactly |Text| leaves (since each leaf correspond to a suffix of Text) and at most |Text| other nodes (in the worst case Text can be repetition of the same symbol, in which case the suffix tree will be linear and it will have |Text| other nodes).

Creating a Suffix Tree

Given a string s having length L, recall that its suffix tree T is defined by the following properties:

  1. T is a rooted tree having exactly L+1 leaves.
  2. Every edge of T is labeled with a substring of s$, where s$ is the string formed by adding a placeholder symbol $ to the end of s.
  3. Every internal node of T other than the root has at least two children; i.e., it has degree at least 3.
  4. The substring labels for the edges leading down from a node to its children must begin with different symbols.
  5. By concatenating the substrings along edges, each path from the root to a leaf corresponds to a unique suffix of s.

The following figures show example Suffix Trie and Suffix Tree created from a Text and also the algorithm pseudocodes.

im0.png

im1.png

The following figures show how the suffix tree is constructed for the Text panamabananas$ as all the suffixes are inserted into the suffix tree starting from the largest to the smallest suffix.

Inserting suffix panamabananas$ to the empty tree
tree0
Inserting suffix anamabananas$ to the tree
tree1
Inserting suffix namabananas$ to the tree
tree2
Inserting suffix amabananas$ to the tree
tree3
Inserting suffix mabananas$ to the tree
tree4
Inserting suffix abananas$ to the tree
tree5
Inserting suffix bananas$ to the tree
tree6
Inserting suffix ananas$ to the tree
tree7
Inserting suffix nanas$ to the tree
tree8
Inserting suffix anas$ to the tree
tree9
Inserting suffix nas$ to the tree
tree10
Inserting suffix as$ to the tree
tree11
Inserting suffix s$ to the tree

tree12

Inserting suffix $ to the tree
tree13

The next figures show how the suffix tree is constructed for the genome Text ATAAATG$.

tree0
tree1
tree2
tree3
tree4
tree5tree6tree7

The next figures show how the suffix tree is constructed for the genome TextTCTGAGCCCTACTGTCGAGAAATATGTATCTCGCCCCCGCAGCTT$.tree0tree1tree2tree3tree4tree5tree6tree7tree8tree9tree10tree11tree12tree13tree14tree15tree16tree17tree18tree19tree20tree21tree22tree23tree24tree25tree26tree27tree28tree29tree30tree31tree32tree33tree34tree35tree36tree37tree38tree39tree40tree41tree42tree43tree44tree45

A more efficient implementation for suffix tree construction can be done using the  Ukkonen’s suffix tree algorithm.

The Longest  Repeated Substring Problem

im2

As we can see from the following figure, in order to find any repeated pattern inside the Text,

  1. We need to start at the root from the suffix tree and traverse towards the internal nodes (concatenating all the edges from the root to any of the internal nodes represents a repeated pattern inside the text).
  2. Choose the longest of all such texts.

In the next figure the paths colors red, blue and green each starting from the root and ending in some internal node of the suffix tree represent repeated patterns.

lptree.png

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s