Some Bioinformatics: Global and Local Alignment of Genes with Dynamic Programming in Python

The following problem appeared as a project in the coursera course Algorithmic Thinking (by RICE university), a part of Fundamentals of Computing specialization. The following description of the problem is taken directly from the project description.

In this article, a central problem in evolutionary and molecular biology, namely pairwise sequence alignment, will be discussed. In particular, let’s reason about the structure of the problem, turn it into an optimization problem, and investigate two attempts at solving it. Then conclude with assessing the significance of alignments computed. The dynamic programming algorithmic techniques will be used for solving the problems. The following figures show the DP algorithms to be used for sequence alignments.

im13.png

im14.png

im15.png

im16.png

Let’s compute the global alignment between the sequences X = ACCT and Y = TACGGT with the following scoring matrix:

im24
The following animation shows the global pairwise alignment between the sequences.
out.gif
im25.png

Problems

The following figure shows the amino acid sequences that form the eyeless proteins in the human and fruit fly genomes, respectively. The scoring matrix PAM50 is shown below, which is defined over the alphabet {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,B,Z,X,-} which represents all possible amino acids and gaps (the “dashes” in the alignment).

im17.png

im18.png

Next, let’s compute the local alignments of the sequences of HumanEyelessProtein and FruitflyEyelessProtein using the PAM50 scoring matrix and enter the score and local alignments for these two sequences below. The next figure shows the heatmap of the local alignment matrix computed.

im19.png

The local alignment score computed is 875 and the following figure shows the local alignments of the two sequences.

 im20
To continue our investigation, we next consider the similarity of the two sequences in the local alignment computed in the last question to a third sequence, a “consensus” sequence of the PAX domain: that is, the sequence of amino acids in the PAX domain in any organism. In this problem, we will compare each of the two sequences of the local alignment computed earlier to this consensus sequence to determine whether they correspond to the PAX domain. The following image shows the consensus sequence.
im21.png
Let’s first compute the global alignment of this dash-less sequence with the ConsensusPAXDomain sequence. Then let’s compare corresponding elements of these two globally-aligned sequences (local vs. consensus) and compute the percentage of elements in these two sequences that agree. To reiterate, we need to compute the global alignments of local human vs. consensus PAX domain as well as local fruitfly vs. consensus PAX domain. The following images show the heatmaps of the global alignment matrices.
im22.png

Is it likely that the level of similarity exhibited by the answers could have been due to chance? For the local human vs. consensus PAX domain alignment, the chance of 97 corresponding elements matching between two 133 element sequences whose elements are chosen at random from a 23 character alphabet is less than 10(100)10(−100). So, it’s very highly unlikely that the level of similarity exhibited by the two original sequences HumanEyelessProtein and FruitflyEyelessProtein could have been due to chance.

Hypothesis testing

One weakness of our last approach was that we assumed that the probability of any particular amino acid appearing at a particular location in a protein was equal. In the next two questions, we will consider a more mathematical approach to answering the above question that avoids this assumption. In particular, we will take an approach known as statistical hypothesis testing to determine whether the local alignments computed are statistically significant (that is, that the probability that they could have arisen by chance is extremely small).

Write a function generate_null_distribution(seq_x, seq_y, scoring_matrix, num_trials) that takes as input two sequences seq_x and seq_y, a scoring matrix scoring_matrix M, and a number of trials num_trials. This function should return a dictionary scoring_distribution that represents an un-normalized distribution generated by performing the following process num_trials times:

  1. Generate a random permutation rand_y of the sequence seq_y using random.shuffle().
  2. Compute the maximum value score for the local alignment of seq_x and rand_y using the score matrix scoring_matrix .
  3. Increment the entry score in the dictionary scoring_distribution by one. Use the function generate_null_distribution to create a distribution with trials using the protein sequences HumanEyelessProtein and FruitflyEyelessProtein (using the PAM50 scoring matrix). Then, create a bar plot of the normalized version of this distribution.

im23.png

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s