The following problem appeared as a project in the *coursera course Algorithmic Thinking (by RICE university)*, a part of *Fundamentals of Computing* specialization. The following description of the problem is taken directly from the project description.

In this article, a central problem in evolutionary and molecular biology, namely *pairwise sequence alignment*, will be discussed. In particular, let’s reason about the structure of the problem, turn it into an *optimization problem*, and investigate two attempts at solving it. Then conclude with assessing the significance of alignments computed. The **dynamic programming** algorithmic techniques will be used for solving the problems. The following figures show the *DP* algorithms to be used for sequence alignments.

Let’s compute the **global alignment** between the sequences **X = ACCT** and **Y = TACGGT **with the following **scoring matrix:**

The following animation shows the **global pairwise alignment** between the sequences.

**Problems
**

The following figure shows the

*amino acid sequences*that form the

**eyeless proteins**in the

**human**and

**fruit fly genomes**, respectively. The

**scoring**

**matrix**

**is shown below, which is defined over the alphabet {**

*PAM50**A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V,B,Z,X,-}*which represents all possible amino acids and gaps (the “dashes” in the alignment).

Next, let’s compute the **local alignments** of the sequences of HumanEyelessProtein and FruitflyEyelessProtein using the PAM50 scoring matrix and enter the **score** and **local alignments** for these two sequences below. The next figure shows the *heatmap* of the **local alignment matrix** computed.

The **local alignment score** computed is **875** and the following figure shows the **local alignments** of the two sequences.

**local alignment**computed in the last question to a third sequence, a

*“consensus”*sequence of the

*: that is, the sequence of amino acids in the PAX domain in any organism. In this problem, we will compare each of the two sequences of the local alignment computed earlier to this*

**PAX**domain*consensus sequence*to determine whether they correspond to the PAX domain. The following image shows the

**.**

*consensus sequence***global alignment**of this dash-less sequence with the ConsensusPAXDomain sequence. Then let’s compare corresponding elements of these two globally-aligned sequences (local vs. consensus) and compute the percentage of elements in these two sequences that agree. To reiterate, we need to compute the global alignments of local human vs. consensus PAX domain as well as local fruitfly vs. consensus PAX domain. The following images show the heatmaps of the

**global alignment**matrices.

Is it likely that the level of similarity exhibited by the answers could have been due to chance? For the local human vs. consensus PAX domain alignment, the chance of 97 corresponding elements matching between two 133 element sequences whose elements are chosen at random from a 23 character alphabet is less than 10(−100)10(−100). So, it’s very highly unlikely that the level of similarity exhibited by the two original sequences HumanEyelessProtein and FruitflyEyelessProtein could have been due to chance.

### Hypothesis testing

One weakness of our last approach was that we assumed that the probability of any particular amino acid appearing at a particular location in a protein was equal. In the next two questions, we will consider a more mathematical approach to answering the above question that avoids this assumption. In particular, we will take an approach known as statistical hypothesis testing to determine whether the local alignments computed are statistically significant (that is, that the probability that they could have arisen by chance is extremely small).

Write a function generate_null_distribution(seq_x, seq_y, scoring_matrix, num_trials) that takes as input two sequences seq_x and seq_y, a scoring matrix scoring_matrix M, and a number of trials num_trials. This function should return a dictionary scoring_distribution that represents an un-normalized distribution generated by performing the following process num_trials times:

- Generate a random permutation rand_y of the sequence seq_y using random.shuffle().
- Compute the maximum value score for the local alignment of seq_x and rand_y using the score matrix scoring_matrix .
- Increment the entry score in the dictionary scoring_distribution by one. Use the function generate_null_distribution to create a distribution with trials using the protein sequences HumanEyelessProtein and FruitflyEyelessProtein (using the PAM50 scoring matrix). Then, create a bar plot of the normalized version of this distribution.