# Some Natural Language Processing: Using Trigram Hidden Markov Models and Viterbi Decoding to Tag Genes in Biological Text in Python

This problem appeared as a programming assignment in the Coursera course Natural Language Processing (NLP) by Columbia University. The following problem description is taken directly from the assignment.

In this assignment, we need to build a trigram hidden Markov model to identify gene names in biological text. Under this model, we define the joint probability of a sentence x_1, x_2, …, x_n and a tag sequence y_1, y_2, …, y_n. A basic unigram HMM is shown in the figure below.

Our goal in this assignment is a trigram HMM, under which the joint probability is defined as follows:

p(x_1, …, x_n, y_1, …, y_{n+1}) = ∏_{i=1}^{n+1} q(y_i | y_{i-2}, y_{i-1}) × ∏_{i=1}^{n} e(x_i | y_i),  with y_{-1} = y_0 = * and y_{n+1} = STOP,

where * is a padding symbol that indicates the beginning of a sentence and STOP is a special HMM state indicating the end of a sentence. The task is to implement this probabilistic model and a decoder for finding the most likely tag sequence for new sentences.
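This factorization can be evaluated directly in code. The sketch below assumes `q` and `e` are plain dictionaries of estimated transition and emission probabilities (the toy parameter values are for illustration only):

```python
import math

def log_joint_probability(words, tags, q, e):
    """Log joint probability of a (sentence, tag sequence) pair under a
    trigram HMM: sum of log q(y_i | y_{i-2}, y_{i-1}) and log e(x_i | y_i)."""
    padded = ['*', '*'] + list(tags) + ['STOP']
    logp = 0.0
    # transition terms, including the final transition into STOP
    for i in range(2, len(padded)):
        logp += math.log(q[(padded[i - 2], padded[i - 1], padded[i])])
    # emission terms, one per observed word
    for word, tag in zip(words, tags):
        logp += math.log(e[(word, tag)])
    return logp

# toy parameters (assumed values, not taken from the trained model)
q = {('*', '*', 'O'): 0.9, ('*', 'O', 'O'): 0.9, ('O', 'O', 'STOP'): 0.5}
e = {('the', 'O'): 0.1, ('cat', 'O'): 0.01}
print(log_joint_probability(['the', 'cat'], ['O', 'O'], q, e))
```

Working in log space avoids numerical underflow, which matters once sentences get long and the per-term probabilities are small.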

A labeled training dataset gene.train, labeled and unlabeled versions of the development set (gene.key and gene.dev), and an unlabeled test set gene.test are provided. The labeled files take the format of one word per line, with word and tag separated by a space, and a single blank line separating sentences, e.g.,

```
Comparison O
with O
alkaline I-GENE
phosphatases I-GENE
and O
5 I-GENE
I-GENE
nucleotidase I-GENE

Pharmacologic O
aspects O
of O
neonatal O
hyperbilirubinemia O
. O
```
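A file in this format can be parsed into a list of sentences with a few lines of Python. This is a sketch; the function name is my own:

```python
def read_tagged_sentences(path):
    """Parse a labeled file: one 'word tag' pair per line, with a blank
    line between sentences. Returns a list of [(word, tag), ...] lists."""
    sentences, current = [], []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:                  # a blank line ends a sentence
                if current:
                    sentences.append(current)
                    current = []
            else:
                word, tag = line.rsplit(' ', 1)
                current.append((word, tag))
    if current:                           # file may not end with a blank line
        sentences.append(current)
    return sentences
```

Using `rsplit(' ', 1)` splits on the last space only, so words that themselves contain spaces-like characters are not mangled.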

The following shows a graphical model representation of another sentence to be tagged:

An unlabeled test dataset is also provided; the unlabeled file contains only the words of each sentence and will be used to evaluate the performance of the model developed.

There are 13796 sentences in the training dataset, whereas the dev dataset contains 509 sentences.

The task consists of identifying gene names within biological text. In this dataset there is one type of entity: gene (GENE). The dataset is adapted from the BioCreAtIvE II shared task (http://biocreative.sourceforge.net/biocreative_2.html).

Here are the steps that are to be followed:

1. Estimate the parameters of the Trigram HMM from the training data.
2. Use the learnt parameters to compute the most likely tag sequence with Viterbi decoding, a dynamic programming algorithm.

### Parameters estimated for HMM

The q parameters (trigram transition probabilities) for the HMM tagger, computed with MLE from the training dataset, are shown below.

```
('*', '*', 'I-GENE'): 0.05429109886923746,
('*', '*', 'O'): 0.9457089011307626,
('*', 'I-GENE', 'I-GENE'): 0.6034712950600801,
('*', 'I-GENE', 'O'): 0.3951935914552737,
('*', 'I-GENE', 'STOP'): 0.0013351134846461949,
('*', 'O', 'I-GENE'): 0.04545106154671572,
('*', 'O', 'O'): 0.9543190005365219,
('*', 'O', 'STOP'): 0.00022993791676247414,
('I-GENE', 'I-GENE', 'I-GENE'): 0.6057704112952732,
('I-GENE', 'I-GENE', 'O'): 0.3937794147738899,
('I-GENE', 'I-GENE', 'STOP'): 0.0004501739308369143,
('I-GENE', 'O', 'I-GENE'): 0.20999759384023098,
('I-GENE', 'O', 'O'): 0.6809432146294514,
('I-GENE', 'O', 'STOP'): 0.10905919153031761,
('O', 'I-GENE', 'I-GENE'): 0.5778575025176234,
('O', 'I-GENE', 'O'): 0.422079556898288,
('O', 'I-GENE', 'STOP'): 6.294058408862034e-05,
('O', 'O', 'I-GENE'): 0.03741872901853501,
('O', 'O', 'O'): 0.9246458312860389,
('O', 'O', 'STOP'): 0.037935439695426
```
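The MLE estimate for each transition is simply a ratio of counts, q(s | u, v) = count(u, v, s) / count(u, v). A minimal sketch (assuming `tag_sequences` is a list of tag lists extracted from gene.train):

```python
from collections import defaultdict

def estimate_q(tag_sequences):
    """MLE trigram transition estimates:
    q(s | u, v) = count(u, v, s) / count(u, v)."""
    trigram_counts = defaultdict(int)
    bigram_counts = defaultdict(int)
    for tags in tag_sequences:
        # pad with two '*' at the start and STOP at the end, as in the model
        padded = ['*', '*'] + list(tags) + ['STOP']
        for i in range(2, len(padded)):
            trigram_counts[(padded[i - 2], padded[i - 1], padded[i])] += 1
            bigram_counts[(padded[i - 2], padded[i - 1])] += 1
    return {(u, v, s): c / bigram_counts[(u, v)]
            for (u, v, s), c in trigram_counts.items()}

q = estimate_q([['O', 'I-GENE', 'O'], ['O', 'O']])
print(q[('*', '*', 'O')])  # 1.0: both toy sentences start with O
```

The emission parameters e(x | y) are estimated the same way, as count(y → x) / count(y).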

The following result shows the tag sequence generated by the trigram HMM with the baseline tagger, compared with the gold-standard tag sequence on the dev dataset for one sentence. The accuracy in terms of F1-score is quite low, around 25.6%, as expected.
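The baseline tagger picks, for each word independently, the tag that maximizes the emission probability. A sketch under assumed conventions (the emission dictionary `e`, the `_RARE_` fallback symbol, and the toy values are my own stand-ins):

```python
def baseline_tag(words, e, tags=('O', 'I-GENE'), rare='_RARE_'):
    """Tag each word independently with argmax_y e(word | y),
    falling back to the rare-word symbol for unseen words."""
    result = []
    for word in words:
        # substitute the rare symbol if the word was never seen in training
        key = word if any((word, t) in e for t in tags) else rare
        result.append(max(tags, key=lambda t: e.get((key, t), 0.0)))
    return result

# toy emission estimates (assumed values for illustration only)
e = {('phosphatases', 'I-GENE'): 0.002, ('phosphatases', 'O'): 0.0001,
     ('with', 'O'): 0.01, ('_RARE_', 'O'): 0.05, ('_RARE_', 'I-GENE'): 0.01}
print(baseline_tag(['with', 'phosphatases', 'xyz'], e))  # → ['O', 'I-GENE', 'O']
```

Because it ignores the transition probabilities entirely, this tagger over-predicts I-GENE for any word that happens to appear more often inside gene names, which is why its precision is so poor.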

The Viterbi Decoding Algorithm to be implemented is shown in the next figure:

The following result shows the tag sequence generated by the trigram HMM with the Viterbi tagger, compared with the gold-standard tag sequence on the dev dataset for the same sentence. The accuracy in terms of F1-score improves considerably, to around 39.9% (fewer words are tagged incorrectly), as expected.
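The algorithm in the figure can be sketched as follows. This is one possible implementation, not the assignment's reference code; the tiny probability floor for unseen events is a simplification (the assignment handles unseen words via rare-word replacement instead):

```python
import math

def viterbi(words, q, e, tags=('O', 'I-GENE')):
    """Trigram-HMM Viterbi decoding. pi[(k, u, v)] is the log probability
    of the best tag sequence whose last two tags at position k are (u, v)."""
    def K(k):  # possible tags at position k (positions <= 0 are padded)
        return ('*',) if k <= 0 else tags

    n = len(words)
    pi = {(0, '*', '*'): 0.0}
    bp = {}
    for k in range(1, n + 1):
        for u in K(k - 1):
            for v in K(k):
                best, arg = float('-inf'), None
                for w in K(k - 2):
                    score = (pi.get((k - 1, w, u), float('-inf'))
                             + math.log(q.get((w, u, v), 1e-12))
                             + math.log(e.get((words[k - 1], v), 1e-12)))
                    if score > best:
                        best, arg = score, w
                pi[(k, u, v)] = best
                bp[(k, u, v)] = arg
    # pick the best final pair (u, v), including the transition into STOP
    best, (u, v) = float('-inf'), (None, None)
    for uu in K(n - 1):
        for vv in K(n):
            score = pi[(n, uu, vv)] + math.log(q.get((uu, vv, 'STOP'), 1e-12))
            if score > best:
                best, (u, v) = score, (uu, vv)
    # follow back-pointers to recover the full tag sequence
    result = [u, v]
    for k in range(n, 2, -1):
        result.insert(0, bp[(k, result[0], result[1])])
    return result[-n:] if n >= 2 else result[-1:]

# toy parameters (assumed values for illustration only)
q = {('*', '*', 'O'): 1.0, ('*', 'O', 'I-GENE'): 0.5, ('*', 'O', 'O'): 0.5,
     ('O', 'I-GENE', 'STOP'): 1.0, ('O', 'O', 'STOP'): 1.0}
e = {('the', 'O'): 1.0, ('gene', 'I-GENE'): 0.9, ('gene', 'O'): 0.1}
print(viterbi(['the', 'gene'], q, e))  # → ['O', 'I-GENE']
```

The dynamic program runs in O(n |K|^3) time for a tag set K, versus the O(|K|^n) cost of enumerating every tag sequence.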

Now, HMM taggers can be improved by grouping words into informative word classes rather than a single class of rare words. For this part we need to implement four rare-word classes:

1. Numeric: the word is rare and contains at least one numeric character.
2. All Capitals: the word is rare and consists entirely of capital letters.
3. Last Capital: the word is rare, not all capitals, and ends with a capital letter.
4. Rare: the word is rare and does not fit in the other classes.

These can be implemented by replacing words in the original training data and re-learning the HMM parameters with MLE. Be sure to also replace words while testing. The expected total development F1-Score is 0.42.
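One way to sketch the class mapping (the class token names such as `_NUMERIC_` are my own; the assignment leaves the exact symbols to the implementer):

```python
def rare_class(word):
    """Map a rare word to one of the four informative classes,
    checked in order so each word falls into exactly one class."""
    if any(ch.isdigit() for ch in word):
        return '_NUMERIC_'
    if word.isalpha() and word.isupper():
        return '_ALL_CAPITALS_'
    if word and word[-1].isalpha() and word[-1].isupper():
        return '_LAST_CAPITAL_'
    return '_RARE_'

print(rare_class('5alpha'))   # → _NUMERIC_
print(rare_class('BMP'))      # → _ALL_CAPITALS_
print(rare_class('mRNA'))     # → _LAST_CAPITAL_
print(rare_class('kinase'))   # → _RARE_
```

These classes matter in this domain because gene names are disproportionately likely to contain digits or capital letters (BMP, mRNA), so even an unseen word carries a useful signal about its tag.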

The following result shows the tag sequences generated by the trigram HMM with the Viterbi tagger with rare-word groupings, compared with the gold-standard tag sequences and with the Viterbi tagger without rare-word groupings, on the dev dataset for a couple of sentences. The accuracy in terms of F1-score improves further, to around 42%, as expected (still fewer words are tagged incorrectly).

The following figures show some outputs produced with Viterbi using rare-word groups.

## Summary of Results

| Tagger | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Baseline | 0.158861 | 0.660436 | 0.256116 |
| Viterbi | 0.541555 | 0.314642 | 0.398030 |
| Viterbi with rare-word groups | 0.534940 | 0.345794 | 0.420057 |