Some NLP: Spelling Correction with Noisy Channel Model in Python

The following problem appeared as an assignment in the Coursera course Natural Language Processing (by Stanford University). The description of the problem below is taken from the assignment description and the course slides.

The following figure shows the basic concepts of spelling correction using the noisy channel model. Given a misspelled word, the most probable correct word can be computed by:

  1. generating the candidate correct words (from the vocabulary) with some edit-distance model,
  2. estimating the prior probabilities with a language model and the likelihood (channel) probabilities with an error model,
  3. computing the posterior probabilities and the MAP estimate using Bayes' theorem (a minimal sketch of this decision rule follows the figure below).

[Figure im0.png: the noisy channel model for spelling correction]
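As a quick illustration, the following is a minimal sketch (in Python) of the noisy channel decision rule described above: pick the candidate c that maximizes P(x | c) · P(c), computed in log space. The helper functions candidates, channel_logprob and prior_logprob are placeholders assumed to be provided by the edit-distance, error and language models; they are not part of the assignment's actual code.

```python
def correct_word(misspelled, candidates, channel_logprob, prior_logprob):
    """Return the MAP estimate argmax_c P(misspelled | c) * P(c), in log space."""
    best_word, best_score = misspelled, float("-inf")
    for c in candidates(misspelled):                  # candidate correct words
        # log-likelihood from the channel (error) model + log-prior from the LM
        score = channel_logprob(misspelled, c) + prior_logprob(c)
        if score > best_score:
            best_word, best_score = c, score
    return best_word
```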

The following figure shows how the candidate words (possible correct words) can be generated (mostly the words within edit distance 1 of the misspelled word), and how the prior and the likelihood (insertion / deletion etc.) probabilities can then be estimated (with MLE) from the training corpus.

[Figure im0_1.png: candidate generation and probability estimation]
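The following is a minimal sketch of how the candidate set within edit distance 1 can be generated: all single-character deletes, transpositions, replacements and insertions, filtered against the vocabulary. The function names edits1 / candidates and the vocabulary set are illustrative assumptions, not the assignment's interface.

```python
import string

def edits1(word):
    """All strings within edit distance 1 of word."""
    letters = string.ascii_lowercase
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def candidates(word, vocabulary):
    """The word itself (if known) plus its in-vocabulary 1-edit neighbours."""
    return ({word} & vocabulary) | (edits1(word) & vocabulary)
```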
The following figure shows the real-word spelling correction method using the noisy channel model; we are going to use these techniques to solve the spelling-correction problem. Given a k-word sentence W = w_1, w_2, …, w_k, we apply the following steps to compute the most probable sentence with the (possibly corrected) misspelled words:

  1. For each word w_i, generate the candidate set (e.g., the words within edit distance 1 of w_i).
  2. Use a language model to compute the probability P(W) of each candidate sequence and then pick the most probable word sequence W (a sketch of this search follows the figure below).

[Figure im0_2.png: real-word spelling correction with the noisy channel model]
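A minimal sketch of the sentence-level search, assuming, for simplicity, at most one corrected word per sentence (this restriction is an assumption made to keep the search small): every position is tried with each of its 1-edit candidates, each candidate sentence is scored with the language model, and the highest-scoring one is kept. The lm_score argument (log P(W)) and the candidates helper are the hypothetical pieces sketched earlier.

```python
def correct_sentence(words, vocabulary, lm_score):
    """words: list of tokens; returns the highest-scoring candidate sentence."""
    best, best_score = words, lm_score(words)
    for i, w in enumerate(words):
        for cand in candidates(w, vocabulary):        # 1-edit candidates of w
            if cand == w:
                continue
            trial = words[:i] + [cand] + words[i + 1:]
            score = lm_score(trial)                   # log P(W) under the LM
            if score > best_score:
                best, best_score = trial, score
    return best
```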

 

The following figure shows the different language models that are going to be used to estimate the MLE probabilities (with and without smoothing), and how they can be used to compute the probability of a sentence (i.e., a sequence of words).

[Figure im1.png: language models and sentence probability computation]
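As one concrete example, the following is a sketch of a Laplace (add-one) smoothed bigram model and of how it scores a sentence as a sum of bigram log-probabilities. The class name and the use of an <s> start token are illustrative choices, not the assignment's actual classes.

```python
import math
from collections import Counter

class LaplaceBigramLM:
    def __init__(self, sentences):
        """sentences: iterable of token lists from the training corpus."""
        self.unigrams, self.bigrams = Counter(), Counter()
        for sent in sentences:
            tokens = ["<s>"] + sent
            self.unigrams.update(tokens)
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.V = len(self.unigrams)                   # vocabulary size for add-one

    def logprob(self, sentence):
        """log P(sentence) as a sum of add-one smoothed bigram log-probabilities."""
        tokens = ["<s>"] + sentence
        score = 0.0
        for prev, cur in zip(tokens, tokens[1:]):
            num = self.bigrams[(prev, cur)] + 1
            den = self.unigrams[prev] + self.V
            score += math.log(num / den)
        return score
```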

 

  1. Given a training corpus, the parameters of each language model are first estimated from the corpus.
  2. These probabilities are then used to compute the most probable (possibly spell-corrected) sequence of candidate words for each sentence in the dev corpus.
  3. For each language model, the spell-correction accuracy on the dev corpus is computed against the ground truth and reported (a sketch of this evaluation loop follows).
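A sketch of this evaluation loop, assuming the dev corpus is available as (misspelled sentence, ground-truth sentence) pairs; correct_sentence and the model's logprob method are the hypothetical pieces sketched above, and the printed format simply mirrors the results reported below.

```python
def evaluate(model_lm, dev_pairs, vocabulary):
    """Exact-match spell-correction accuracy of one language model on the dev set."""
    correct = 0
    for noisy_sentence, gold_sentence in dev_pairs:
        guess = correct_sentence(noisy_sentence, vocabulary, model_lm.logprob)
        if guess == gold_sentence:
            correct += 1
    total = len(dev_pairs)
    accuracy = correct / total
    print("correct: %d total: %d accuracy: %f" % (correct, total, accuracy))
    return accuracy
```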

Results

The following results show the accuracy of spell-correction using the different language models on the dev corpus:

  1. Uniform Language Model: correct: 31 total: 471 accuracy: 0.065817
  2. Unigram Language Model: correct: 29 total: 471 accuracy: 0.061571
  3. Laplace Unigram Language Model: correct: 52 total: 471 accuracy: 0.110403
  4. Laplace Bigram Language Model: correct: 64 total: 471 accuracy: 0.135881
  5. Stupid Backoff Language Model: correct: 85 total: 471 accuracy: 0.180467
  6. Custom Language Model (simple linear interpolation with λ1 = 0.17, λ2 = 0.39, λ3 = 0.44; sketched after the figure below): correct: 91 total: 471 accuracy: 0.193206

[Figure im5.png: spell-correction accuracy of the different language models on the dev corpus]
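For the custom model, the following is a sketch of simple linear interpolation, assuming λ1, λ2 and λ3 weight the unigram, bigram and trigram MLE estimates respectively (that mapping, the class name and the <s> padding are illustrative assumptions, not the assignment's definitions).

```python
import math
from collections import Counter

class InterpolatedLM:
    def __init__(self, sentences, l1=0.17, l2=0.39, l3=0.44):
        self.l1, self.l2, self.l3 = l1, l2, l3
        self.uni, self.bi, self.tri = Counter(), Counter(), Counter()
        self.total = 0
        for sent in sentences:
            t = ["<s>", "<s>"] + sent                 # pad so every word has two predecessors
            self.uni.update(t)
            self.bi.update(zip(t, t[1:]))
            self.tri.update(zip(t, t[1:], t[2:]))
            self.total += len(t)

    def logprob(self, sentence):
        """log P(sentence) with P = l1*P_uni + l2*P_bi + l3*P_tri (MLE estimates)."""
        t = ["<s>", "<s>"] + sentence
        score = 0.0
        for w1, w2, w3 in zip(t, t[1:], t[2:]):
            p_uni = self.uni[w3] / self.total
            p_bi = self.bi[(w2, w3)] / self.uni[w2] if self.uni[w2] else 0.0
            p_tri = self.tri[(w1, w2, w3)] / self.bi[(w1, w2)] if self.bi[(w1, w2)] else 0.0
            p = self.l1 * p_uni + self.l2 * p_bi + self.l3 * p_tri
            score += math.log(p) if p > 0 else float("-inf")
        return score
```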

The following table shows some sentences from the dev corpus containing misspelled word(s), along with the ground truths and the corrected outputs produced by the noisy channel model with the different language models.
[Figure im4.png: example dev-corpus sentences, ground truths and corrected outputs]

The next figure shows the accuracy of the simple linear interpolation language model on the dev corpus for different values of λ1 and λ2.

[Figure im2.png: interpolation-model accuracy for different λ1, λ2 values]
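The λ values can be tuned with a simple grid search on the dev set. The following hypothetical sketch fixes λ3 = 1 − λ1 − λ2 so the weights sum to one, and relies on the InterpolatedLM and evaluate sketches above together with assumed train_sentences, dev_pairs and vocabulary variables.

```python
grid = [i * 0.05 for i in range(21)]                  # 0.00, 0.05, ..., 1.00
best = (0.0, None, None, None)
for l1 in grid:
    for l2 in grid:
        if l1 + l2 > 1.0 + 1e-9:                      # keep the weights in the simplex
            continue
        l3 = 1.0 - l1 - l2
        lm = InterpolatedLM(train_sentences, l1, l2, l3)
        acc = evaluate(lm, dev_pairs, vocabulary)
        if acc > best[0]:
            best = (acc, l1, l2, l3)
print("best accuracy %f at lambda1=%.2f lambda2=%.2f lambda3=%.2f" % best)
```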