The following problem appeared as an assignment in the Coursera course Natural Language Processing (by Stanford University). The description of the problem below is taken from the assignment description and the course slides.
The following figure shows the basic concepts of spelling correction using the noisy channel model. Given the misspelled word, the most probable correct word can be computed by
- generating the possible correct words (from the vocabulary) with some edit-distance model.
- estimating the prior and the likelihood probabilities using some language model.
- computing the posterior probabilities and the MAP estimate using Bayes' theorem.
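As a toy sketch of the MAP step: pick the candidate w maximizing P(x|w)·P(w). The candidate words, prior, and likelihood numbers below are made up for illustration, not taken from the assignment.

```python
import math

def map_correction(candidates, prior, likelihood, misspelled):
    """Return argmax_w P(x|w) * P(w); log-space avoids underflow."""
    return max(candidates,
               key=lambda w: math.log(likelihood(misspelled, w)) + math.log(prior(w)))

# Toy example with made-up probabilities (hypothetical values):
prior = {"across": 3e-5, "acres": 2e-5}.get
likelihood = lambda x, w: {("acress", "across"): 2e-4,
                           ("acress", "acres"): 3e-3}[(x, w)]
print(map_correction(["across", "acres"], prior, likelihood, "acress"))  # → acres
```

Here "acres" wins because its much higher likelihood (edit probability) outweighs its slightly lower prior.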
The following figure shows how the candidate words (possible correct words) can be generated (mostly the words within edit distance 1) and then a suitable language model can be used to estimate (with MLE) the prior and the likelihood (insertion / deletion etc.) probabilities from the training corpus.
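Candidate generation within edit distance 1 can be sketched Norvig-style; the alphabet and the tiny vocabulary here are illustrative:

```python
def edits1(word, alphabet="abcdefghijklmnopqrstuvwxyz"):
    """All strings within edit distance 1 of `word` (deletion, transposition,
    substitution, insertion)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

# The candidate set is the generated strings that appear in the vocabulary:
vocab = {"the", "thew", "threw", "they"}
print(sorted(edits1("thw") & vocab))  # → ['the', 'thew']
```

Note that "threw" is excluded: it is two edits away from "thw".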
The following figure shows the real-word spelling correction method using the noisy channel model; we are going to use these techniques to solve the spelling-correction problem. Given a k-word sentence w_1, w_2, ..., w_k, we are going to apply the following steps to compute the most probable sentence with the possibly misspelled words corrected:
- For each word w_i, generate the candidate set (e.g., the words that are within edit distance 1).
- Use a language model to compute the probability P(W) of each candidate word sequence W, and then find the most probable sequence.
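The two steps above can be sketched as follows, assuming at most one correction per sentence and hypothetical `candidates` and `score` helpers (edit-distance-1 generation and a language-model score):

```python
def correct_sentence(sentence, candidates, score):
    """Try replacing each word w_i by each of its candidates; keep the
    highest-scoring resulting sentence."""
    best, best_score = sentence, score(sentence)
    for i, w in enumerate(sentence):
        for c in candidates(w):
            trial = sentence[:i] + [c] + sentence[i + 1:]
            s = score(trial)
            if s > best_score:
                best, best_score = trial, s
    return best

# Toy check with a unigram-frequency "language model" (illustrative data):
freq = {"he": 5, "ate": 3, "the": 10, "apple": 2}
score = lambda ws: sum(freq.get(w, 0) for w in ws)
candidates = lambda w: {"atr": {"ate", "art"}}.get(w, {w})
print(correct_sentence(["he", "atr", "the", "apple"], candidates, score))
# → ['he', 'ate', 'the', 'apple']
```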
The following figure shows the different language models that are going to be used to compute the MLE (without / with smoothing) probabilities and also how they can be used to compute the probability of a sentence (i.e. a sequence of words).
- Given a training corpus, the parameters for the language models are first estimated from the corpus.
- Then these probabilities are used to compute the most probable (possibly spell-corrected) sequence of candidate words for each sentence in the dev corpus.
- The accuracy of spell-correction on the dev corpus is computed against the ground truth and reported for each language model.
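As an illustration of the estimate-then-score pipeline, a minimal Laplace-smoothed (add-one) bigram model might look like the sketch below; it is not the assignment's exact code, and the training sentences are toy data:

```python
from collections import Counter
import math

class LaplaceBigramLM:
    """Bigram model with add-one smoothing, trained on MLE counts."""
    def __init__(self, sentences):
        self.unigrams = Counter()
        self.bigrams = Counter()
        for s in sentences:
            words = ["<s>"] + s
            self.unigrams.update(words)
            self.bigrams.update(zip(words, words[1:]))
        self.V = len(self.unigrams)  # vocabulary size for add-one smoothing

    def score(self, sentence):
        """Log-probability: sum of log P(w_i | w_{i-1}) with add-one smoothing."""
        words = ["<s>"] + sentence
        return sum(
            math.log((self.bigrams[(a, b)] + 1) / (self.unigrams[a] + self.V))
            for a, b in zip(words, words[1:]))

lm = LaplaceBigramLM([["the", "cat", "sat"], ["the", "dog", "sat"]])
print(lm.score(["the", "cat", "sat"]) > lm.score(["the", "cat", "dog"]))  # → True
```

Smoothing matters here because unseen bigrams in a candidate sentence would otherwise get probability zero and rule the whole sentence out.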
The following results show the accuracy of spell-correction using the different language models on the dev corpus:
- Uniform Language Model: correct: 31 total: 471 accuracy: 0.065817
- Unigram Language Model: correct: 29 total: 471 accuracy: 0.061571
- Laplace Unigram Language Model: correct: 52 total: 471 accuracy: 0.110403
- Laplace Bigram Language Model: correct: 64 total: 471 accuracy: 0.135881
- Stupid Backoff Language Model: correct: 85 total: 471 accuracy: 0.180467
- Custom Language Model (Simple Linear Interpolation): correct: 91 total: 471 accuracy: 0.193206
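The Stupid Backoff scorer (the best-performing single model above) can be sketched as follows; the backoff weight `alpha = 0.4` is the commonly cited default, and the add-one handling of unseen unigrams is an illustrative choice:

```python
from collections import Counter
import math

def stupid_backoff_score(sentence, bigrams, unigrams, total, alpha=0.4):
    """Stupid backoff: use the bigram MLE when the bigram was seen, otherwise
    back off to alpha * (smoothed unigram MLE). These are scores, not
    normalized probabilities."""
    words = ["<s>"] + sentence
    score = 0.0
    for a, b in zip(words, words[1:]):
        if bigrams[(a, b)] > 0:
            score += math.log(bigrams[(a, b)] / unigrams[a])
        else:
            score += math.log(alpha * (unigrams[b] + 1) / total)  # add-one for unseen words
    return score

# Toy counts (illustrative data):
unigrams = Counter({"<s>": 2, "the": 2, "cat": 1, "sat": 1})
bigrams = Counter({("<s>", "the"): 2, ("the", "cat"): 1, ("cat", "sat"): 1})
total = sum(unigrams.values())
print(stupid_backoff_score(["the", "cat", "sat"], bigrams, unigrams, total))
```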
The next figure shows the accuracy of the simple linear interpolation language model on the dev corpus for different values of the interpolation weights λ_1 and λ_2.
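A simple linear interpolation scorer can be sketched as below; the weight `lam` and the toy probability tables are illustrative, not the values tuned in the figure:

```python
import math

def interp_score(sentence, bigram_p, unigram_p, lam=0.2):
    """Simple linear interpolation: for each position,
    P(w_i | w_{i-1}) = (1 - lam) * P_bigram + lam * P_unigram,
    where `bigram_p` and `unigram_p` are assumed probability functions."""
    words = ["<s>"] + sentence
    return sum(math.log((1 - lam) * bigram_p(a, b) + lam * unigram_p(b))
               for a, b in zip(words, words[1:]))

# Toy probability tables (illustrative values):
bigram_p = lambda a, b: {("<s>", "the"): 0.5, ("the", "cat"): 0.4}.get((a, b), 0.0)
unigram_p = lambda w: {"the": 0.3, "cat": 0.1, "dog": 0.05}.get(w, 0.01)
print(interp_score(["the", "cat"], bigram_p, unigram_p))
```

Because the unigram term is always nonzero here, interpolation never assigns a candidate sentence zero probability, which is why it can outperform plain smoothing on this task.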