Email Spam Detection with the Naive Bayes Classifier as a Probabilistic Graphical Model: Parameter Estimation and Prediction with Laplace Smoothing

This problem appeared as a project in the edX course Computational Probability and Inference, MITx – 6.008.1x. Here is the problem statement.

Problem Statement


The figure below shows the math used for parameter estimation during training and for the log odds ratio computation during the testing (classification) phase. The training dataset contained 3675 spam and 1500 ham emails, whereas the test dataset contained 49 spam and 51 ham emails.

  1. First, the parameters of the Naive Bayes classifier are estimated from the training dataset (with MLE, using Laplace smoothing).
  2. Then the log odds ratio is computed for the new emails (from the test dataset) for the spam / ham prediction.
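The two steps above can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution. It assumes each email is reduced to the set of words it contains (word presence or absence as binary features) and that Laplace smoothing adds one pseudo-count for "present" and one for "absent". The function names `train` and `log_odds` and the toy emails are hypothetical.

```python
import math

def train(spam_docs, ham_docs):
    """Estimate P(word present | class) for every vocabulary word.

    Each document is a set of words. Laplace smoothing (assumed here as
    add-one on both outcomes) gives (count + 1) / (n_docs + 2).
    """
    vocab = set().union(*spam_docs, *ham_docs)

    def smoothed(docs):
        n = len(docs)
        return {w: (sum(w in d for d in docs) + 1) / (n + 2) for w in vocab}

    return smoothed(spam_docs), smoothed(ham_docs)

def log_odds(words, p_spam, p_ham, s=0.5):
    """Return log P(spam | email) - log P(ham | email) for spam prior s."""
    total = math.log(s) - math.log(1 - s)
    for w in p_spam:
        if w in words:
            total += math.log(p_spam[w]) - math.log(p_ham[w])
        else:
            total += math.log(1 - p_spam[w]) - math.log(1 - p_ham[w])
    return total

# Toy data, invented for illustration only.
spam_docs = [{"win", "money"}, {"win", "prize"}]
ham_docs = [{"meeting", "today"}, {"money", "report"}]
p_spam, p_ham = train(spam_docs, ham_docs)

# A positive log odds ratio means the email is classified as spam.
is_spam = log_odds({"win", "money"}, p_spam, p_ham) > 0
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps every probability strictly between 0 and 1 so the logarithms are always defined.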


The figures below show the impact of the spam prior s on different accuracy measures. As s increases, both the number of true positives and the number of false positives tend to increase, since a larger prior shifts the log odds of every email toward spam.
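The effect of the prior can be illustrated with a small sketch. It assumes each test email already has a likelihood log-ratio (the log odds without the prior term) and simply adds log(s / (1 - s)) before thresholding at zero. The function name `confusion_counts` and the scores and labels below are made up for illustration; they are not the results from the actual test dataset.

```python
import math

def confusion_counts(llrs, labels, s):
    """Count (tp, fp, tn, fn) for spam prior s.

    llrs   : likelihood log-ratios, one per email (prior term excluded)
    labels : True for spam, False for ham
    """
    prior = math.log(s) - math.log(1 - s)
    tp = fp = tn = fn = 0
    for llr, actual_spam in zip(llrs, labels):
        predicted_spam = llr + prior > 0
        if predicted_spam and actual_spam:
            tp += 1
        elif predicted_spam and not actual_spam:
            fp += 1
        elif not predicted_spam and not actual_spam:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn

# Hypothetical scores and labels, for illustration only.
llrs = [4.0, 1.0, -0.5, -2.0, 0.5, -3.0]
labels = [True, True, True, False, False, False]

for s in (0.1, 0.5, 0.9):
    print(s, confusion_counts(llrs, labels, s))
```

On this toy data both the true positive and the false positive counts rise as s grows, which mirrors the trend described above.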


