Email Spam Detection with the Naive Bayes Classifier as a Probabilistic Graphical Model: Parameter Estimation and Prediction with Laplace Smoothing

This problem appeared as a project in the edX course Computational Probability and Inference, MITx – 6.008.1x. Here is the problem statement.

Problem Statement

ps.png

The figure below shows the math used for parameter estimation during training and for the log odds ratio computation during the testing / classification phase. The training dataset contained 3675 spam and 1500 ham emails, whereas the test dataset contained 49 spam and 51 ham emails.

  1. First, the parameters of the Naive Bayes classifier are estimated from the training dataset (maximum likelihood estimation with Laplace smoothing).
  2. Then the log odds ratio is computed for each new email (from the test dataset) to predict spam / ham.
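
The two steps above can be sketched in Python roughly as follows. This is a minimal illustration, not the project's reference solution: it assumes a Bernoulli (word present / absent) feature model, add-one smoothing of the form (count + 1) / (n + 2), and that the spam prior s is passed in as a parameter. The function names `train`, `log_odds` and `classify` are hypothetical.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs, vocab):
    """Estimate P(word present | class) for each class with Laplace smoothing.

    Each document is a list of words. Assumption: Bernoulli features,
    so only the presence of a word in an email matters, not its count.
    """
    vocab = set(vocab)

    def estimate(docs):
        counts = Counter()
        for words in docs:
            counts.update(set(words) & vocab)  # count each word once per email
        n = len(docs)
        # Add-one smoothing: one pseudo-email with the word, one without it.
        return {w: (counts[w] + 1) / (n + 2) for w in vocab}

    return estimate(spam_docs), estimate(ham_docs)


def log_odds(words, p, q, s):
    """Return log P(spam | email) - log P(ham | email).

    p and q map each word to its smoothed presence probability under
    spam and ham respectively; s is the prior probability of spam.
    """
    present = set(words)
    total = math.log(s) - math.log(1 - s)
    for w in p:
        if w in present:
            total += math.log(p[w]) - math.log(q[w])
        else:
            total += math.log(1 - p[w]) - math.log(1 - q[w])
    return total


def classify(words, p, q, s):
    """Predict 'spam' when the log odds ratio is positive, else 'ham'."""
    return "spam" if log_odds(words, p, q, s) > 0 else "ham"
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and the smoothing keeps every probability strictly between 0 and 1, so none of the logarithms is undefined.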

nbt.png

The figures below show the impact of the spam prior s on different accuracy measures. As can be seen, as s increases, both the number of true positives and the number of false positives tend to increase.
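
A small sketch may help explain this trend. If the word probabilities are held fixed, the prior enters the log odds ratio only through the additive term log(s / (1 - s)), so raising s lowers the bar for calling an email spam. The log-likelihood ratios and labels below are made-up illustrative values, not taken from the dataset above, and `confusion_counts` is a hypothetical helper.

```python
import math


def confusion_counts(llrs, labels, s):
    """Count (true positives, false positives) for spam prior s.

    llrs holds each email's log-likelihood ratio, i.e. the log odds
    ratio without the prior term; labels[i] is True for actual spam.
    An email is predicted spam when llr + log(s / (1 - s)) > 0.
    """
    prior_log_odds = math.log(s) - math.log(1 - s)
    tp = fp = 0
    for llr, is_spam in zip(llrs, labels):
        if llr + prior_log_odds > 0:
            if is_spam:
                tp += 1
            else:
                fp += 1
    return tp, fp


# Hypothetical scores for three spam and three ham emails.
llrs = [3.0, 0.5, -0.5, 1.0, -1.5, -4.0]
labels = [True, True, True, False, False, False]

for s in (0.1, 0.5, 0.9):
    print(s, confusion_counts(llrs, labels, s))
# 0.1 (1, 0)
# 0.5 (2, 1)
# 0.9 (3, 2)
```

With fixed scores, neither count can decrease as s grows, because a larger prior only adds emails to the set predicted as spam.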

performance

performance1
