This problem appeared as a project in the edX course Computational Probability and Inference, MITx – 6.008.1x. Here is the problem statement.
The figure below shows the math used for parameter estimation during training and for the log odds ratio computation during the testing (classification) phase. The training dataset contained 3675 spam and 1500 ham emails, whereas the test dataset contained 49 spam and 51 ham emails.
- First, the parameters of the Naive Bayes classifier are estimated from the training dataset (maximum likelihood estimates with Laplace smoothing).
- Then the log odds ratio is computed for each new email (from the test dataset) to predict whether it is spam or ham.
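The two steps above can be sketched in Python as follows. This is a minimal illustration, not the project's reference code. It assumes a Bernoulli model in which each email is reduced to the set of distinct words it contains, and add-one smoothing of the form (count + 1) / (n + 2). The function names `train` and `log_odds` and the toy emails in the usage note are hypothetical.

```python
import math
from collections import Counter


def train(spam_docs, ham_docs):
    """Estimate smoothed per-word occurrence probabilities for each class.

    Each document is an iterable of words; only presence or absence of a
    word is used (assumed Bernoulli model).
    """
    vocab = set().union(*spam_docs, *ham_docs)
    # Number of documents of each class that contain each word.
    spam_counts = Counter(w for d in spam_docs for w in set(d))
    ham_counts = Counter(w for d in ham_docs for w in set(d))
    # Laplace smoothing: pretend each word was seen once and missed once
    # in each class, so no estimate is exactly 0 or 1.
    p = {w: (spam_counts[w] + 1) / (len(spam_docs) + 2) for w in vocab}
    q = {w: (ham_counts[w] + 1) / (len(ham_docs) + 2) for w in vocab}
    return p, q


def log_odds(words, p, q, s):
    """Return log P(spam | email) - log P(ham | email) for spam prior s."""
    words = set(words)
    total = math.log(s) - math.log(1 - s)  # contribution of the prior
    for w in p:
        if w in words:
            total += math.log(p[w]) - math.log(q[w])
        else:
            total += math.log(1 - p[w]) - math.log(1 - q[w])
    return total
```

An email is predicted as spam when `log_odds(...)` is positive. For example, training on the toy data `spam_docs = [{"buy", "now"}, {"buy", "cheap"}]` and `ham_docs = [{"meeting", "now"}, {"lunch"}]` with `s = 0.5` yields a positive score for `{"buy"}` and a negative score for `{"lunch"}`. Working in log space avoids numerical underflow when many small probabilities are multiplied.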
The figures below show the impact of the spam prior s on different accuracy measures. As s increases, the numbers of both true positives and false positives tend to increase.
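The reason for this trend is that the prior enters the log odds ratio only through the additive term log(s / (1 - s)), which grows monotonically with s, so a larger s pushes every email toward the spam decision. The sketch below illustrates this with a hypothetical helper, `confusion_counts`, applied to made-up scores; the scores are assumed to be log likelihood ratios computed without the prior term, and none of the numbers come from the actual dataset.

```python
import math


def confusion_counts(scores, labels, s):
    """Count (tp, fp, tn, fn) for a given spam prior s.

    scores: log likelihood ratios, excluding the prior term (assumption).
    labels: True for spam, False for ham.
    """
    prior_term = math.log(s) - math.log(1 - s)
    tp = fp = tn = fn = 0
    for score, is_spam in zip(scores, labels):
        predicted_spam = score + prior_term > 0
        if predicted_spam and is_spam:
            tp += 1
        elif predicted_spam and not is_spam:
            fp += 1
        elif not predicted_spam and not is_spam:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


# Made-up scores for two spam and two ham emails, for illustration only.
scores = [2.0, -0.5, 0.5, -3.0]
labels = [True, True, False, False]
for s in (0.1, 0.5, 0.9, 0.99):
    print(s, confusion_counts(scores, labels, s))
```

With these toy values, the true positive and false positive counts never decrease as s goes from 0.1 to 0.99, mirroring the trend described above.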