This problem appeared as a project in the *edX course Computational Probability and Inference, MITx – 6.008.1x*. Here is the problem statement.

**Problem Statement**

The figure below shows the math used for parameter estimation during *training* and for the log odds ratio computation during the *testing* / *classification* phase. There were 3675 *spam* and 1500 *ham* *emails* in the training dataset, whereas there were 49 *spam* and 51 *ham* *emails* in the test dataset.

- First, the *parameters* for the *Naive Bayesian* classifier are *estimated* from the *training dataset* (with *MLE* and using *Laplace Smoothing*).
- Then the *log odds ratio* is computed for each *new email* (from the *test dataset*) for the *spam / ham prediction*.
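The two steps above can be sketched roughly as follows. This is a minimal illustration, not the course's reference solution: it assumes each email is represented as a set of distinct words (a Bernoulli, word present / absent model), and the names `train`, `log_odds`, and `classify`, along with the toy emails, are hypothetical.

```python
import math
from collections import Counter


def train(spam_emails, ham_emails):
    """Estimate P(word present | class) with Laplace smoothing.

    Assumption: each email is a set of words, so every word is a
    binary feature. Adding 1 to the count and 2 to the denominator
    covers the two outcomes (present / absent).
    """
    vocab = set().union(*spam_emails, *ham_emails)
    spam_counts = Counter(w for email in spam_emails for w in set(email))
    ham_counts = Counter(w for email in ham_emails for w in set(email))
    p_spam = {w: (spam_counts[w] + 1) / (len(spam_emails) + 2) for w in vocab}
    p_ham = {w: (ham_counts[w] + 1) / (len(ham_emails) + 2) for w in vocab}
    return p_spam, p_ham


def log_odds(email, p_spam, p_ham, s):
    """Return log P(spam | email) - log P(ham | email) for spam prior s."""
    words = set(email)
    total = math.log(s) - math.log(1 - s)  # contribution of the prior
    for w in p_spam:  # every vocabulary word contributes, present or absent
        if w in words:
            total += math.log(p_spam[w]) - math.log(p_ham[w])
        else:
            total += math.log(1 - p_spam[w]) - math.log(1 - p_ham[w])
    return total


def classify(email, p_spam, p_ham, s):
    """Predict spam when the log odds ratio is positive."""
    return "spam" if log_odds(email, p_spam, p_ham, s) > 0 else "ham"


# Toy data, purely illustrative
toy_spam = [{"win", "money"}, {"win", "prize"}]
toy_ham = [{"meeting", "notes"}, {"lunch", "money"}]
p_spam, p_ham = train(toy_spam, toy_ham)
prediction_a = classify({"win", "prize"}, p_spam, p_ham, s=0.5)
prediction_b = classify({"meeting", "notes"}, p_spam, p_ham, s=0.5)
```

Working in log space avoids numerical underflow when many small probabilities are multiplied, and Laplace smoothing keeps every probability strictly between 0 and 1, so the logarithms are always defined.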

The figures below show the impact of the *spam prior* **s** on different *accuracy measures*. As can be seen, as **s** increases, both the number of *true positives* and the number of *false positives* tend to *increase*.
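One way to see why both counts move together: the prior enters the log odds ratio only through the additive term log(s / (1 - s)), which grows with **s**, so raising **s** can only push more emails over the spam threshold. The sketch below illustrates this on made-up numbers. The per-email likelihood log ratios and labels are hypothetical, not taken from the actual test dataset, and `confusion_counts` is an illustrative helper name.

```python
import math


def confusion_counts(likelihood_log_ratios, labels, s):
    """Count true and false positives for a given spam prior s.

    likelihood_log_ratios holds, per email, the log odds ratio
    without the prior term; the prior adds log(s / (1 - s)).
    """
    prior_term = math.log(s) - math.log(1 - s)
    tp = fp = 0
    for llr, label in zip(likelihood_log_ratios, labels):
        if llr + prior_term > 0:  # predicted spam
            if label == "spam":
                tp += 1
            else:
                fp += 1
    return tp, fp


# Hypothetical scores and labels, for illustration only
toy_llrs = [2.0, 0.5, -1.0, 1.0, -0.5, -3.0]
toy_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Map each prior to its (true positives, false positives) pair
sweep = {s: confusion_counts(toy_llrs, toy_labels, s) for s in (0.1, 0.5, 0.9)}
```

On this toy data, neither count ever decreases as the prior grows, which matches the trend described above.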
