Email Spam Detection with the Naive Bayes Classifier as a Probabilistic Graphical Model: Parameter Estimation and Prediction with Laplace Smoothing

This problem appeared as a project in the edX course Computational Probability and Inference, MITx – 6.008.1x. Here is the the problem statement.

Problem Statement

ps.png

The below figure shows the math used for parameter estimation during training and log odds ratio computation during testing / classification phase. There were 3675 spam and 1500 ham emails in the training dataset, whereas there were 49 spam and 51 ham emails in the test dataset.

  1. First the parameters for the Naive Bayesian are estimated  (with MLE  and using Laplace Smoothing from the training dataset).
  2. Then the log odds ratio is computed for the new emails (from the test dataset) for the spam / ham prediction.

nbt.png

The below figures show the impact of spam prior s on different accuracy measures. As can be seen, as s increases, both the number of true positives and the false positives tend to increase.

performanceperformance1

Advertisements

Solving the Deterministic and the Stochastic Versions of the Facility location problem with Julia / JUMP Solver

This problem appeared as an exercise in the edX course MITx: 15.053x Optimization Methods in Business Analytics.

Problem Statement (Deterministic version of Facility Location)

fcl.pngThe following shows a part of the Julia implementation of the problem with Jump solver:

j1.png

Problem Statement (Stochastic version of Facility Location)

sfcl.png

The following shows a part of the Julia implementation of the problem with Jump solver:

j2.png

Robot Localization with HMM as Probabilistic Graphical Model in Python & R

This problem appeared as a project in the edX course Computational Probability and InferenceMITx – 6.008.1x.  Here is the definition of the problem:

Problem Definition

Your pet robot is stuck on Mars somewhere on a grid of width 12 and height 8. You would like to figure out where your robot is over time so that you can rescue it.

MODEL

The robot’s position at time i is given by random variable Z_i, which takes on a value in {0,1,…,11}×{0,1,…,7}. For example, if Z_2=(5,4), then this means that at time step 2, the robot is in column 5, row 4. Luckily, the robot is quite predictable. At each time step, it makes one of five actions: it stays put, goes left, goes up, goes right, or goes down. But the action it chooses depends on its previous action.

In particular, if its previous action was a movement, it moves in the same direction with probability 0.9 and stays put with probability 0.1. If its previous action was to stay put, it stays again (w.p. 0.2), or moves in any direction (each w.p. 0.2). For example, if the robot’s previous action was ‘up’, then with probability 0.1, the robot’s next action will be ‘stay’, and with probability 0.9, the robot’s next action will be ‘up’.

We can visually represent these transitions with the following transition diagram (a transition diagram, while graphical and with probabilities is not the same as a probabilistic graphical model; each node in a probabilistic graphical model represents a random variable whereas in a transition diagram, each node is not a random variable but is instead an actual state that a random variable can take on, and the directed edges specify transition probabilities to get to other states):

images_miniproj-robot-statetransition.png

There’s a catch: We need to treat the boundary of the grid differently. For instance, at the top of the grid, the robot can’t go any higher. We’ll re-normalize the probabilities of its remaining actions at the top so that they sum to 1. Such boundary cases suggest that the transition probabilities depend on the robot’s current location and its previous action. Thus, we model the robot’s hidden state Xi at time i to consist of both its location Z_i and the action A_i it last took to reach Z_i, i.e., X_i=(Z_i,A_i) as depicted below:

images_miniproj-robot-state.png

Unfortunately, we will not have access to directly observing the robot’s hidden state X_i. Instead, we have access to a noisy sensor that puts a uniform distribution on valid grid positions within 1 grid cell of the robot’s current true position. Also, this noisy sensor only gives a guess as to the robot’s current position and tells us nothing about what actions your robot has taken. In other words, at time i, we observe Y_i, which takes on a value in {0,1,…,11}×{0,1,…,7} as depicted below:

images_miniproj-robot-obs.png

Lastly, we shall assume that initially the robot is in any of the grid locations with equal probability and always starts with its action set to ‘stay’.

(a) For every time step i, you want to compute a distribution which tells us where the robot was at time i. Implement the forward-backward algorithm to compute the marginal distribution p_X_i|Y_0,…,Y_{n−1}(⋅|y_0,…,y_{n−1}) for each i.

(b) Some of the observations were lost when they were transmitted from Mars back to Earth. Modify forward_backward so that it can handle missing observations, i.e., where at some time steps, there is no observation. In a list of observations, a missing observation is stored as None rather than the usual tuple (x, y).
(c) Now, you want to know which sequence of states the robot was most likely to have visited. Implement the Viterbi algorithm for finding the MAP estimate of the entire sequence of hidden states given a sequence of observations.

(d) Implement a modified version of the Viterbi algorithm that outputs the second-best solution to the MAP problem rather than the best solution.

Forward and Backward message passing as Sum Product and Viterbi decoding as Max Product Algorithms

image2016-11-9 19-3-0.png

Here are the Viterbi decoding outputs for the most likely hidden states (actual location and action of the robot for a few of the test cases):

ezgif-1686156007.gif

Here are the Viterbi decoding outputs for the 2nd most most likely path for the robot:

2ndbest_missing_test2ndbest_test

test2
test1

The same problem is solved with R and the below figure shows the animation for the hidden states estimated with the Viterbi Decoding:

animation.gif

Green rectangle: actual robot location (hidden state)
Red rectangle: noisy sensor estimates for the robot location (observed state)
Blue Rectangle: HMM correction (estimated hidden states with Viterbi)

Solving Sudoku with Integer Programming in R

This problem appeared as a homework problem in the problem set in the edX course MITx: 15.053x Optimization Methods in Business Analytics.

Problem Statement

Sudoku is a popular number puzzle. The goal is to place the digits in [1,9] on a nine-by-nine grid, with some of the digits already filled in, where [1,9] denotes the set of integers from 1 to 9. Your solution must satisfy the following four rules:

  • Rule 1. Each cell contains an integer in [1,9].
  • Rule 2. Each row must contain each of the integers in [1,9].
  • Rule 3. Each column must contain each of the integers in [1,9].
  • Rule 4. Each of the nine 3×3 squares with bold outlines must contain each of the integers in [1,9].

Sudoku isn’t an optimization problem, it’s actually a feasibility problem: we wish to find a feasible solution that satisfies these rules. You can think of it as an optimization problem in which the objective is always equal to 0. This problem firsts asks you to use integer programming formulation techniques to model the rules of Sudoku. It then asks you to solve it.

Formulate the rules and solve the following Sudoku puzzle as a feasibility problem. What is the value of A+B+C in your solution?

sudokuabc

Problem Formulation
rules

The above five constraints exactly represent all the four rules of Sudoku. Let’s solve the integer programming problem using R Rglpk.

## [1] "objective function vector length: 729"
## [1] "dimension of the constraint matrix: 354 x 729"
## [1] "an arbitrary block subset of the constraint matrix [81:90, 131:140] shown below:"
##     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
## row    0    1    0    0    0    0    0    0    0     0
## row    0    0    1    0    0    0    0    0    0     0
## row    0    0    0    1    0    0    0    0    0     0
## row    0    0    0    0    1    0    0    0    0     0
## row    0    0    0    0    0    1    0    0    0     0
## row    0    0    0    0    0    0    1    0    0     0
## row    0    0    0    0    0    0    0    1    0     0
## row    0    0    0    0    0    0    0    0    1     0
## row    0    0    0    0    0    0    0    0    0     1
## row    0    0    0    0    0    0    0    0    0     0
## [1] "The binary constraint matrix as image:"

mat.png

The following is the solution that Rglpk came up with as a result of the feasibility integer program with the above constraints:

## $optimum
## [1] 0
## 
## $solution
##   [1] 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0
##  [36] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0
##  [71] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
## [106] 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 0
## [141] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0
## [176] 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [211] 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [246] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0
## [281] 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [316] 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0
## [351] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1
## [386] 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0
## [421] 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 1 0
## [456] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [491] 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1
## [526] 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
## [561] 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0
## [596] 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0
## [631] 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0
## [666] 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0
## [701] 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1
## 
## $status
## [1] 0
## [1] 81
## [1] "A + B + C =  5"

The following is the solution that Rglpk came up with converted to the tabular form for Sudoku:
solution