# Online Learning: Sentiment Analysis on Amazon Product Review Dataset with Logistic Regression via Stochastic Gradient Ascent in Python

This problem appeared as an assignment in the coursera course Machine Learning – Classification, part of Machine Learning specialization by the University of Washington. The following description of the problem is taken directly from the assignment.

The goal of this assignment is to implement an online logistic regression classifier using stochastic gradient ascent. The following are the sub-tasks:

• Amazon product reviews dataset is used along with positive / negative labels as the training dataset.
• Bag of words features will be extracted from the training dataset, only a pre-selected set of important words will be used as features.
• The partial derivative of log likelihood (with L2 penalty) with respect to each single coefficient is computed.
• Stochastic gradient ascent is implemented from sractch.
• Convergence of stochastic gradient ascent is compared with that of batch gradient ascent.

## Load and process review dataset

For this assignment, a subset of Amazon product review dataset is going to be used. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted of mostly positive reviews.

Preprocessing: We shall work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:

1. Remove punctuation using string manipulation functionality.
2. Compute word counts (only for the important_words)

The data frame products now contains one column for each of the 193 important_words.

### Split data into training and validation sets

We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the validation set. We use seed=1 so that everyone gets the same result.

Training set  : 47765 data points
Validation set: 5307 data points


An additional colum ‘intercept‘ (filled with 1’s) is needed to be inserted into the data frame to take account of the intercept term.

## Building on logistic regression

Now the link function for logistic regression can be defined as:

where the feature vector h(xi)h(xi) is given by the word counts of important_words in the review xixi. Now our goal is to maximize the log-likelihood function, which does not have a closed-form solution, hence techniques like gradient descent needs to be used.

The way the probability predictions are computed is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.

Note. We are not using regularization in this assignment, but, stochastic gradient can also be used for regularized logistic regression, there will be one addtional term for regularization in the partial derivative.

To verify the correctness of the gradient computation, we use a function for computing average log likelihood (to be used for its numerical stability).

To track the performance of stochastic gradient ascent, we provide a function for computing average log likelihood.

Note the added 1/N term which averages the log likelihood across all data points. The 1/N term makes it easier for us to compare stochastic gradient ascent with batch gradient ascent.

In other words, we update the coefficients using the average gradient over data points (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of all the data points), and study the effect of batch size on the algorithm as well as the choice of step size.

Now let’s extend the algorithm for batch gradient ascent (that takes all the data at once) to stochastic (takes one data point at a time) and mini-batch gradient ascent (that takes data in small batches as input, computes the gradient and updates the coefficients). The following figure shows the gradient ascent algorithm, which needs to scaled dwon by appropriate batch size.

## Compare convergence behavior of stochastic gradient ascent

For the remainder of the assignment, let’s compare stochastic gradient ascent against batch gradient ascent. For this, we need a reference implementation of batch gradient ascent.

Let’s now run stochastic gradient ascent over the feature_matrix_train for 10 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = 1
• max_iter = 10
Iteration 0: Average log likelihood (of data points in batch [00000:00001]) = -0.01416346
Iteration 1: Average log likelihood (of data points in batch [00001:00002]) = -0.00505439
Iteration 2: Average log likelihood (of data points in batch [00002:00003]) = -0.00177457
Iteration 3: Average log likelihood (of data points in batch [00003:00004]) = -0.00311449
Iteration 4: Average log likelihood (of data points in batch [00004:00005]) = -0.06140707
Iteration 5: Average log likelihood (of data points in batch [00005:00006]) = -0.00000011
Iteration 6: Average log likelihood (of data points in batch [00006:00007]) = -0.02461738
Iteration 7: Average log likelihood (of data points in batch [00007:00008]) = -0.00876472
Iteration 8: Average log likelihood (of data points in batch [00008:00009]) = -0.00003921
Iteration 9: Average log likelihood (of data points in batch [00009:00010]) = -0.00000620

As expected, as each iteration passes, how does the average log likelihood in the batch fluctuates with stochastic gradient descent (i.e., with batch size 1).

Now let’s again run batch gradient ascent over the feature_matrix_train but this time for 200 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = # data points in the training dataset
• max_iter = 200
Iteration   0: Average log likelihood (of data points in batch [00000:47765]) = -0.68313840
Iteration   1: Average log likelihood (of data points in batch [00000:47765]) = -0.67402166
Iteration   2: Average log likelihood (of data points in batch [00000:47765]) = -0.66563558
Iteration   3: Average log likelihood (of data points in batch [00000:47765]) = -0.65788618
Iteration   4: Average log likelihood (of data points in batch [00000:47765]) = -0.65070149
Iteration   5: Average log likelihood (of data points in batch [00000:47765]) = -0.64402111
Iteration   6: Average log likelihood (of data points in batch [00000:47765]) = -0.63779295
Iteration   7: Average log likelihood (of data points in batch [00000:47765]) = -0.63197173
Iteration   8: Average log likelihood (of data points in batch [00000:47765]) = -0.62651787
Iteration   9: Average log likelihood (of data points in batch [00000:47765]) = -0.62139668
Iteration  10: Average log likelihood (of data points in batch [00000:47765]) = -0.61657763
Iteration  11: Average log likelihood (of data points in batch [00000:47765]) = -0.61203378
Iteration  12: Average log likelihood (of data points in batch [00000:47765]) = -0.60774127
Iteration  13: Average log likelihood (of data points in batch [00000:47765]) = -0.60367892
Iteration  14: Average log likelihood (of data points in batch [00000:47765]) = -0.59982787
Iteration  15: Average log likelihood (of data points in batch [00000:47765]) = -0.59617125
Iteration 100: Average log likelihood (of data points in batch [00000:47765]) = -0.49541833
Iteration 199: Average log likelihood (of data points in batch [00000:47765]) = -0.47143083


As expected, with (full) batch gradient ascent, as each iteration passes, the average log likelihood in the batch continuously increases.

## Make “passes” over the dataset

To make a fair comparison between stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes (defined as follows):

## Log likelihood plots for stochastic gradient ascent

With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078


Let’s plot the average log likelihood as a function of the number of passes.

## Smoothing the stochastic gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter smoothing_window. The smoothing is simply a moving average of log likelihood over the last smoothing_window “iterations” of stochastic gradient ascent.

To compare convergence rates for stochastic gradient ascent with batch gradient ascent, let’s plot the change in log-likelihood with the iterations.

We are comparing:

• stochastic gradient ascent: step_size = 0.1, batch_size=100
• batch gradient ascent: step_size = 0.5, batch_size= # data points

Write code to run stochastic gradient ascent for 200 passes using:

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration     0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration     1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration     2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration     3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration     4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration     5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration     6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration     7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration     8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration     9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration    10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration    11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration    12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration    13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration    14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration    15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration   100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration   200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration   300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration   400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration   500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration   600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration   700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration   800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration   900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration  1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration  2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration  3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration  4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration  5000: Average log likelihood (of data points in batch [23000:23100]) = -0.41321275
Iteration  6000: Average log likelihood (of data points in batch [27600:27700]) = -0.42095621
Iteration  7000: Average log likelihood (of data points in batch [32200:32300]) = -0.47438456
Iteration  8000: Average log likelihood (of data points in batch [36800:36900]) = -0.40689130
Iteration  9000: Average log likelihood (of data points in batch [41400:41500]) = -0.44582019
Iteration 10000: Average log likelihood (of data points in batch [46000:46100]) = -0.39752726
Iteration 20000: Average log likelihood (of data points in batch [44300:44400]) = -0.50001293
Iteration 30000: Average log likelihood (of data points in batch [42600:42700]) = -0.44909961
Iteration 40000: Average log likelihood (of data points in batch [40900:41000]) = -0.41075257
Iteration 50000: Average log likelihood (of data points in batch [39200:39300]) = -0.47957450
Iteration 60000: Average log likelihood (of data points in batch [37500:37600]) = -0.42584682
Iteration 70000: Average log likelihood (of data points in batch [35800:35900]) = -0.37312738
Iteration 80000: Average log likelihood (of data points in batch [34100:34200]) = -0.41330111
Iteration 90000: Average log likelihood (of data points in batch [32400:32500]) = -0.47600432
Iteration 95399: Average log likelihood (of data points in batch [47600:47700]) = -0.47449630


We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with smoothing_window=30.

As can be seen from the figure above, the batch gradient ascent needs at least 150 passes to achieve a similar log likelihood as stochastic gradient ascent.

## Explore the effects of step sizes (learning rate) on stochastic gradient ascent

To start, let’s explore a wide range of step sizes that are equally spaced in the log space and run stochastic gradient ascent with step_size set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2, using the following set of parameters:

• initial_coefficients=zeros
• batch_size=100
• max_iter initialized so as to run 10 passes over the data.
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69313572
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69313813
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69312970
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69313664
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69312912
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69312352
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69312565
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69312596
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69312480
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69311820
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69312342
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69309997
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69310417
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69311459
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69311289
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69310315
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69299666
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.69262795
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.69259227
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.69216304
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.69184517
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.69233727
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.69184444
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.69162156
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.69137017
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.69116453
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.68868229
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.68748389
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.68381866
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.68213944
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69303256
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69305660
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69297245
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69304176
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69296671
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69291077
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69293204
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69293505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69292337
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69285782
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69290957
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69267562
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69271772
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69282167
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69280442
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69270726
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69163566
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.68802729
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.68772431
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.68369142
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.68064510
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.68541475
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.68090549
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.67879020
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.67693059
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.67539881
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.65759465
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.64745516
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.62162582
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.61371736
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69200364
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69223670
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69141056
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69209296
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69135181
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69080412
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69100987
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69103436
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69091067
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69029154
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69076626
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.68848541
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.68891938
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.68992883
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.68973094
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.68878712
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.67788829
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.64832833
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.64792583
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.62345602
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.60389801
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.63841711
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.61096879
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.59948158
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.59326446
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.59519901
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.54578301
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.51997970
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46497627
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.46731743
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.60671913
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.61435096
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.57582992
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.60419455
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.56461895
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.55369469
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.56188804
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.56098460
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.53659402
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.54855532
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.54643770
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.45436554
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.51347209
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.53114501
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.51323915
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.50234155
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.44799354
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.38955785
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.41840095
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.42081935
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.35694652
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.44873903
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.42411231
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.47364810
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.36346510
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.47974711
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.45019123
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.39097701
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.34895703
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.40687203
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.75339506
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -5.19955914
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -1.35343001
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -3.63980553
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -1.05854033
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -1.11538249
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.86603585
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67571232
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.83674532
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -1.07638709
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -1.20809203
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.90955296
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -1.58077817
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.77787311
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.62852240
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.70284036
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.62403867
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.30394690
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.56782701
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.48147752
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.43850709
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.43008741
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -1.11804684
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.53574169
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.31389670
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.61323575
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.76125665
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.02808956
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46513784
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.47187763
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -5.69805760
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -61.13545979
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -7.42566120
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -27.73988196
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -15.42138334
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -11.20038218
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -11.11461119
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -8.49925505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -16.55690529
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -16.98581355
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -13.56402260
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -6.25429003
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -16.59818372
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -8.45630034
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -4.98292077
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -7.47891216
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -6.88355879
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -3.34058488
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -3.83117603
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -6.29604810
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -2.99625091
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -2.31125647
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -8.19978746
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -3.76650208
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -4.06151269
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -12.66107559
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -13.33445580
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.63544030
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -3.55951973
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -3.41717551


### Plotting the log likelihood as a function of passes for each step size

Now, let’s plot the change in log likelihood for each of the following values of step_size:

• step_size = 1e-4
• step_size = 1e-3
• step_size = 1e-2
• step_size = 1e-1
• step_size = 1e0
• step_size = 1e1
• step_size = 1e2

For consistency, we again apply smoothing_window=30.

Now, let us remove the step size step_size = 1e2 and plot the rest of the curves.

As can be seen from the above plots, the step size 1e2 gives the worst result and step size 1 and 0.1 gives the best results in terms of convergence.

# Sentiment Analysis on the Large Movie Review Dataset using Linear Model Classifier with Hinge-loss and L1 Penalty with Language Model Features and Stochastic Gradient Descent in Python

This problem appeared as a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI). The following description of the problem is taken directly from the project description.

In this assignment, an active research area in Natural Language Processing (NLP), sentiment analysis will be touched on. Given the exponential growth of online review data (Amazon, IMDB and etc), sentiment analysis becomes increasingly important. In this assignment, the task will be to build a sentiment classifier, i.e., evaluating a piece of text being either positive or negative.

The “Large Movie Review Dataset“(*) shall be used for this project. The dataset is compiled from a collection of 50,000 reviews from IMDB on the condition there are no more than 30 reviews each movie. Number of positive and negative reviews are equal. Negative reviews have scores lesser or equal 4 out of 10 while a positive review greater or equal 7 out of 10. Neutral reviews are not included on the other hand. Then, 50,000 reviews are divided evenly into the training and test set.

*Dataset is credited to Prof. Andrew Mass in the paper, Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

## Instruction

In this project, a Stochastic Gradient Descent Classifier will be trained. While gradient descent is powerful, it can be prohibitively expensive when the dataset is extremely large because every single data point needs to be processed.

However, it turns out when the data is large, rather than the entire dataset, SGD algorithm performs just as good with a small random subset of the original data. This is the central idea of Stochastic SGD and particarly handy for the text data since corpus are often humongous.

## Data Preprocessing

The first task is explore the training data and create one single training data file combining the positive and negative labeled texts. The column “polarity” for each movie-review text consists of sentiment labels, 1 for positive and 0 for negative. In addition, common English stopwords should be removed.

The following table shows the first few rows of the training dataset. The training dataset contains 25000 movie reviews.

## Unigram Data Representation

The very first step in solving any NLP problem is finding a way to represent the text data so that machines can understand. A common approach is using a document-term vector where each document is encoded as a discrete vector that counts occurrences of each word in the vocabulary it contains. For example, consider two one-sentence documents:

• d1: “I love Columbia Artificial Intelligence course”.
• d2: “Artificial Intelligence is awesome”.

The vocabulary V = {artificial, awesome, Columbia, course, I, intelligence, is, love} and two documents can be encoded as v1 and v2 as follows:

This data representation is also called a unigram model.

Now, the next task is to transform the text column in the training dataset into a term-document matrix using uni-gram model. A few rows and columns of this transformed dataset with unigram features (~75k features) are shown as shown below. As can be noticed, stemming is not used.

As can be seen from the above table, the unigram feature matrix is extremely sparse (it took 1.6G space to store the first 10k rows as csv and after compressing the zipped file size became only ~4.5MB, around 99.5% sparse) and the following density plot shows density of the average number of occurences of all the unigram features (after discrading the top 15 freatures with the highest average number of occurences) and density is concentrated < 0.5, which means for the first 10k text reviews almost all the unigram features have average value of occurence < 0.5.

Next, we need to train a Stochastic Descend Gradient (SGD) classifier whose loss=“hinge” and penalty=“l1” on this transformed training dataset.

On the other hand, a test dataset is provided which serves as the benchmark file for the performance of the trained classifier. Next task is to use the trained SGD to predict the sentiments of the text in the test dataset, after converting it to the corresponding unigram represenetation. the trained SGD classifier to predict this information.

The test dataset too has 25000 text reviews and sentiment for each of them needs to be predicted. Here are the prediction counts for positive and negative sentiments predicted with the unigram model.

## Bigram Representation

A more sophisticated data representation model is the bigram model where occurrences depend on a sequence of two words rather than an individual one. Taking the same example like before, v1 and v2 are now encoded as follows:

Instead of enumerating every individual words, bigram counts the number of instance a word following after another one. In both d1 and d2 “intelligence” follows “artificial” so v1(intelligence | artificial) = v2(intelligence | artificial) = 1. In contrast, “artificial” does not follow “awesome” so v1(artificial | awesome) = v2(artificial | awesome) = 0.

The same exercise from Unigram is to be repeated for the Bigram Model Data Representation and the corresponding test prediction file needs to be produced. A few rows and columns of this transformed dataset with bigram features (~175k total bigram features) are shown  below.

## Tf-idf

Sometimes, a very high word counting may not be meaningful. For example, a common word like “say” may appear 10 times more frequent than a less-common word such as “machine” but it does not mean “say” is 10 times more relevant to our sentiment classifier. To alleviate this issue, we can instead use term frequency tf[t] = 1 + log(f[t,d]) where f[t,d] is the count of term t in document d. The log function dampens the unwanted influence of common English words.

Inverse document frequency (idf) is a similar concept. To take an example, it is likely that all of our training documents belong to a same category which has specific jargons. For example, Computer Science documents often have words such as computers, CPU, programming and etc appearing over and over. While they are not common English words, because of the document domain, their occurrences are very high. To rectify, we can adjust using inverse term frequency idf[t] = log( N / df[t] ) where df[t] is the number of documents containing the term t and N is the total number of document in the dataset.

Therefore, instead of just word frequency, tf-idf for each term t can be used, tf-idf[t] = tf[t] ∗idf[t].

The same exercise needs tp be repeated as in the Unigram and Bigram data model but tf-idf needs to be applied this time to produce test prediction files.

A few rows and columns of this transformed dataset with tf-idf unigram features (~75k unigram tf-idf features) are shown below.

A few rows and columns of this transformed dataset with tf-idf bigram features (~175k bigram tifidf features) are shown below.

The next figure shows how the SGD for different models converges with epochs. As can be seen, till 100 epochs, the unigram models cost function (hinge loss) decrease with a much faster rate than the bigram models.

# Using Minimax (with the full game tree) to implement the machine players to play TictacToe in Computer with Python

The following problem appeared in one of the mini-projects in the coursera course Principles of Computing 2 which is a part of the Specialization Fundamentals of Computing, taught by RICE university. The following problem description is taken from the project description from the course itself:

## TicTacToe (Minimax)

### Overview

For this assignment, the task is to implement a machine player for TicTacToe that uses a Minimax strategy to decide its next move. Tic-tac-toe is a zero-sum-game with perfect information. Although the game is played on a 3×3 grid, the implemented version should be able to handle any square grid (however, the time it will take to search the tree for larger grid sizes will be prohibitively slow).

In this project, Minimax will be used to search the entire game tree and no pruning will be used (since the full game tree is relatively of small size, exploring the entire game tree without pruning will not be inefficient on a 3×3 board). If we start with an empty TicTacToe board, it will take a long time for the strategy to search the tree. Therefore, it’s strongly suggested to write tests that start with a partially full board. This will allow the code to run much faster and will lead to a more pleasant development and debugging experience.

### Machine Player Strategy

Both the players are machine players and will use Minimax strategy to choose the next move from a given TicTacToe position. The general idea on Minimax is to search the entire game tree alternating between minimizing and maximizing the score at each level. For this to work, we need to start at the bottom of the tree and work back towards the root. However, instead of actually building the game tree to do this, should use recursion to search the tree in a depthfirst manner. The recursive function should call itself on each child of the current board position and then pick the move that maximizes (or minimizes, as appropriate) the score. If this is done, the recursive function will naturally explore all the way down to the bottom of the tree along each path in turn, leading to a depth first search that will implement Minimax. The following page https://class.coursera.org/principlescomputing2-002/wiki/minimax describes the process in more detail.

### Results

The following animation shows how each of the machine players play the game by using Minimax at every state of the board, in order to choose its next move.

The following figures show the game tree explored by the player (along with the scores evaluated) at a given state to choose the best move (MAX player chooses the move corresponding to the maximum of the minimum scores down the tree, where MIN player chooses the move corresponding to the minimum of the maximum scores down the tree, as shown next).

As can be seen, each of the state of the game tree is color-coded in w.r.t. the score, e.g.,

• green in favor of the MAX player
• red in favor of the MIN player
• gray neutral

### Game Tree Explored by Player X  (Zoom to see)

As can be seen from the above game tree, the next move chosen by the MAX player will be corresponding to the gray state, since all other next moves results in a red state.

### Game Tree Explored by Player O

As can be seen again from the above game tree, the next move chosen by the MIN player will be corresponding to the gray state, since all other next moves results in a green state.

### Game Tree Explored by Player X

As can be seen again from the above game tree, the next move chosen by the MAX player will be corresponding to the gray state, since all other next moves results in a red state.

### Game Tree Explored by Player O

As can be seen again from the above game tree, all of the moves result in a gray state, so the MIN player arbitrarily chooses one of them.

### Game Tree Explored by Player X

Hence the game ends in a tie.

# Using Monte-Carlo Simulation to play TicTacToe with Machine Players in Python

The following problem appeared in one of the mini-projects in the coursera course Principles of Computing 1 which is a part of the Specialization Fundamentals of Computing, taught by RICE university. In this article the problem is slightly modified as follows (the most part of the problem description is taken from the project description from the course itself, except the modification part):

## TicTacToe (Monte Carlo)

### Overview

TicTacToe is a simple children’s game played on a grid. Players alternate turns placing an “X” or an “O” on an empty grid square. The first player to get three in a row wins. If a player knows the appropriate strategy and the opponent does not, he cannot lose the game. Further, if both players understand the appropriate strategy the game will always end in a tie. An interesting variant of the game is “reverse” TicTacToe in which a player loses if he get three in arow. The game is also more interesting if it’s played on larger square grids.

In this article, a couple of machine players for TicTacToe will be implemented. Specifically, each of the machine players will use a Monte Carlo simulation to decide its next move. Although the game is played on a 3×3 grid, the implemented version should be able to handle any square grid.

### Machine Player Strategy

Each of the machine players should use a Monte Carlo simulation to choose the next move from a given TicTacToe board position. The general idea is to play a collection of games with random moves starting from the position, and then use the results of these games to compute a good move. When a paritular machine player wins one of these random games, it wants to favor the squares in which it played (in hope of choosing a winning move) and avoid the squares in which the opponent played. Conversely, when it loses one of these random games, it wants to favor the squares in which the opponent played (to block its opponent) and avoid the squares in which it played. In short, squares in which the winning player played in these random games should be favored over squares in which the losing player played. Both the players in this case will be the machine players.

Here is an outline of this simulation:

2. Repeat for the desired number of trials:
A. Set the current board to be board.
B. Play an entire game on this board by just randomly choosing a move for each player.
C. Score the resulting board.
D. Add the scores to a running total across all trials.
3. To select a move, randomly choose one of the empty squares on the board that has the maximum score.

These steps should be relatively straightforward except for step 2C, scoring the board. Scores are kept for each square on the board. The way to assign a score to a square depends on who won the game. If the game was a tie, then all squares should receive a score of 0, since that game will not help a player determine a winning strategy. If the current player (the player for which the code is currently selecting a move) won the game, each square that matches the current player should get a positive score and each square that matches the other player should get a negative score. Conversely, if the current player lost the game, each square that matches the current player should get a negative score and and each square that matches the other player should get a positive score. All empty squares should get a score of 0.

Note that it’s needed to select a final move from the total scores across all trials. So, in step 2D, all the scores from 2C are added to the running total of all scores. Higher scores indicate squares that are more likely to be played by the current player in winning games and lower scores indicate squares that are more likely to be played in losing games.

### Results

The following animations show how each of the machine players learns to play the game just by using Monte-Carlo Simulation with 10 trials (a small number of trials) at every state of the board, the scores shown at the right bottom of each grid square are used by each of the players at their corresponding turns, to choose its next move (the brighter cells represent better moves for the current player, as per the simulation results). Three different games are shown with 3 possible results.

The following game results in a tie.

As it’s intuitively clear, there is a trade-off between the learning accuracy (learn as many configurations as possible, so that it never loses, in turn none of the opponent machine players should lose, so every game should end in a tie in the perfect case) vs. learning efficiency (time to learn).

Ideally a machine player should play a large number of random games starting from the current state of the board to be more confident about the winning moves it needs to take, but it needs to use more and more trials for Monte-Carlo Simulation, which will increase its response time and the overall time to finish the game.

For each of the following different number of MC trials (from low = 5 to high = 5000), 100 different TicTacToe games are played in between the 2 machine players. The results obtained are shown below, they show the trade-off clearly. As the number of trials in Monte-Carlo Simulation increases,

1. Number of ties increases.
2. Total time to finish a game increases on average.

With a large number of trials, as large as 5000 each step, all the 100 games end in ties (both the machine players know enough from their experiences in the simulation, so they are highly likely not to lose ever), but it needs total time ~ 3-4 seconds on average to play each game.

# Solving 4-puzzle with Reinforcement Learning (Q-learning) in Python

This article describes a simple approach to solve the 4-puzzle problem with Reinforcement-learning (using Q-learning). Not all instances of 4-puzzle problem are solvable by only shifting the space (represented by 0). Let’s aim at solving those problem instances only with model-free-methods.

#### The Markov Decision Process for 4-puzzle problem

Consider the following 24-state MDP for the 4-puzzle problem, each state being encoded as a flattened string, with the goal state being ‘0123‘.

For each state except the goal state, we have exactly two valid actions from the set of actions {‘Up‘, ‘Down‘, ‘Left‘, ‘Right‘}. The goal state is the exception which has an additional actionSTAY‘ that allows self transition.

The only actions that have a positive reward of +100 are (‘2103‘, ‘UP‘), (‘1023‘, ‘LEFT‘) and (‘1023‘, ‘STAY‘), as shown in the following figure: all the single-step actions that can transition to the goal state from any state. All other actions get a 0 reward from all other states.

Also notice that there are two connected components in the graph with the MDP states as shown in the following figure, there is a component from which the goal state is not reachable. Let’s consider running Q-learning starting from any state belonging to the other component, one that contains the goal state.

The following Epsilon-Greedy Q-learning algorithm is used with exploration, with ϵ=0.9 (and γ=0.8) to learn the Q matrix, which will later be used to solve a given puzzle instance.

The following animations and the figures show the Q matrix learnt after 815 episodes.

The following figure shows the scaled final Q matrix where the maximum value is kept at 100.

The following figure shows how the difference between the norms of the Q matrices learnt at successive episodes change over episodes and becomes almost zero at convergence for a long enough sequence of episodes.
Solving a puzzle instance

The following figure shows the values from the Q matrix learnt by the Q-learning algorithm. As expected, the values learnt increase as the states come near to the goal state and decrease as the states go far away from the goal state. This will lead the agent choose the right path towards the goal state.

Now, after the Q-matrix is learnt, now let’s solve a 4-puzzle instance ‘0312′ using the Q matrix learnt. We have to just follow the steps below to solve the puzzle using the Q matrix as a look-up table:
1. Set current state = ‘0312’
2. Find the best action at the current state from the Q matrix (the action with the highest value).
3. Use the action to transition to the next state from the current state.
4. If next state == goal state then stop else go to step 1.

The next animation shows the solution using the Q matrix learnt.

# Some Reinforcement Learning: Using Policy & Value Iteration and Q-learning for a Markov Decision Process in Python and R

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning. Here is the description of the problem:

Let’s  consider the following 3-state MDP for a robot trying to walk, the three states being ‘Fallen‘, ‘Standing‘ and ‘Moving‘, as shown in the following figure.

(1) First use policy iteration method to find optimal policy for discount factor γ=0.1. Try using this method with a different discount factor, for example a much larger discount factor like 0.9 or 0.99, or a much smaller one like 0.01. Does the optimal policy change?

(2) Now, try using value iteration method instead. How does the number of iterations change? What about the complexity of implementing each iteration?

(3) Implement Q-learning method (without exploration) for this simple MDP. Run this script several times. Does the algorithm always converge?

(4) Now, try adding a simple epsilon greedy exploration and rerun the q-learning method.

The following theory is going to be used for implementation of the algorithms:

(1) The policy iteration method was implemented in python, where starting from the policy {Slow, Slow, Slow}, policy evaluation and improvement steps were applied iteratively.
As can be seen from the following results, a change in optimal policy happened at higher value of the discount factor γγ. The following figures (R visualizations) show how the optimal value vector changes with different values of γ .

#### Policy Iteration: changes in the value vector corresponding to the optimal policy

(2) The value iteration method was implemented similarly, it took much higher number of iterations to converge than the policy iteration method, although it individual iterations were much less complex.
As can be seen from the following results, a change in optimal policy happened at higher value of the discount factor γ . The following figures show how the optimal value vector changes with different values of γ.
Also, as can be seen, an increase in γγ (for which the expected future rewards became more important) resulted in a corresponding increase in the #iterations for the convergence of the value-iteration algorithm convergence (e.g., for γ=0.99γ=0.99 the value iteration method took more than 1000 iterations to converge).

(3) Q-learning without exploration: The algorithm was implemented by using weighted random sampling from the set of states, with the weights proportional to the corresponding probabilities, for any (state, action) at each time step, starting from the ‘Fallen‘ state at time step 0.For convergence, the algorithm was run until the difference in the norm of the learnt Q matrices in two successive iterations became < 1e6   and for at least 20 time steps.

As expected, without exploration, the ‘Fast‘ action in the ‘Moving‘ state (although actually with higher expected reward) does not get a chance to be explored ever and it always converges with the optimal policy as ‘Slow‘ action for all the states. With higher γγ, the number of iterations to converge becomes higher and the value for (Moving, Slow) in the Q matrix learnt becomes higher at convergence.

The following figures and animations show the Q matrix learnt at convergence for different values of γ.

(4) ϵ greedy Q-learning with exploration: with probability equal to ϵ, a random state was explored next, along with the greedy transition. As expected, with exploration, the ‘Fast‘ action at the Moving state also gets explored sometimes (the frequency of exploration depends on the value of ϵϵ chosen) and sometimes it converges with the optimal policy as ‘Slow‘ action for the states ‘Fallen‘ & ‘Standing‘ and ‘Fast‘ action for the ‘Moving‘ state, as expected.

The following figures and animations show two sample runs of the epslion greedy Q-learning with ϵ=0.1, each one converging at a different optimal policy.

#### Results with ϵ=0.1

Since different runs converged at different policies with ϵ=0.1, the algorithm was run 100 times and the frequencies of the unique Q matrix learnt (with values rounded to 1 decimal place) was noted, as shown in the following figure.

1. For 4 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for the states ‘Fallen‘ & ‘Standing‘ and ‘Fast‘ action for the ‘Moving‘ state.
2. For 92 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for all the states.
3. For 4 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for the states ‘Fallen‘ & ‘Moving‘ and ‘Fast‘ action for the ‘Standing‘ state.
4. For 45 out of 100 cases the ‘Fast‘ action at the ‘Moving‘ state is still unexplored.

The above results suggest that ϵ=0.1 is probably not exploring enough and its value should be increased.

#### Results with ϵ=0.5

Again the algorithm was run 100 times and the frequencies of the unique Q matrix learnt (with values rounded to 1 decimal place) was noted, as shown in the following figure.

1. For 33 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for the states ‘Fallen‘ & ‘Standing‘ and ‘Fast‘ action for the ‘Moving‘ state.
2. For 37 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for all the states.
3. For 15 out of 100 cases the optimal policy learnt by the Q matrix was ‘Slow‘ action for the states ‘Fallen‘ & ‘Moving‘ and ‘Fast‘ action for the ‘Standing‘ state.
4. For 15 out of 100 cases the optimal policy learnt by the Q matrix was ‘Fast‘ action for the states ‘Standing‘ & ‘Moving‘ and ‘Slow‘ action for the ‘Fallen‘ state.

# Hard & Soft Clustering with K-means, Weighted K-means and GMM-EM in Python

The following problems appeared as a project in the edX course ColumbiaX: CSMM.102x Machine Learning.

In this assignment the following clustering algorithms will be implemented:

1. Hard clustering with K-means
2. Soft clustering with
a. Weighted K-means
b. Gaussian mixture models with Expectation Maximization.

Some datasets with n data points {x_1,…,x_n} will be used for testing the algorithms, where each  x_∈ R^d.

### Hard Clustering

1. Each point is assigned to a one and only one cluster (hard assignment).
2. With K-means we try to find K centroids {μ1,,μK} and the corresponding assignments of each data point {c1,,cn}  where each ci{1,,K and c_i indicates which of the K clusters the observation x_i belongs to. The objective function that we seek to minimize can be written as

### Soft Clustering

(1) Each point is assigned to all the clusters with different weights or probabilities (soft assignment).

(2) With Weighed K-means we try to compute the weights ϕ_i(kfor each data point i to the cluster k as minimizing the following objective:

(3) With GMM-EM we can do soft clustering too. The EM algorithm can be used to learn the parameters of a Gaussian mixture model. For this model, we assume a generative process for the data as follows:

In other words, the i_th observation is first assigned to one of K clusters according to the probabilities in vector π, and the value of observation x_i is then generated from one of K multivariate Gaussian distributions, using the mean and covariance indexed by c_i. The EM algorithm seeks to maximize

over all parameters π,μ1,…,μK,Σ1,…,ΣK using the cluster assignments c1,…,cn as the hidden data.

The following figures show the algorithms that are going to be implemented for clustering.

The following animations show the output of the clustering algorithms (and how they converge with different iterations) on a few datasets (with k=3 clusters), the weighted k-means is run with the stiffness-parameter beta=10.

Dataset1

Dataset2

Dataset3

Dataset4

Dataset5

Dataset6

Dataset7

Dataset8

The next animation shows the results with weighted K-means with stiffness parameter beta=10

The next animation shows the results with weighted K-means with stiffness parameter beta=1

Here is how the log-likelihood for EM continuously increases over the iterations for a particular dataset:

The following animations show GMM EM convergence on a few more datasets:

# Solving Sudoku as a Constraint Satisfaction Problem using Constraint Propagation with Arc-Consistency Checking and then Backtracking with Minimum Remaining Value Heuristic and Forward Checking in Python

This problem appeared as a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI). In this assignment the focus will be on constraint satisfaction problems (CSP). The AC-3 and backtracking (with MRV heuristic) algorithms will be implemented to solve Sudoku puzzles. The objective of the game is just to ﬁll a 9 x 9 grid with numerical digits so that each column, each row, and each of the nine 3 x 3 sub-grids (also called boxes) contains one of all of the digits 1 through 9.

The following description of the problem is taken from the course:

### I. Introduction

Let’s consider the Sudoku puzzle as pictured below. There are 81 variables in total, i.e. the tiles to be filled with digits. Each variable is named by its row and its column, and must be assigned a value from 1 to 9, subject to the constraint that no two cells in the same row, column, or box may contain the same value.

### II. AC-3 Algorithm

First, the AC-3 (arc-consistency checking) algorithm will be implemented. This algorithm will propagate the constraints and reduce the domain size of the variables by ensuring all possible (future) assignments consistent. As can be seen, very few of the puzzles can be solved by using this algorithm alone, as expected. The following figure describes the algorithm AC-3 (in the context of map-coloring problem) and also shows the pseudocode for the algorithm that will be used for ensuring the arc-consistency:

### III. Backtracking Algorithm

Now, the backtracking algorithm will be implemented using the minimum remaining value (MRV) heuristic. The order of values to be attempted for each variable can be arbitrary. When a variable is assigned, forward checking will be applied to further reduce variables domains.The following figure shows the BT search algorithm to be used and also describes the minimum remaining value (MRV) heuristic to improve the BT search and constraint propagation with forward checking (in the context of map-coloring problem):

### Outputs

The following figures and animations show how a few instances of sudoku puzzles were solved using the above algorithms.

#### Example 1:

This is a simple puzzle that can be solved by running the AC-3 algorithm alone. Here is the  output obtained.

#### Example 2:

The above puzzle is an example that can’t be solved using AC-3 algorithm alone, although the domain of the variables could be reduced. The following shows the output from the AC-3 algorithm which shows that no additional variables could be assigned.

Next BT (Backtracking) search with MRV heuristic and forward checking will be applied to solve the puzzle, starting from the reduced domain set of the variables. The following animation shows the BT algorithm steps (it took 254 total steps).

#### The following figure shows the full backtracking tree for the above example puzzle (the tree is huge, zoom in to see it properly):

Example 3:

The next puzzle is a more complex one, that takes total 2075 steps to be solved with BT-search equipped with RMV/FC, preceded by AC-3 constraint propagation.

After AC-3:

After BT:

The partial solutions with partial backtracking trees for this example puzzle are shown below:

#### Example 4:

BT Search Soluton

Summary

The following shows the histogram and the distribution of the total time taken to solve a  single Sudoku puzzle, where there were 400 different such puzzles. As can be seen, almost all of the puzzles were solved very fast, typically within 3 secs.

The below figure shows the distribution of time taken only for the AC-3 algorithm. As expected it consumes very small amount of time.

The next figure shows the distribution of time taken only for the BT search algorithm.

Finally let’s have a look at the joint distribution of the time taken by AC-3 (X) and by BT (Y)

# Using Uninformed & Informed Search Algorithms to Solve 8-Puzzle (n-Puzzle) in Python / Java

This problem appeared as a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI). In this assignment an agent will be implemented to solve the 8-puzzle game (and the game generalized to an n × n array).

The following description of the problem is taken from the course:

## I. Introduction

An instance of the n-puzzle game consists of a board holding n^2-1 distinct movable tiles, plus an empty space. The tiles are numbers from the set 1,..,n^2-1. For any such board, the empty space may be legally swapped with any tile horizontally or vertically adjacent to it. In this assignment, the blank space is going to be represented with the number 0. Given an initial state of the board, the combinatorial search problem is to find a sequence of moves that transitions this state to the goal state; that is, the configuration with all tiles arranged in ascending order 0,1,… ,n^2−1. The search space is the set of all possible states reachable from the initial state. The blank space may be swapped with a component in one of the four directions {‘Up’, ‘Down’, ‘Left’, ‘Right’}, one move at a time. The cost of moving from one configuration of the board to another is the same and equal to one. Thus, the total cost of path is equal to the number of moves made from the initial state to the goal state.

## II. Algorithm Review

The searches begin by visiting the root node of the search tree, given by the initial state. Among other book-keeping details, three major things happen in sequence in order to visit a node:

1. First, we remove a node from the frontier set.
2. Second, we check the state against the goal state to determine if a solution has been found.
3. Finally, if the result of the check is negative, we then expand the node. To expand a given node, we generate successor nodes adjacent to the current node, and add them to the frontier set. Note that if these successor nodes are already in the frontier, or have already been visited, then they should not be added to the frontier again.

This describes the life-cycle of a visit, and is the basic order of operations for search agents in this assignment—(1) remove, (2) check, and (3) expand. In this assignment, we will implement algorithms as described here.

## III. What The Program Need to Output

Suppose the program is executed for breadth-first search starting from the initial state 1,2,5,3,4,0,6,7,8 as follows:

The output file should contain exactly the following lines:

path_to_goal: [‘Up’, ‘Left’, ‘Left’]
cost_of_path: 3
nodes_expanded: 10
fringe_size: 11
max_fringe_size: 12
search_depth: 3
max_search_depth: 4
running_time: 0.00188088
max_ram_usage: 0.07812500

The following algorithms are going to be implemented and taken from the lecture slides from the same course.

## Outputs

The following figures and animations show how the 8-puzzle was solved starting from different initial states with different algorithms. For A* and ID-A* search we are going to use Manhattan heuristic, which is an admissible heuristic for this problem. Also, the figures display the search paths from starting state to the goal node (the states with red text denote the path chosen). Let’s start with a very simple example. As can be seen, with this simple example all the algorithms find the same path to the goal node from the initial state.

#### Example 1: Initial State: 1,2,5,3,4,0,6,7,8

(1) BFS

The path to the goal node with BFS is shown in the following figure:

The nodes expanded by BFS (also the nodes that are in the fringe / frontier of the queue) are shown in the following figure:

(2) DFS

The path to the goal node with DFS is shown in the following figure:

(3) A*

The path to the goal node with A* (also the nodes expanded by A*) is shown in the following figure:

(4) ID-A*

The path to the goal node (as well as the nodes expanded) with ID-A* is shown in the following figure:

Now let’s try a little more complex examples:

Example 2: Initial State: 1,4,2,6,5,8,7,3,0

(1) A*

The path to the goal node with A* is shown in the following figure:

All the nodes expanded by A* (also the nodes that are in the fringe / frontier of the queue) are shown in the following figure:

(2) BFS

The path to the goal node with BFS is shown in the following figure:

All the nodes expanded by BFS are shown in the following figure:

Example 3: Initial State: 1,0,2,7,5,4,8,6,3

(1) A*

The path to the goal node with A* is shown in the following figures:

The nodes expanded by A* (also the nodes that are in the fringe / frontier of the priority queue) are shown in the following figure (the tree is huge, use zoom to view it properly):

(2) ID-A*

The nodes expanded by ID-A* are shown in the following figure (again the tree is huge, use zoom to view it properly):

The same problem (with a little variation) also appeared a programming exercise in the Coursera Course Algorithm-I (By Prof. ROBERT SEDGEWICK, Princeton). The description of the problem taken from the assignment is shown below (notice that the goal state is different in this version of the same problem):

Write a program to solve the 8-puzzle problem (and its natural generalizations) using the A* search algorithm.

The problem. The 8-puzzle problem is a puzzle invented and popularized by Noyes Palmer Chapman in the 1870s. It is played on a 3-by-3 grid with 8 square blocks labeled 1 through 8 and a blank square. Your goal is to rearrange the blocks so that they are in order. You are permitted to slide blocks horizontally or vertically into the blank square. The following shows a sequence of legal moves from an initial board position (left) to the goal position (right).

Best-first search. We now describe an algorithmic solution to the problem that illustrates a general artificial intelligence methodology known as the A* search algorithm. We define a state of the game to be the board position, the number of moves made to reach the board position, and the previous state. First, insert the initial state (the initial board, 0 moves, and a null previous state) into a priority queue. Then, delete from the priority queue the state with the minimum priority, and insert onto the priority queue all neighboring states (those that can be reached in one move). Repeat this procedure until the state dequeued is the goal state. The success of this approach hinges on the choice of priority function for a state. We consider two priority functions:

• Hamming priority function. The number of blocks in the wrong position, plus the number of moves made so far to get to the state. Intutively, a state with a small number of blocks in the wrong position is close to the goal state, and we prefer a state that have been reached using a small number of moves.
• Manhattan priority function. The sum of the distances (sum of the vertical and horizontal distance) from the blocks to their goal positions, plus the number of moves made so far to get to the state.

For example, the Hamming and Manhattan priorities of the initial state below are 5 and 10, respectively.

(1)
The following 8-puzzle is solvable with A* / Manhattan heuristic in 5 steps, as shown below:

(2) The following 15-puzzle is solvable in 6 steps, as shown below:

(3) The following 80-puzzle is solvable in 10 steps, as shown below:

# Some Machine Learning with Python

The following problems appeared as part of a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI).

## I. Perceptron Learning Algorithm

In this question, the perceptron learning algorithm (“PLA”) will be implemented for a linearly separable 2-D dataset. The dataset contains a series of data points. Each point is a comma-separated ordered triple, representing feature_1, feature_2, and the label for the point. The values of the features can be though of as the x- and y-coordinates of each point. The label takes on a value of positive or negative one. The label can be thought of as separating the points into two categories. Implement your PLA in a file called problem1.py, which will be executed like so:

With each iteration of the PLA, your program must print a new line to the output file, containing a comma-separated list of the weights w_1, w_2, and b in that order. Upon convergence, the program will stop, and the final values of w_1, w_2, and b will be printed to the output file. This defines the decision boundary that your PLA has computed for the given dataset.

Here are first few rows of the sample dataset and the scatter plot of the dataset:

x1 x2 y
0 8 -11 1
1 7 7 -1
2 12 -20 1
3 14 -3 -1
4 12 8 -1

The following figure and the animation show how the perceptron algorithm converges to find the 2-D hyperplane (line) that separates two classes.

## II. Linear Regression

In this problem, multiple linear regression will be implemented using gradient descent on a dataset with multiple features. Again, the dataset contains a series of data points. Each point is a comma-separated ordered triple, representing age, weight, and height (derived from CDC growth charts data). A few rows of the sample dataset is shown below.

Out[26]:
age weight height
0 2.00 10.21027 0.805248
1 2.04 13.07613 0.919474
2 2.13 11.44697 0.908350
3 2.21 14.43984 0.803756
4 2.29 12.59622 0.811357

Data Preparation and Normalization. Once the dataset is loaded, the content needs to be explored to identify each feature. The intercept is needed to be added ahead of the data matrix. Since the features are not on the same scale, e.g., they represent age (years), and weight (kilograms), they need to normalized. What is the mean and standard deviation of each feature? The last column is the label, and represents the height (meters). Each feature (i.e. age and weight, except the intercept) has to be scaled by its standard deviation, and its mean has to be set to zero, using the following formula (for z-score normalization):

Gradient Descent: Gradient descent is to be implemented to find a regression model. The β’s are to be initialized to zero. The empirical risk and gradient descent rule are as follows:

The gradient descent algorithm is to be run using the following learning rates: α ∈ {0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5, 10}. For each value of α, the algorithm is to be run for exactly 100 iterations and the convergence rates to be compared when α is small versus large. Also, the ideal learning rate to obtain an accurate model has to be computed.

The animation below shows how the regression plane is learnt with gradient descent.

## III. Classification

In this problem the support vector classifiers in the sklearn package is to be used to learn a classification model for a chessboard-like dataset. Again, the input dataset containins a series of data points as shown below.

A B label
0 3.40 0.35 1
1 0.70 0.20 1
2 3.55 2.50 0
3 3.10 0.35 1
4 0.70 1.30 1

SVM with different kernels is to be used to build a classifier. The dataset is to be splitted into training (60%) and testing (40%), using stratified sampling (i.e. same ratio of positive to negative in both the training and testing datasets). K-Fold Cross validation (with the number of folds k = 5) is to be used instead of a validation set. SVM with Linear Kernel. The performance of the SVM with linear kernel is to be observed. A good setting of parameters is to be searched to obtain high classification accuracy. Specifically, the values of C = [0.1, 0.5, 1, 5, 10, 50, 100] are to be tried. After locating the optimal parameter value by using the training data, the corresponding best score (accuracy) achieved needs to e recorded Then by applying the testing data to the model, and the actual test score is to be recored. Both scores will be a number between zero and one.

SVM with Polynomial Kernel. (Similar to above). Try values of C = [0.1, 1, 3], degree = [4, 5, 6], and gamma = [0.1, 1].

SVM with RBF Kernel. (Similar to above). Try values of C = [0.1, 0.5, 1, 5, 10, 50, 100] and gamma = [0.1, 0.5, 1, 3, 6, 10].

Logistic Regression. (Similar to above). Try values of C = [0.1, 0.5, 1, 5, 10, 50, 100].

k-Nearest Neighbors. (Similar to above). Try values of n_neighbors = [1, 2, 3, …, 50] and leaf_size = [5, 10, 15, …, 60].

Decision Trees. (Similar to above). Try values of max_depth = [1, 2, 3, …, 50] and min_samples_split = [1, 2, 3, …, 10].

Random Forest. (Similar to above). Try values of max_depth = [1, 2, 3, …, 50] and min_samples_split = [1, 2, 3, …, 10].

The following plots show the cross validation accuracies for a few different classifiers with different values of the hyper-parameters.

Here are the decision boundaries learnt with different classifiers using the best-fit models: