# Online Learning: Sentiment Analysis on Amazon Product Review Dataset with Logistic Regression via Stochastic Gradient Ascent in Python

This problem appeared as an assignment in the Coursera course Machine Learning: Classification, part of the Machine Learning Specialization by the University of Washington. The following description of the problem is taken directly from the assignment.

The goal of this assignment is to implement an online logistic regression classifier using stochastic gradient ascent. The following are the sub-tasks:

• The Amazon product reviews dataset, with positive / negative labels, is used as the training dataset.
• Bag-of-words features are extracted from the training dataset; only a pre-selected set of important words is used as features.
• The partial derivative of log likelihood (with L2 penalty) with respect to each single coefficient is computed.
• Stochastic gradient ascent is implemented from scratch.
• Convergence of stochastic gradient ascent is compared with that of batch gradient ascent.

## Load and process review dataset

For this assignment, a subset of the Amazon product review dataset is used. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted mostly of positive reviews.

Preprocessing: We shall work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:

1. Remove punctuation using string manipulation functionality.
2. Compute word counts (only for the important_words)

The data frame products now contains one column for each of the 193 important_words.
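The two transformations above can be sketched in plain Python. This is only an illustration: the four-word `important_words` list and the sample review are made-up stand-ins for the real 193-word list and the review data, and the helper names are hypothetical.

```python
import string
from collections import Counter

# Hypothetical stand-in for the curated list; the real one has 193 words.
important_words = ["great", "love", "bad", "broke"]

def remove_punctuation(text):
    """Return the text with all ASCII punctuation characters removed."""
    return text.translate(str.maketrans("", "", string.punctuation))

def count_important_words(review, words):
    """Count how often each word in `words` occurs in the cleaned review."""
    tokens = remove_punctuation(review).split()
    counts = Counter(tokens)
    # A Counter returns 0 for words that never occur in the review.
    return {word: counts[word] for word in words}

# Made-up review text, used only to exercise the helpers.
review = "great stroller, love it! love the color."
counts = count_important_words(review, important_words)
print(counts)
```

Each entry of the resulting dictionary would become one column of the data frame. Note that the match is case-sensitive in this sketch.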

### Split data into training and validation sets

We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the validation set. We use seed=1 so that everyone gets the same result.

Training set  : 47765 data points
Validation set: 5307 data points
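A 90-10 split with a fixed seed could be sketched with NumPy as below. This is an assumption-laden illustration, not the splitting routine used in the assignment: it cuts a shuffled index array at exactly 90%, so the set sizes it produces need not match the counts reported above. The total of 53072 rows is simply the sum of those two counts.

```python
import numpy as np

def train_validation_split(num_rows, train_fraction=0.9, seed=1):
    """Return shuffled row indices for the training and validation sets."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(num_rows)
    num_train = int(train_fraction * num_rows)
    return shuffled[:num_train], shuffled[num_train:]

# 53072 = 47765 + 5307, the total number of rows reported above.
train_idx, valid_idx = train_validation_split(num_rows=53072)
print(len(train_idx), len(valid_idx))
```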


An additional column 'intercept' (filled with 1's) needs to be inserted into the data frame to account for the intercept term.
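As a small sketch of this step, assuming the word counts are already available as a numeric array (the function name and the toy data are hypothetical), the intercept column can be prepended as follows:

```python
import numpy as np

def build_feature_matrix(word_counts, labels):
    """Prepend a column of ones (the intercept) to the word-count matrix."""
    word_counts = np.asarray(word_counts, dtype=float)
    intercept = np.ones((word_counts.shape[0], 1))
    feature_matrix = np.hstack([intercept, word_counts])
    return feature_matrix, np.asarray(labels)

# Toy data: two reviews, three important words (illustrative only).
toy_counts = [[2, 0, 1], [0, 1, 0]]
toy_labels = [+1, -1]
feature_matrix, sentiment = build_feature_matrix(toy_counts, toy_labels)
print(feature_matrix)
```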

## Building on logistic regression

Now the link function for logistic regression can be defined as:

$$P(y_i = +1 \mid \mathbf{x}_i, \mathbf{w}) = \frac{1}{1 + \exp(-\mathbf{w}^T h(\mathbf{x}_i))}$$

where the feature vector $h(\mathbf{x}_i)$ is given by the word counts of important_words in the review $\mathbf{x}_i$. Our goal is to maximize the log-likelihood function. Since it does not have a closed-form solution, an iterative technique such as gradient ascent needs to be used.

The way the probability predictions are computed is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.
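Because the prediction step is independent of the solver, it can be written once and reused. A minimal sketch, assuming the usual sigmoid link 1 / (1 + exp(-w·h(x))) and a feature matrix whose first column is the intercept:

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    """Return P(y = +1 | x, w) for every row of feature_matrix."""
    scores = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-scores))

# With all-zero coefficients every score is 0, so every probability is 0.5.
toy_features = np.array([[1.0, 2.0, 0.0], [1.0, 0.0, 3.0]])
probabilities = predict_probability(toy_features, np.zeros(3))
print(probabilities)
```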

Note: We are not using regularization in this assignment, but stochastic gradient ascent can also be used for regularized logistic regression. In that case, there is one additional term for regularization in the partial derivative.
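The partial derivative for a single coefficient can be sketched as below. The unregularized part is the dot product of the prediction errors with the feature column. The optional extra term assumes an L2 penalty of the form λ‖w‖² subtracted from the log likelihood, which is an assumption made here for illustration; the parameter names `coefficient` and `l2_penalty` are hypothetical.

```python
import numpy as np

def feature_derivative(errors, feature, coefficient=0.0, l2_penalty=0.0):
    """Partial derivative of the log likelihood w.r.t. one coefficient.

    errors  : indicator(y_i = +1) - P(y_i = +1 | x_i, w), one entry per data point
    feature : the values h_j(x_i) of the feature belonging to this coefficient
    With l2_penalty > 0 the term -2 * l2_penalty * coefficient is added
    (assumed penalty form: l2_penalty * ||w||^2; usually skipped for the intercept).
    """
    derivative = np.dot(errors, feature)
    return derivative - 2.0 * l2_penalty * coefficient

toy_errors = np.array([0.5, -0.5])
toy_feature = np.array([2.0, 1.0])
plain = feature_derivative(toy_errors, toy_feature)
penalized = feature_derivative(toy_errors, toy_feature, coefficient=1.0, l2_penalty=0.1)
print(plain, penalized)
```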

To verify the correctness of the gradient computation and to track the performance of stochastic gradient ascent, we use a function that computes the average log likelihood, written in a numerically stable form.

Note the added 1/N term which averages the log likelihood across all data points. The 1/N term makes it easier for us to compare stochastic gradient ascent with batch gradient ascent.
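A possible implementation of the average log likelihood is sketched below. It assumes labels in {+1, -1} and uses `np.logaddexp` to evaluate log(1 + exp(-score)) without overflow; that choice is one way to obtain numerical stability and is not necessarily how the assignment's helper does it.

```python
import numpy as np

def compute_avg_log_likelihood(feature_matrix, sentiment, coefficients):
    """Average log likelihood (1/N) * sum_i log P(y_i | x_i, w)."""
    indicator = (sentiment == +1).astype(float)
    scores = np.dot(feature_matrix, coefficients)
    # log(1 + exp(-score)), computed stably for large negative scores
    log_term = np.logaddexp(0.0, -scores)
    return np.sum((indicator - 1.0) * scores - log_term) / len(feature_matrix)

# With all-zero coefficients every point has probability 0.5,
# so the average log likelihood is log(0.5).
toy_features = np.array([[1.0, 2.0], [1.0, -1.0]])
toy_sentiment = np.array([+1, -1])
avg_ll = compute_avg_log_likelihood(toy_features, toy_sentiment, np.zeros(2))
print(avg_ll)
```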

In other words, we update the coefficients using the average gradient over data points (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of all the data points), and study the effect of batch size on the algorithm as well as the choice of step size.

Now let’s extend the algorithm for batch gradient ascent (which takes all the data at once) to stochastic gradient ascent (which takes one data point at a time) and mini-batch gradient ascent (which takes data in small batches as input, computes the gradient and updates the coefficients). The following figure shows the gradient ascent algorithm, in which the gradient needs to be scaled down by the appropriate batch size.
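A compact sketch of mini-batch gradient ascent along these lines is given below. It is an illustration rather than the assignment's reference code: the function signature, the reshuffling rule (restart with a fresh shuffle when a full batch no longer fits) and the choice to log the batch likelihood after each update are assumptions. It also assumes `batch_size` does not exceed the number of data points.

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    """P(y = +1 | x, w) under the sigmoid link."""
    return 1.0 / (1.0 + np.exp(-np.dot(feature_matrix, coefficients)))

def compute_avg_log_likelihood(feature_matrix, sentiment, coefficients):
    """Average log likelihood over the rows of feature_matrix."""
    indicator = (sentiment == +1).astype(float)
    scores = np.dot(feature_matrix, coefficients)
    return np.mean((indicator - 1.0) * scores - np.logaddexp(0.0, -scores))

def logistic_regression_SG(feature_matrix, sentiment, initial_coefficients,
                           step_size, batch_size, max_iter, seed=1):
    """Mini-batch gradient ascent; batch_size=1 is stochastic gradient
    ascent and batch_size=len(feature_matrix) is batch gradient ascent."""
    rng = np.random.default_rng(seed)
    coefficients = np.array(initial_coefficients, dtype=float)
    num_points = len(feature_matrix)
    order = rng.permutation(num_points)  # visit the data in random order
    log_likelihood_all = []
    start = 0
    for _ in range(max_iter):
        batch = order[start:start + batch_size]
        x_batch, y_batch = feature_matrix[batch], sentiment[batch]
        errors = (y_batch == +1) - predict_probability(x_batch, coefficients)
        # Average gradient over the batch: sum of derivatives / batch_size
        coefficients += step_size * np.dot(x_batch.T, errors) / batch_size
        log_likelihood_all.append(
            compute_avg_log_likelihood(x_batch, y_batch, coefficients))
        start += batch_size
        if start + batch_size > num_points:
            # Not enough points left for a full batch: reshuffle and restart
            order = rng.permutation(num_points)
            start = 0
    return coefficients, log_likelihood_all

# Toy, linearly separable data (illustrative only).
toy_features = np.array([[1.0, 2.0], [1.0, -2.0], [1.0, 3.0], [1.0, -3.0]])
toy_sentiment = np.array([+1, -1, +1, -1])
coefficients, trace = logistic_regression_SG(
    toy_features, toy_sentiment, np.zeros(2),
    step_size=0.5, batch_size=4, max_iter=50)
print(coefficients, trace[0], trace[-1])
```

Dividing by `batch_size` keeps the update magnitude comparable across batch sizes, so the same function can serve for the stochastic, mini-batch and full-batch runs.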

## Compare convergence behavior of stochastic gradient ascent

For the remainder of the assignment, let’s compare stochastic gradient ascent against batch gradient ascent. For this, we need a reference implementation of batch gradient ascent.

Let’s now run stochastic gradient ascent over the feature_matrix_train for 10 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = 1
• max_iter = 10
Iteration 0: Average log likelihood (of data points in batch [00000:00001]) = -0.01416346
Iteration 1: Average log likelihood (of data points in batch [00001:00002]) = -0.00505439
Iteration 2: Average log likelihood (of data points in batch [00002:00003]) = -0.00177457
Iteration 3: Average log likelihood (of data points in batch [00003:00004]) = -0.00311449
Iteration 4: Average log likelihood (of data points in batch [00004:00005]) = -0.06140707
Iteration 5: Average log likelihood (of data points in batch [00005:00006]) = -0.00000011
Iteration 6: Average log likelihood (of data points in batch [00006:00007]) = -0.02461738
Iteration 7: Average log likelihood (of data points in batch [00007:00008]) = -0.00876472
Iteration 8: Average log likelihood (of data points in batch [00008:00009]) = -0.00003921
Iteration 9: Average log likelihood (of data points in batch [00009:00010]) = -0.00000620

As expected, with stochastic gradient ascent (i.e., with batch size 1), the average log likelihood in the batch fluctuates from one iteration to the next.

Now let’s again run batch gradient ascent over the feature_matrix_train but this time for 200 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = # data points in the training dataset
• max_iter = 200
Iteration   0: Average log likelihood (of data points in batch [00000:47765]) = -0.68313840
Iteration   1: Average log likelihood (of data points in batch [00000:47765]) = -0.67402166
Iteration   2: Average log likelihood (of data points in batch [00000:47765]) = -0.66563558
Iteration   3: Average log likelihood (of data points in batch [00000:47765]) = -0.65788618
Iteration   4: Average log likelihood (of data points in batch [00000:47765]) = -0.65070149
Iteration   5: Average log likelihood (of data points in batch [00000:47765]) = -0.64402111
Iteration   6: Average log likelihood (of data points in batch [00000:47765]) = -0.63779295
Iteration   7: Average log likelihood (of data points in batch [00000:47765]) = -0.63197173
Iteration   8: Average log likelihood (of data points in batch [00000:47765]) = -0.62651787
Iteration   9: Average log likelihood (of data points in batch [00000:47765]) = -0.62139668
Iteration  10: Average log likelihood (of data points in batch [00000:47765]) = -0.61657763
Iteration  11: Average log likelihood (of data points in batch [00000:47765]) = -0.61203378
Iteration  12: Average log likelihood (of data points in batch [00000:47765]) = -0.60774127
Iteration  13: Average log likelihood (of data points in batch [00000:47765]) = -0.60367892
Iteration  14: Average log likelihood (of data points in batch [00000:47765]) = -0.59982787
Iteration  15: Average log likelihood (of data points in batch [00000:47765]) = -0.59617125
Iteration 100: Average log likelihood (of data points in batch [00000:47765]) = -0.49541833
Iteration 199: Average log likelihood (of data points in batch [00000:47765]) = -0.47143083


As expected, with (full) batch gradient ascent, as each iteration passes, the average log likelihood in the batch continuously increases.

## Make “passes” over the dataset

To make a fair comparison between stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes, where the number of passes is the number of data points touched so far divided by the size of the dataset.
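One way to convert between iterations and passes is sketched below, assuming a pass means touching as many data points as the dataset contains and that only full batches are used within a pass. Both helper names are hypothetical.

```python
def passes_completed(num_iterations, batch_size, num_data_points):
    """Fraction of the dataset touched after `num_iterations` updates."""
    return num_iterations * batch_size / num_data_points

def iterations_for_passes(num_passes, batch_size, num_data_points):
    """Number of updates needed for `num_passes` passes (full batches only)."""
    return num_passes * (num_data_points // batch_size)

# With 47765 training points and batch_size=100 there are 477 full batches per pass.
print(iterations_for_passes(10, 100, 47765))
print(iterations_for_passes(200, 100, 47765))
print(passes_completed(477, 100, 47765))
```

Under these assumptions, 10 passes correspond to 4770 updates and 200 passes to 95400 updates, so the last zero-based iteration indices are 4769 and 95399.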

## Log likelihood plots for stochastic gradient ascent

With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078


Let’s plot the average log likelihood as a function of the number of passes.

## Smoothing the stochastic gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter smoothing_window. The smoothing is simply a moving average of log likelihood over the last smoothing_window “iterations” of stochastic gradient ascent.
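The moving average can be sketched with `np.convolve`. Using `mode="valid"` (which returns only the positions where the window fits entirely) is a choice made for this illustration; the plotting helper in the assignment may treat the edges differently.

```python
import numpy as np

def smooth(log_likelihoods, smoothing_window):
    """Moving average of the log likelihood over `smoothing_window` iterations."""
    values = np.asarray(log_likelihoods, dtype=float)
    kernel = np.ones(smoothing_window) / smoothing_window
    # "valid" keeps only windows that lie fully inside the series
    return np.convolve(values, kernel, mode="valid")

smoothed = smooth([1.0, 2.0, 3.0, 4.0, 5.0], smoothing_window=2)
print(smoothed)
```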

To compare convergence rates for stochastic gradient ascent with batch gradient ascent, let’s plot the change in log-likelihood with the iterations.

We are comparing:

• stochastic gradient ascent: step_size = 0.1, batch_size=100
• batch gradient ascent: step_size = 0.5, batch_size= # data points

Write code to run stochastic gradient ascent for 200 passes using:

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration     0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration     1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration     2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration     3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration     4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration     5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration     6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration     7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration     8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration     9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration    10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration    11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration    12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration    13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration    14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration    15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration   100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration   200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration   300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration   400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration   500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration   600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration   700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration   800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration   900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration  1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration  2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration  3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration  4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration  5000: Average log likelihood (of data points in batch [23000:23100]) = -0.41321275
Iteration  6000: Average log likelihood (of data points in batch [27600:27700]) = -0.42095621
Iteration  7000: Average log likelihood (of data points in batch [32200:32300]) = -0.47438456
Iteration  8000: Average log likelihood (of data points in batch [36800:36900]) = -0.40689130
Iteration  9000: Average log likelihood (of data points in batch [41400:41500]) = -0.44582019
Iteration 10000: Average log likelihood (of data points in batch [46000:46100]) = -0.39752726
Iteration 20000: Average log likelihood (of data points in batch [44300:44400]) = -0.50001293
Iteration 30000: Average log likelihood (of data points in batch [42600:42700]) = -0.44909961
Iteration 40000: Average log likelihood (of data points in batch [40900:41000]) = -0.41075257
Iteration 50000: Average log likelihood (of data points in batch [39200:39300]) = -0.47957450
Iteration 60000: Average log likelihood (of data points in batch [37500:37600]) = -0.42584682
Iteration 70000: Average log likelihood (of data points in batch [35800:35900]) = -0.37312738
Iteration 80000: Average log likelihood (of data points in batch [34100:34200]) = -0.41330111
Iteration 90000: Average log likelihood (of data points in batch [32400:32500]) = -0.47600432
Iteration 95399: Average log likelihood (of data points in batch [47600:47700]) = -0.47449630


We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with smoothing_window=30.

As can be seen from the figure above, batch gradient ascent needs at least 150 passes to achieve a log likelihood similar to that of stochastic gradient ascent.

## Explore the effects of step sizes (learning rate) on stochastic gradient ascent

To start, let’s explore a wide range of step sizes that are equally spaced in the log space and run stochastic gradient ascent with step_size set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2, using the following set of parameters:

• initial_coefficients=zeros
• batch_size=100
• max_iter initialized so as to run 10 passes over the data.
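The grid of step sizes and the iteration budget for this experiment could be set up as in the sketch below. It only builds the settings and does not run the solver; the variable names are illustrative, and the training-set size of 47765 is taken from the split reported earlier.

```python
import numpy as np

# Seven step sizes equally spaced in log space: 1e-4, 1e-3, ..., 1e2
step_sizes = np.logspace(-4, 2, num=7)

batch_size = 100
num_passes = 10
num_data_points = 47765  # size of the training set reported earlier

# Number of updates for 10 passes, counting only full batches per pass
max_iter = num_passes * (num_data_points // batch_size)

for step_size in step_sizes:
    print(f"step_size={step_size:g}, batch_size={batch_size}, max_iter={max_iter}")
```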
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69313572
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69313813
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69312970
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69313664
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69312912
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69312352
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69312565
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69312596
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69312480
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69311820
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69312342
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69309997
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69310417
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69311459
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69311289
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69310315
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69299666
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.69262795
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.69259227
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.69216304
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.69184517
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.69233727
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.69184444
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.69162156
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.69137017
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.69116453
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.68868229
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.68748389
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.68381866
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.68213944
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69303256
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69305660
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69297245
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69304176
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69296671
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69291077
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69293204
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69293505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69292337
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69285782
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69290957
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69267562
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69271772
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69282167
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69280442
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69270726
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69163566
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.68802729
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.68772431
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.68369142
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.68064510
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.68541475
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.68090549
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.67879020
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.67693059
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.67539881
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.65759465
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.64745516
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.62162582
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.61371736
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69200364
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69223670
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69141056
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69209296
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69135181
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69080412
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69100987
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69103436
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69091067
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69029154
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69076626
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.68848541
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.68891938
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.68992883
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.68973094
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.68878712
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.67788829
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.64832833
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.64792583
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.62345602
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.60389801
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.63841711
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.61096879
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.59948158
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.59326446
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.59519901
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.54578301
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.51997970
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46497627
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.46731743
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.60671913
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.61435096
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.57582992
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.60419455
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.56461895
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.55369469
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.56188804
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.56098460
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.53659402
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.54855532
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.54643770
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.45436554
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.51347209
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.53114501
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.51323915
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.50234155
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.44799354
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.38955785
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.41840095
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.42081935
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.35694652
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.44873903
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.42411231
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.47364810
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.36346510
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.47974711
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.45019123
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.39097701
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.34895703
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.40687203
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.75339506
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -5.19955914
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -1.35343001
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -3.63980553
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -1.05854033
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -1.11538249
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.86603585
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67571232
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.83674532
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -1.07638709
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -1.20809203
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.90955296
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -1.58077817
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.77787311
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.62852240
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.70284036
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.62403867
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.30394690
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.56782701
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.48147752
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.43850709
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.43008741
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -1.11804684
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.53574169
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.31389670
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.61323575
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.76125665
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.02808956
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46513784
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.47187763
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -5.69805760
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -61.13545979
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -7.42566120
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -27.73988196
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -15.42138334
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -11.20038218
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -11.11461119
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -8.49925505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -16.55690529
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -16.98581355
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -13.56402260
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -6.25429003
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -16.59818372
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -8.45630034
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -4.98292077
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -7.47891216
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -6.88355879
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -3.34058488
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -3.83117603
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -6.29604810
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -2.99625091
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -2.31125647
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -8.19978746
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -3.76650208
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -4.06151269
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -12.66107559
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -13.33445580
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.63544030
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -3.55951973
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -3.41717551


### Plotting the log likelihood as a function of passes for each step size

Now, let’s plot the change in log likelihood for each of the following values of step_size:

• step_size = 1e-4
• step_size = 1e-3
• step_size = 1e-2
• step_size = 1e-1
• step_size = 1e0
• step_size = 1e1
• step_size = 1e2

For consistency, we again apply smoothing_window=30.
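The smoothing step can be sketched as a moving average over the per-iteration log likelihood values, with the x-axis rescaled from iterations to passes over the data. The helper names `smooth_log_likelihood`, `passes_axis` and `plot_runs` below are hypothetical, and so is the assumption that each run is stored as a list with one average log likelihood value per iteration. This is a minimal sketch under those assumptions, not necessarily the plotting code used in the course.

```python
import numpy as np

def smooth_log_likelihood(log_likelihood_all, smoothing_window=30):
    """Moving average over `smoothing_window` consecutive iterations.

    Assumes the run has at least `smoothing_window` values.
    """
    values = np.asarray(log_likelihood_all, dtype=float)
    kernel = np.ones(smoothing_window) / smoothing_window
    # mode='valid' keeps only the positions where the window fits entirely.
    return np.convolve(values, kernel, mode='valid')

def passes_axis(num_iterations, len_data, batch_size, smoothing_window=30):
    """Number of passes over the data at each smoothed point."""
    iterations = np.arange(smoothing_window - 1, num_iterations)
    return iterations * float(batch_size) / len_data

def plot_runs(runs, len_data, batch_size, smoothing_window=30):
    """Plot one smoothed curve per step size.

    `runs` is a hypothetical dict: step_size -> list of log likelihoods.
    """
    import matplotlib.pyplot as plt  # imported here so the helpers above work without it
    for step_size, log_likelihood_all in sorted(runs.items()):
        y = smooth_log_likelihood(log_likelihood_all, smoothing_window)
        x = passes_axis(len(log_likelihood_all), len_data, batch_size,
                        smoothing_window)
        plt.plot(x, y, label='step_size=%.0e' % step_size)
    plt.xlabel('# of passes over data')
    plt.ylabel('Average log likelihood per data point')
    plt.legend(loc='lower right')
    plt.show()
```

With the setup used here, a call might look like `plot_runs(runs, len_data=47765, batch_size=100, smoothing_window=30)`, where `runs` holds the seven step sizes listed above.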

Now, let us drop step_size = 1e2 and plot the remaining curves.

As can be seen from the above plots, step size 1e2 gives the worst result, while step sizes 1 and 0.1 give the best results in terms of convergence.
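One way to make this comparison less dependent on reading the plots is to compare the mean log likelihood over the last few iterations of each run. The sketch below assumes a hypothetical dictionary mapping each step size to its list of per-iteration log likelihood values. The function name `rank_step_sizes` is made up for illustration, and the toy numbers are not the results reported above.

```python
def rank_step_sizes(runs, tail=30):
    """Rank step sizes by mean log likelihood over the last `tail` iterations.

    Returns (step sizes ordered best to worst, dict of tail averages).
    Assumes every run contains at least one value.
    """
    scores = {}
    for step_size, values in runs.items():
        last = values[-tail:]
        scores[step_size] = sum(last) / len(last)
    # A higher (less negative) log likelihood is better.
    ranking = sorted(scores, key=scores.get, reverse=True)
    return ranking, scores

# Toy, made-up values for illustration only.
toy_runs = {
    1e-1: [-0.7, -0.5, -0.4, -0.4],
    1e2: [-5.0, -60.0, -7.0, -20.0],
}
ranking, scores = rank_step_sizes(toy_runs, tail=2)
print(ranking)
```

A tail average taken from a single run is still noisy, so it is best read alongside the smoothed curves rather than instead of them.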