This problem appeared as an assignment in the Coursera course **Machine Learning: Classification**, part of the *Machine Learning* specialization by the University of Washington. The following description of the problem is taken directly from the assignment.

The goal of this assignment is to implement an *online logistic regression classifier* using *stochastic gradient ascent*. The following are the sub-tasks:

- The *Amazon product reviews* dataset, with positive/negative labels, is used as the training dataset.
- *Bag-of-words features* are extracted from the training dataset; only a pre-selected set of important words is used as features.
- The *partial derivative of the log likelihood* (with an *L2 penalty*) with respect to each single coefficient is computed.
- *Stochastic gradient ascent* is implemented from scratch.
- The *convergence* of *stochastic gradient ascent* is compared with that of *batch gradient ascent*.

## Load and process review dataset

For this assignment, a subset of Amazon product review dataset is going to be used. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted of mostly positive reviews.

**Preprocessing**: We shall work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:

- Remove punctuation using string manipulation functionality.
- Compute word counts (only for the important_words)

The data frame **products** now contains one column for each of the 193 **important_words**.
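The two preprocessing steps above can be sketched as follows. This is a minimal illustration, not the course's exact code: it represents each review as a plain string and returns the word counts as a list of dicts rather than data-frame columns.

```python
import string

def extract_word_counts(reviews, important_words):
    """Remove punctuation from each review and count occurrences of the
    important words. Returns one dict per review mapping word -> count."""
    # translation table that deletes all punctuation characters
    table = str.maketrans('', '', string.punctuation)
    features = []
    for review in reviews:
        words = review.translate(table).split()
        features.append({w: words.count(w) for w in important_words})
    return features
```

In the assignment, the same counts are stored as one column per word in the **products** data frame.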

### Split data into training and validation sets

We will now split the data into a 90-10 split, where 90% goes into the training set and 10% into the validation set. We use `seed=1` so that everyone gets the same result.

An additional column ‘*intercept*‘ (filled with 1’s) needs to be inserted into the data frame to account for the intercept term.
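One way to build the feature matrix with the intercept column is sketched below. The input format (a list of word-count dicts) is a hypothetical stand-in for the data frame used in the assignment; column 0 holds the intercept.

```python
import numpy as np

def get_numpy_data(word_counts, important_words, sentiments):
    """Convert per-review word counts into a NumPy feature matrix whose
    first column is all 1s (the intercept), plus a label array."""
    n = len(word_counts)
    feature_matrix = np.ones((n, len(important_words) + 1))  # column 0 = intercept
    for i, counts in enumerate(word_counts):
        for j, word in enumerate(important_words):
            feature_matrix[i, j + 1] = counts.get(word, 0)
    sentiment = np.array(sentiments)
    return feature_matrix, sentiment
```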

## Building on logistic regression

Now the *link function* for **logistic regression** can be defined as:

P(y_i = +1 | x_i, w) = 1 / (1 + exp(−w⊤ h(x_i))),

where the feature vector h(x_i) is given by the word counts of **important_words** in the review x_i. Our goal is to maximize the log-likelihood function, which has no closed-form solution, so techniques like *gradient ascent* need to be used.

The way the probability predictions are computed is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.
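The probability prediction from the link function can be sketched as below; the function name follows the course's convention, but the body is a straightforward vectorized sigmoid.

```python
import numpy as np

def predict_probability(feature_matrix, coefficients):
    """P(y = +1 | x, w) = 1 / (1 + exp(-w^T h(x))), for every row at once."""
    score = np.dot(feature_matrix, coefficients)
    return 1.0 / (1.0 + np.exp(-score))
```

With all coefficients at zero, every review gets probability 0.5, which is the starting point of gradient ascent.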

**Note**. We are not using regularization in this assignment, but stochastic gradient ascent can also be used for regularized logistic regression; the partial derivative then gains one additional term for the regularization.

To track the performance of stochastic gradient ascent and to verify the correctness of the gradient computation, we use a function for computing the **average log likelihood**, written in a numerically stable form.

**Note** the added **1/N** term, which averages the log likelihood across all data points and makes it easier to compare stochastic gradient ascent with batch gradient ascent.

In other words, we update the coefficients using the **average gradient over data points** (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of **all the data points**), and study the effect of batch size on the algorithm as well as the choice of step size.
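The per-coefficient derivative and the average log likelihood can be sketched as follows; `errors` denotes the vector of 1[y_i = +1] − P(y_i = +1 | x_i, w) terms, as in the course, and the stability trick uses `np.logaddexp` to avoid overflow in log(1 + exp(−score)).

```python
import numpy as np

def feature_derivative(errors, feature):
    """Partial derivative of the log likelihood w.r.t. one coefficient:
    sum over data points of h_j(x_i) * (1[y_i = +1] - P(y_i = +1 | x_i, w))."""
    return np.dot(errors, feature)

def compute_avg_log_likelihood(feature_matrix, sentiment, coefficients):
    """Average (1/N) log likelihood over the data, computed stably."""
    indicator = (sentiment == +1)
    scores = np.dot(feature_matrix, coefficients)
    # log(1 + exp(-s)) computed stably as logaddexp(0, -s)
    logexp = np.logaddexp(0, -scores)
    return np.sum((indicator - 1) * scores - logexp) / len(feature_matrix)
```

At the all-zeros starting point every score is 0, so the average log likelihood is −log 2 ≈ −0.693; this is a handy sanity check for any implementation.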

## Implementing stochastic gradient ascent

Now let’s extend the algorithm for *batch gradient ascent* (which takes all the data at once) to *stochastic* gradient ascent (which takes one data point at a time) and *mini-batch* gradient ascent (which takes the data in small batches, computes the gradient on each batch, and updates the coefficients). The following figure shows the gradient ascent algorithm; the gradient needs to be scaled down by the appropriate batch size.
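A sketch of the mini-batch version is below. It is not the course's exact code: the shuffling and wrap-around details are one reasonable choice among several. Setting `batch_size=1` gives stochastic gradient ascent; setting it to the number of data points recovers batch gradient ascent.

```python
import numpy as np

def logistic_regression_SG(feature_matrix, sentiment, initial_coefficients,
                           step_size, batch_size, max_iter, seed=1):
    """Mini-batch stochastic gradient ascent for logistic regression."""
    coefficients = np.array(initial_coefficients, dtype=float)
    rng = np.random.RandomState(seed)
    perm = rng.permutation(len(feature_matrix))
    feature_matrix, sentiment = feature_matrix[perm], sentiment[perm]
    i = 0  # start index of the current batch
    for _ in range(max_iter):
        batch = slice(i, i + batch_size)
        scores = np.dot(feature_matrix[batch], coefficients)
        predictions = 1.0 / (1.0 + np.exp(-scores))
        errors = (sentiment[batch] == +1) - predictions
        for j in range(len(coefficients)):
            derivative = np.dot(errors, feature_matrix[batch, j])
            # divide by batch_size: use the AVERAGE gradient over the batch
            coefficients[j] += step_size * derivative / batch_size
        i += batch_size
        if i + batch_size > len(feature_matrix):
            # not enough points left for a full batch: reshuffle, start a new pass
            perm = rng.permutation(len(feature_matrix))
            feature_matrix, sentiment = feature_matrix[perm], sentiment[perm]
            i = 0
    return coefficients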

## Compare convergence behavior of stochastic gradient ascent

For the remainder of the assignment, let’s compare *stochastic gradient ascent* against *batch gradient ascent*. For this, we need a reference implementation of batch gradient ascent.

## Running gradient ascent using the stochastic gradient ascent implementation

Let’s now **run stochastic gradient ascent** over **feature_matrix_train** for 10 iterations, using:

- `initial_coefficients = zeros`
- `step_size = 5e-1`
- `batch_size = 1`
- `max_iter = 10`

Note that `batch_size = 1` makes this *stochastic gradient ascent* proper (one data point per update).

Now let’s again run **batch gradient ascent** over **feature_matrix_train**, but this time for 200 iterations, using:

- `initial_coefficients = zeros`
- `step_size = 5e-1`
- `batch_size = # data points in the training dataset`
- `max_iter = 200`

As expected, with (full) batch gradient ascent, as each iteration passes, the average log likelihood in the batch continuously increases.

## Make “passes” over the dataset

To make a fair comparison between stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes, where

[# of passes] = [# of data points touched so far] / [size of dataset].
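With this definition, the number of passes after `max_iter` iterations at a given `batch_size` reduces to a one-liner:

```python
def num_passes(max_iter, batch_size, num_data_points):
    """# of passes = (# of iterations * batch_size) / (size of dataset)."""
    return max_iter * batch_size / num_data_points
```

For example, 10 iterations with `batch_size = 1` on a 47,000-point dataset is a tiny fraction of one pass, while 200 full-batch iterations is 200 passes.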

## Log likelihood plots for stochastic gradient ascent

With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use `step_size=1e-1`, `batch_size=100`, and `initial_coefficients` set to all zeros.

Let’s plot the average log likelihood as a function of the number of passes.

## Smoothing the stochastic gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter `smoothing_window`. The smoothing is simply a moving average of the log likelihood over the last `smoothing_window` “iterations” of stochastic gradient ascent.
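The moving average can be sketched as a convolution with a flat kernel; this is one standard way to implement it, not necessarily the course's exact code.

```python
import numpy as np

def smooth_curve(log_likelihood_all, smoothing_window):
    """Moving average of the per-iteration log likelihoods over the
    last `smoothing_window` iterations."""
    kernel = np.ones(smoothing_window) / smoothing_window
    # 'valid' keeps only positions where a full window is available
    return np.convolve(log_likelihood_all, kernel, mode='valid')
```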

## Stochastic gradient ascent vs batch gradient ascent

To compare convergence rates for stochastic gradient ascent with batch gradient ascent, let’s plot the change in log-likelihood with the iterations.

We are comparing:

- **stochastic gradient ascent**: `step_size = 0.1`, `batch_size = 100`
- **batch gradient ascent**: `step_size = 0.5`, `batch_size = # data points`

Write code to run stochastic gradient ascent for 200 passes, using `step_size=1e-1`, `batch_size=100`, and `initial_coefficients` set to all zeros.

We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with `smoothing_window=30`.

As can be seen from the figure above, batch gradient ascent needs at least 150 passes to achieve an average log likelihood similar to that of stochastic gradient ascent.

## Explore the effects of step sizes (learning rate) on stochastic gradient ascent

To start, let’s explore a wide range of step sizes that are equally spaced in log space, running *stochastic gradient ascent* with `step_size` set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2, using the following set of parameters:

- `initial_coefficients = zeros`
- `batch_size = 100`
- `max_iter` set so as to run 10 passes over the data

### Plotting the log likelihood as a function of passes for each step size

Now, let’s plot the change in log likelihood for each of the following values of `step_size`: `1e-4`, `1e-3`, `1e-2`, `1e-1`, `1e0`, `1e1`, and `1e2`. For consistency, we again apply `smoothing_window=30`.

Now, let us remove the step size `step_size = 1e2` and plot the rest of the curves.