Convergence of gradient descent algorithms (stochastic and batch) for linear regression, logistic regression and the perceptron

This article demonstrates the convergence of online (stochastic) and batch gradient descent for linear regression and logistic regression on a few datasets. It also demonstrates the convergence of the online and batch versions of the perceptron learning algorithm on a synthetically generated dataset.

  1. The following simple dataset (obtained from Andrew Ng’s Coursera ML course) is used for running the gradient descent algorithms.
    [Figure: linreg]

    ## [1] "Theta found by running batch gradient descent (with 600 iterations): "
    ##                [,1]
    ## rep(1, m) -3.112591
    ## data[, 1]  1.114354
    ## [1] "Theta found by Stochastic gradient descent (with 10 repeats): "
    ##           [,1]
    ## [1,] -1.947365
    ## [2,]  1.251294
    ## [1] "Theta found with normal equation: "
    ##                [,1]
    ## rep(1, m) -3.895781
    ## data[, 1]  1.193034
  2. The next animation shows how the stochastic (noisy, slower convergence) and batch (steady, much faster convergence) gradient descent algorithms converge to the theta value that minimizes the cost function (computed with the normal equation and shown as a red dot in the figures). The least-squares cost and both update rules are written out after this list.
    [Animation: grad_reg1.gif]
  3. The following figure shows the stochastic and batch versions of the gradient descent algorithm; a minimal R sketch of both versions is also given after this list.
    [Figure: galgo.png]
  4. Next, the gradient descent algorithms are run on two variables of the mtcars dataset in R, with wt (weight) as the predictor and mpg (miles per gallon) as the response variable. The following figures / animations show the results (see also the usage sketch after this list).
    [Figure: linreg2]
    ## [1] "Theta found by running batch gradient descent (with 600 iterations): "
    ##                [,1]
    ## rep(1, m) 37.248255
    ## data[, 1] -5.333882
    ## [1] "Theta found by Stochastic gradient descent (with 10 repeats): "
    ##           [,1]
    ## [1,] 25.954195
    ## [2,] -2.342769
    ## [1] "Theta found with normal equation: "
    ##                [,1]
    ## rep(1, m) 37.285126
    ## data[, 1] -5.344472

    [Animation: grad_reg2]

     

  5. Now the response variable mpg (miles per gallon) is converted to a binary variable by binning it around its median: mpg is coded as high (1) if it is greater than 19 and 0 otherwise. The gradient descent algorithms are then run again on the mtcars variables wt (predictor) and the binarized mpg (response), this time using the following cost function for logistic regression.
    [Figure: logit — the cross-entropy cost J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log h_\theta(x^{(i)}) + (1-y^{(i)})\log(1-h_\theta(x^{(i)}))\right], with h_\theta(x)=\frac{1}{1+e^{-\theta^T x}}]
  6. The following figures / animations show the results of running the gradient descent algorithms and their convergence. The true minimum is computed using a more efficient optimization algorithm (such as conjugate gradient) and is shown as a red dot in the figures (a sketch of this setup is given after this list).
    [Animation: grad_log]
  7. Next, let's compare the convergence rates of the batch and the online (stochastic) versions of the perceptron (pocket) learning algorithm on a couple of uniformly randomly generated, linearly separable 2-D datasets. The following figure shows the batch and the stochastic versions of the perceptron learning algorithm; a minimal R sketch is given after this list.
    [Figure: palgo]
  8. As can be seen from the following animations, on the first synthetic dataset the batch version converges much faster than its online counterpart, whereas on the second dataset the online version converges earlier.
    [Animation: o1b1]
    [Animation: o2b2]
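
For reference, a standard way to write the least-squares cost minimized in items 1–4 and the corresponding batch and stochastic update rules (presumably what the galgo.png figure depicts; alpha is the learning rate and x^(i) is the i-th input with a leading 1 for the intercept term):

    Least-squares cost:         J(\theta) = \frac{1}{2m} \sum_{i=1}^{m} \left( \theta^T x^{(i)} - y^{(i)} \right)^2
    Batch update:               \theta := \theta - \frac{\alpha}{m} \, X^T (X\theta - y)
    Stochastic (online) update: \theta := \theta - \alpha \left( \theta^T x^{(i)} - y^{(i)} \right) x^{(i)}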
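
Below is a minimal R sketch (not the post's original code) of both gradient descent variants and the normal-equation solution for linear regression. The function names (batch_gd, sgd, normal_eq) and the learning rate alpha are illustrative assumptions; only the iteration count (600) and the number of stochastic repeats (10) come from the output above.

    # Minimal sketch, not the post's original code: batch and stochastic gradient
    # descent for linear regression, plus the normal equation used as ground truth.
    # alpha is an assumed learning rate; batch_gd / sgd / normal_eq are assumed names.

    batch_gd <- function(X, y, alpha = 0.01, n_iter = 600) {
      m <- nrow(X)
      theta <- matrix(0, ncol(X), 1)
      for (it in 1:n_iter) {
        grad <- t(X) %*% (X %*% theta - y) / m   # gradient of the least-squares cost
        theta <- theta - alpha * grad
      }
      theta
    }

    sgd <- function(X, y, alpha = 0.01, n_repeats = 10) {
      m <- nrow(X)
      theta <- matrix(0, ncol(X), 1)
      for (r in 1:n_repeats) {
        for (i in sample(m)) {                   # one (shuffled) example per update
          xi <- matrix(X[i, ], 1)
          theta <- theta - alpha * t(xi) %*% (xi %*% theta - y[i])
        }
      }
      theta
    }

    normal_eq <- function(X, y) solve(t(X) %*% X, t(X) %*% y)  # closed-form minimizer

    # Usage on the dataset of item 1 (here `data` is assumed to be its two-column
    # data frame / matrix, first column = input, second column = output):
    # X <- cbind(rep(1, nrow(data)), data[, 1]); y <- data[, 2]
    # batch_gd(X, y); sgd(X, y); normal_eq(X, y)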
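
Assuming the sketch functions above, the mtcars experiment of item 4 can be reproduced roughly as follows; lm() gives the same closed-form fit as the normal equation.

    # mtcars: wt as the single predictor, mpg as the response (item 4).
    X <- cbind(rep(1, nrow(mtcars)), mtcars$wt)
    y <- mtcars$mpg
    batch_gd(X, y, alpha = 0.01, n_iter = 600)   # approaches the closed-form fit
    sgd(X, y, alpha = 0.01, n_repeats = 10)      # noisier estimate
    normal_eq(X, y)
    coef(lm(mpg ~ wt, data = mtcars))            # reference: (Intercept) 37.285, wt -5.344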
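
For items 5–6, a minimal sketch of batch gradient descent for logistic regression on the binarized mtcars data, with the cross-entropy cost handed to a conjugate-gradient optimizer (optim with method = "CG") to obtain the reference minimum shown as the red dot. The learning rate and iteration count are assumed values.

    sigmoid <- function(z) 1 / (1 + exp(-z))

    # Cross-entropy cost of item 5.
    cost <- function(theta, X, y) {
      h <- sigmoid(X %*% theta)
      -mean(y * log(h) + (1 - y) * log(1 - h))
    }

    logistic_gd <- function(X, y, alpha = 0.1, n_iter = 1000) {  # assumed settings
      m <- nrow(X)
      theta <- matrix(0, ncol(X), 1)
      for (it in 1:n_iter) {
        grad <- t(X) %*% (sigmoid(X %*% theta) - y) / m   # gradient of the cost above
        theta <- theta - alpha * grad
      }
      theta
    }

    X <- cbind(1, mtcars$wt)
    y <- as.numeric(mtcars$mpg > 19)                        # binarized response of item 5
    logistic_gd(X, y)
    optim(c(0, 0), cost, X = X, y = y, method = "CG")$par   # reference minimum (red dot)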
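
Finally, a minimal sketch of the batch and online perceptron updates of items 7–8 on a synthetic, linearly separable 2-D dataset. The data-generating separating line, the ±1 label coding and the function names are illustrative assumptions, and only the plain perceptron update is shown; the pocket refinement (keeping the best weight vector found so far) is omitted.

    set.seed(1)
    n <- 100
    X <- cbind(1, matrix(runif(2 * n, -1, 1), n, 2))   # bias column + 2 uniform features
    y <- ifelse(X[, 2] + X[, 3] > 0, 1, -1)            # labels from an assumed separating line

    perceptron_batch <- function(X, y, n_iter = 100) {
      w <- rep(0, ncol(X))
      for (it in 1:n_iter) {
        mis <- which(sign(X %*% w) != y)               # all currently misclassified points
        if (length(mis) == 0) break                    # converged: everything separated
        w <- w + colSums(y[mis] * X[mis, , drop = FALSE])  # one update using all mistakes
      }
      w
    }

    perceptron_online <- function(X, y, n_epochs = 100) {
      w <- rep(0, ncol(X))
      for (ep in 1:n_epochs) {
        for (i in sample(nrow(X))) {                   # visit examples one at a time
          if (sign(sum(X[i, ] * w)) != y[i]) w <- w + y[i] * X[i, ]
        }
      }
      w
    }

    perceptron_batch(X, y)
    perceptron_online(X, y)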