Implementing LASSO Regression with Coordinate Descent, Sub-Gradient of the L1 Penalty and Soft Thresholding in Python

This problem appeared as an assignment in the coursera course Machine Learning – Regression, part of Machine Learning specialization by the University of Washington. The following description of the problem is taken directly from the assignment.

The goal of this assignment is to implement an LASSO Solver using coordinate descent. The following are the sub-tasks:

  • First the features are needed to be normalized.
  • The coordinate descent for LASSO needs to be implemented (with the subgradient of the L1 penalty).
  • The effects of L1 penalty are going to be explored.

The following dataset (few rows and columns are shown in the below table) is from house sales in King County, the region where the city of Seattle, WA is located. This is going to be our input datatset for training the LASSO model, with the response variable as price.


Normalize features

In the house dataset, features vary wildly in their relative magnitude: sqft_living is very large overall compared to bedrooms, for instance. As a result, weight for sqft_living would be much smaller than weight for bedrooms. This is problematic because “small” weights are dropped first as l1_penalty goes up.

To give equal considerations for all features, we need to normalize features by dividing each feature by its 2-norm so that the transformed feature has norm 1.

Implementing Coordinate Descent with normalized features

 The following theory is going to be used to implement the coordinate descent (for normalized feature_j, we shall have z_j=1):

Effect of L1 penalty

Let us consider a simple model with 2 features:

simple_features = ['sqft_living', 'bedrooms']

Single Coordinate Descent Step

Using the formula above, let’s implement coordinate descent that minimizes the cost function over a single feature i. Note that the intercept (weight 0) is not regularized. The function should accept feature matrix, output, current weights, l1 penalty, and index of feature to optimize over. The function should return new weight for feature i.

Cyclical coordinate descent

Now that we have a function that optimizes the cost function over a single coordinate, let us implement cyclical coordinate descent where we optimize coordinates 0, 1, …, (k-1) in order and repeat.

When do we know to stop? Each time we scan all the coordinates (features) once, we measure the change in weight for each coordinate. If no coordinate changes by more than a specified threshold, we stop.

For each iteration:

  1. As we loop over features in order and perform coordinate descent, measure how much each coordinate changes.
  2. After the loop, if the maximum change across all coordinates is falls below the tolerance, stop. Otherwise, go back to step 1.

Return weights

IMPORTANT: when computing a new weight for coordinate i, make sure to incorporate the new weights for coordinates 0, 1, …, i-1. One good way is to update the weights variable in-place. 

The following figure shows the coefficient path for LASSO for different values of the L1-penalty using two simple features


Evaluating LASSO fit with more features

Let us consider the following set of features.

all_features = ['bedrooms',

The following figure shows the coefficient path for LASSO for different values of the L1-penalty using all the above features:



Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s