This problem appeared as an assignment in the coursera course Machine Learning – Regression, part of Machine Learning specialization by the University of Washington. The following description of the problem is taken directly from the assignment.
The goal of this assignment is to implement an LASSO Solver using coordinate descent. The following are the sub-tasks:
- First the features are needed to be normalized.
- The coordinate descent for LASSO needs to be implemented (with the subgradient of the L1 penalty).
- The effects of L1 penalty are going to be explored.
The following dataset (few rows and columns are shown in the below table) is from house sales in King County, the region where the city of Seattle, WA is located. This is going to be our input datatset for training the LASSO model, with the response variable as price.
In the house dataset, features vary wildly in their relative magnitude:
sqft_living is very large overall compared to
bedrooms, for instance. As a result, weight for
sqft_living would be much smaller than weight for
bedrooms. This is problematic because “small” weights are dropped first as
l1_penalty goes up.
To give equal considerations for all features, we need to normalize features by dividing each feature by its 2-norm so that the transformed feature has norm 1.
Implementing Coordinate Descent with normalized features
Effect of L1 penalty
Let us consider a simple model with 2 features:
simple_features = ['sqft_living', 'bedrooms']
Single Coordinate Descent Step
Using the formula above, let’s implement coordinate descent that minimizes the cost function over a single feature i. Note that the intercept (weight 0) is not regularized. The function should accept feature matrix, output, current weights, l1 penalty, and index of feature to optimize over. The function should return new weight for feature i.
Cyclical coordinate descent
Now that we have a function that optimizes the cost function over a single coordinate, let us implement cyclical coordinate descent where we optimize coordinates 0, 1, …, (k-1) in order and repeat.
When do we know to stop? Each time we scan all the coordinates (features) once, we measure the change in weight for each coordinate. If no coordinate changes by more than a specified threshold, we stop.
For each iteration:
- As we loop over features in order and perform coordinate descent, measure how much each coordinate changes.
- After the loop, if the maximum change across all coordinates is falls below the tolerance, stop. Otherwise, go back to step 1.
IMPORTANT: when computing a new weight for coordinate i, make sure to incorporate the new weights for coordinates 0, 1, …, i-1. One good way is to update the weights variable in-place.
The following figure shows the coefficient path for LASSO for different values of the L1-penalty using two simple features
Evaluating LASSO fit with more features
Let us consider the following set of features.
all_features = ['bedrooms', 'bathrooms', 'sqft_living', 'sqft_lot', 'floors', 'waterfront', 'view', 'condition', 'grade', 'sqft_above', 'sqft_basement', 'yr_built', 'yr_renovated']
The following figure shows the coefficient path for LASSO for different values of the L1-penalty using all the above features: