Predicting life expectancy for different countries with the GapMinder dataset using Lasso Regression with Python Scikit Learn and R

A lasso regression analysis (with L1 penalty) was conducted to identify the subset of variables, from a pool of 14 quantitative predictor variables, that best predicted a quantitative response variable measuring life expectancy in different countries. The quantitative predictor variables are income per person, alcohol consumption, armed forces rate, breast cancer per 100th, CO2 emissions, female employment rate, HIV rate, internet use rate, oil per person, polity score, relectric per person, suicide per 100th, employment rate and urbanization rate. All predictor variables were standardized (z-score normalized) to have a mean of zero and a standard deviation of one.
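The standardization step can be sketched in Python as follows. This is only a minimal illustration on synthetic data, not the code used for the analysis; the array `X` and its values are made up for the example.

```python
import numpy as np

# Synthetic stand-in for the predictor matrix
# (rows = countries, columns = predictors on very different scales).
rng = np.random.default_rng(0)
X = rng.normal(loc=[50.0, 5.0, 1000.0], scale=[10.0, 2.0, 300.0], size=(191, 3))

# z-score: subtract each column's mean, divide by its standard deviation.
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_std.mean(axis=0).round(6))  # each column mean is numerically zero
print(X_std.std(axis=0).round(6))   # each column standard deviation is one
```

Standardizing matters for lasso because the L1 penalty acts on the size of the coefficients, so predictors measured on larger scales would otherwise be penalized differently.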

After the rows with NA values in the life expectancy variable were removed, the missing values in the numeric predictor columns were imputed with the column medians. The data were then randomly split into a training set that included 70% of the observations (N=133) and a test set that included 30% of the observations (N=58). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model on the training set, and the model was validated using the test set. The change in the cross validation mean squared error at each step was used to identify the best subset of predictor variables.
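The pipeline described above could look roughly like the sketch below with pandas and scikit-learn. It runs on a synthetic data frame rather than the GapMinder file; the column subset, the coefficients used to generate the response and the `random_state` values are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LassoLarsCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dataset (illustrative columns and values).
rng = np.random.default_rng(123)
n = 200
cols = ["incomeperperson", "hivrate", "internetuserate", "employrate"]
data = pd.DataFrame(rng.normal(size=(n, 4)), columns=cols)
data["lifeexpectancy"] = (70 + 4 * data["internetuserate"]
                          - 3 * data["hivrate"]
                          + rng.normal(scale=2.0, size=n))
# Knock out some values to mimic missing data.
data.loc[data.index[:9], "lifeexpectancy"] = np.nan
data.loc[data.index[20:40], "hivrate"] = np.nan

# 1. Drop rows where the response is missing (191 rows remain here).
data = data.dropna(subset=["lifeexpectancy"])

# 2. Impute missing predictor values with the column medians.
predictors = data.drop(columns="lifeexpectancy")
predictors = predictors.fillna(predictors.median())

# 3. Standardize the predictors (mean 0, standard deviation 1).
predictors = (predictors - predictors.mean()) / predictors.std(ddof=0)

# 4. Random 70/30 split into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    predictors, data["lifeexpectancy"], test_size=0.3, random_state=123)

# 5. Lasso fitted with least angle regression and 10-fold cross validation.
model = LassoLarsCV(cv=10, precompute=False).fit(X_train, y_train)

print(X_train.shape, X_test.shape)
print(dict(zip(predictors.columns, model.coef_)))
print("alpha chosen by cross validation:", model.alpha_)
```

With 191 rows and `test_size=0.3`, `train_test_split` yields 133 training and 58 test observations, the same sizes reported above.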

Of the 14 predictor variables, 9 were retained in the selected model. During the estimation process, internetuserate and hivrate were most strongly associated with life expectancy, followed by polityscore and employrate. The variables hivrate and employrate were negatively associated with life expectancy and polityscore and internetuserate were positively associated with life expectancy. Other predictors associated with life expectancy included alcconsumption, incomeperperson, suicideper100th, urbanrate and armedforcesrate. These 9 variables accounted for 74.12% of the variance in the life expectancy response variable.

The model learnt from the training dataset was used to predict the life expectancy for the countries in the test dataset. The mean squared errors on the training and test datasets are shown below; the model could explain ~66.21% of the variance on the held-out unseen dataset. The figures below show the results with Lasso Regression (the coefficients, the change in the validation mean squared error at each step, how the alpha is chosen with cross validation, etc.).

Here are the coefficients learnt; notice that some of the coefficients are shrunk to zero by Lasso.

Coefficients with Lasso
{'alcconsumption': -0.14130650117737306, 'armedforcesrate': 0.35471292244623748, 'breastcancerper100th': 0.0, 'co2emissions': 0.0, 'employrate': -1.2972472147653313, 'femaleemployrate': 0.0, 'hivrate': -3.4897386781551298, 'incomeperperson': 0.077045133952797745, 'internetuserate': 4.4754404806196204, 'oilperperson': 0.0, 'polityscore': 1.3628309963989624, 'relectricperperson': 0.0, 'suicideper100th': -0.49459920657916612, 'urbanrate': 0.83363420775836838}
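As a quick check on the coefficients above, the short sketch below filters out the predictors that Lasso shrank to zero and ranks the rest by absolute coefficient size. The values are copied from the output above and rounded to four decimals; the variable names `lasso_coef`, `retained` and `ranked` are just illustrative.

```python
# Lasso coefficients from the output above, rounded to four decimals.
lasso_coef = {
    "alcconsumption": -0.1413, "armedforcesrate": 0.3547,
    "breastcancerper100th": 0.0, "co2emissions": 0.0,
    "employrate": -1.2972, "femaleemployrate": 0.0,
    "hivrate": -3.4897, "incomeperperson": 0.0770,
    "internetuserate": 4.4754, "oilperperson": 0.0,
    "polityscore": 1.3628, "relectricperperson": 0.0,
    "suicideper100th": -0.4946, "urbanrate": 0.8336,
}

# Keep only the predictors with a non-zero coefficient.
retained = {name: c for name, c in lasso_coef.items() if c != 0.0}

# Rank the retained predictors by the magnitude of their coefficient.
ranked = sorted(retained, key=lambda name: abs(retained[name]), reverse=True)

print(len(retained), "predictors retained")
for name in ranked:
    print(f"{name:>20s} {retained[name]: .4f}")
```

Because the predictors were standardized, the coefficient magnitudes are directly comparable, which is why internetuserate and hivrate come out on top.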

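[Figures: Lasso coefficients, change in the validation mean squared error at each step, and selection of alpha by cross validation]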

training data MSE
22.8123709324

test data MSE
35.8843281685

training data R-square
0.741196286274

test data R-square
0.662066604108
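For reference, the two metrics reported above can be computed as in the following sketch. The helper names `mse` and `r_square` and the toy numbers are illustrative; scikit-learn provides the equivalent `mean_squared_error` and `r2_score` in `sklearn.metrics`.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error: average of the squared residuals."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.mean((y_true - y_pred) ** 2))

def r_square(y_true, y_pred):
    """R-square: 1 - (residual sum of squares / total sum of squares)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

# Toy example: one prediction is off by 1.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
print(mse(y_true, y_pred))       # 0.25
print(r_square(y_true, y_pred))  # 0.8
```

An R-square of 0.662 on the test data therefore means that the squared prediction error is about 34% of what a constant prediction at the test-set mean would give.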

Comparing Lasso with Linear Regression

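As can be seen from the results below, linear regression fits the training data slightly better, but Lasso performs better on the held-out dataset (test MSE of 35.88 versus 38.25), so the Lasso model generalizes better.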

Coefficients with Linear Regression

{'hivrate': -3.8344571892352577, 'co2emissions': 0.09678711461134426, 'oilperperson': 0.035106524584148299, 'urbanrate': 0.86155708172804846, 'internetuserate': 5.0152935091192052, 'armedforcesrate': 0.6126261035223185, 'incomeperperson': 0.46216405180319375, 'polityscore': 2.0123854459607222, 'femaleemployrate': 1.1202101789322687, 'suicideper100th': -0.68776256824454451, 'breastcancerper100th': -1.0993691668696637, 'alcconsumption': -0.83113290898076309, 'employrate': -2.6051934133371044, 'relectricperperson': 0.15236659683560905}

training data MSE
21.7615307335

test data MSE
38.2548762373

training data R-square
0.753117946973

test data R-square
0.639742447577
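A comparison of this kind can be sketched with scikit-learn as below. The data are synthetic (14 predictors, only four of which actually drive the response), so the numbers will not match the results above, and Lasso is not guaranteed to win on every random split; the sketch only shows the mechanics of fitting both models on the same split.

```python
import numpy as np
from sklearn.linear_model import LassoLarsCV, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data: 191 rows, 14 standardized-like predictors,
# only the first four have a non-zero true coefficient (an assumption).
rng = np.random.default_rng(0)
n, p = 191, 14
X = rng.normal(size=(n, p))
true_coef = np.zeros(p)
true_coef[:4] = [4.0, -3.0, 1.5, -1.0]
y = 70 + X @ true_coef + rng.normal(scale=5.0, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

models = {
    "lasso": LassoLarsCV(cv=10, precompute=False),
    "linear": LinearRegression(),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {
        "train_mse": mean_squared_error(y_train, model.predict(X_train)),
        "test_mse": mean_squared_error(y_test, model.predict(X_test)),
        "n_zero_coef": int(np.sum(model.coef_ == 0)),
    }
    print(name, results[name])
```

Ordinary least squares always attains the lowest training MSE among linear models with an intercept, so the interesting comparison is the test MSE, where the shrinkage applied by Lasso can pay off.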

A similar analysis can be done with R; the results obtained are as follows:

Lasso Coefficients (for minimum lambda)

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           7.312000e+01
## incomeperperson       2.205352e-05
## alcconsumption        .           
## armedforcesrate       1.254725e-01
## breastcancerper100th  .           
## co2emissions          6.141528e-11
## femaleemployrate      .           
## hivrate              -8.231187e-01
## internetuserate       1.710453e-01
## oilperperson          .           
## polityscore           1.502006e-01
## relectricperperson    .           
## suicideper100th      -2.410269e-01
## employrate           -1.703687e-01
## urbanrate             5.964472e-02

## [1] "lambda for which the cross validation error is minimum= 0.175158431057511"
## [1] "training data MSE = 25.3246340499158"
## [1] "test data MSE = 36.3629769025651"

Comparing Lasso with Linear Regression

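As can be seen from the results below, linear regression again fits the training data slightly better, but Lasso performs better on the held-out dataset (test MSE of 36.36 versus 43.55), so the Lasso model generalizes better.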

##          (Intercept)      incomeperperson       alcconsumption 
##         7.758254e+01         6.157751e-05         3.639136e-02 
##      armedforcesrate breastcancerper100th         co2emissions 
##         2.823504e-01        -4.576937e-02         7.636361e-11 
##     femaleemployrate              hivrate      internetuserate 
##         1.010807e-01        -9.106386e-01         1.689103e-01 
##         oilperperson          polityscore   relectricperperson 
##         1.180588e-01         2.061751e-01         1.563416e-04 
##      suicideper100th           employrate            urbanrate 
##        -3.063516e-01        -3.117576e-01         6.248395e-02
## [1] "training data MSE 24.6063107607239"
## [1] "test data MSE 43.5461958509038"
