A **lasso regression analysis** (with *L1 penalty*) was conducted to identify a subset of variables from a pool of 14 quantitative predictor variables that best predicted a quantitative response variable measuring *life expectancy* in different countries. The quantitative predictor variables were income per person, alcohol consumption, armed forces rate, breast cancer per 100th, CO2 emissions, female employment rate, HIV rate, internet use rate, oil per person, polity score, residential electricity per person, suicide per 100th, employment rate, and urbanization rate. All predictor variables were standardized (z-score normalized) to have a mean of zero and a standard deviation of one.

After removing the observations with missing values in the life expectancy variable, the remaining missing values in the numeric predictor columns were imputed with the column medians. The data were then randomly split into a training set containing 70% of the observations (N=133) and a test set containing the other 30% (N=58). The least angle regression algorithm with k=10 fold cross-validation was used to estimate the lasso regression model on the training set, and the model was validated on the test set. The change in the cross-validation mean squared error at each step was used to identify the best subset of predictor variables.
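The preprocessing and estimation steps above can be sketched as follows. This is a minimal sketch using scikit-learn's `LassoLarsCV`; the synthetic array, column count, and random seed are assumptions standing in for the original data, not the actual values used in the analysis.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV

rng = np.random.RandomState(123)

# Synthetic stand-in for the 14 quantitative predictors and the response
X_full = rng.randn(191, 14)
y = 70 + 4 * X_full[:, 0] - 3 * X_full[:, 1] + rng.randn(191)

# Sprinkle missing values, then impute each column with its median
X = X_full.copy()
X[rng.rand(*X.shape) < 0.1] = np.nan
medians = np.nanmedian(X, axis=0)
rows, cols = np.where(np.isnan(X))
X[rows, cols] = medians[cols]

# Standardize predictors to mean 0, standard deviation 1
X = StandardScaler().fit_transform(X)

# Random 70/30 train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=123)

# Lasso estimated with the least angle regression (LARS) algorithm,
# with k=10 fold cross-validation to choose the penalty
model = LassoLarsCV(cv=10).fit(X_train, y_train)

# Coefficients keyed by (hypothetical) predictor names
print(dict(zip([f"x{i}" for i in range(14)], model.coef_)))
```

`LassoLarsCV` computes the full LARS path on each fold and picks the penalty that minimizes the cross-validation mean squared error, which is what makes the step-wise CV error curve available for inspection.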

Of the 14 predictor variables, 9 were retained in the selected model. During the estimation process, internetuserate and hivrate were most strongly associated with life expectancy, followed by polityscore and employrate. hivrate and employrate were negatively associated with life expectancy, while polityscore and internetuserate were positively associated with it. Other retained predictors were alcconsumption, incomeperperson, suicideper100th, urbanrate, and armedforcesrate. Together, these 9 variables accounted for 74.12% of the variance in the life expectancy response variable.

The model learned from the training dataset was used to predict life expectancy for the countries in the test dataset. The mean squared errors on the training and test datasets are shown below; the model explained ~66.2% of the variance on the held-out, unseen dataset. The figures below show the lasso regression results (the coefficients, the change in the validation mean squared error at each step, how alpha is chosen with cross-validation, etc.).
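Computing the train and test MSE and R-square values reported below can be sketched like this, again on synthetic stand-in data rather than the original dataset; the metric functions are scikit-learn's standard `mean_squared_error` and `r2_score`.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.RandomState(0)

# Synthetic stand-in data: a few informative predictors plus noise
X = rng.randn(191, 14)
y = 70 + 4 * X[:, 0] - 3 * X[:, 1] + rng.randn(191)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

model = LassoLarsCV(cv=10).fit(X_train, y_train)

# MSE and R-square on the training data and on the held-out test data
train_mse = mean_squared_error(y_train, model.predict(X_train))
test_mse = mean_squared_error(y_test, model.predict(X_test))
train_r2 = r2_score(y_train, model.predict(X_train))
test_r2 = r2_score(y_test, model.predict(X_test))
print(train_mse, test_mse, train_r2, test_r2)
```

A test MSE moderately above the training MSE, as in the results below, is the usual sign of a model that generalizes reasonably without severe overfitting.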

Here are the learned coefficients; notice that some of them are *shrunk* to zero by the lasso.

**Coefficients with Lasso**

```
{'alcconsumption': -0.14130650117737306,
 'armedforcesrate': 0.35471292244623748,
 'breastcancerper100th': 0.0,
 'co2emissions': 0.0,
 'employrate': -1.2972472147653313,
 'femaleemployrate': 0.0,
 'hivrate': -3.4897386781551298,
 'incomeperperson': 0.077045133952797745,
 'internetuserate': 4.4754404806196204,
 'oilperperson': 0.0,
 'polityscore': 1.3628309963989624,
 'relectricperperson': 0.0,
 'suicideper100th': -0.49459920657916612,
 'urbanrate': 0.83363420775836838}
```

**training data MSE**
22.8123709324

**test data MSE**
35.8843281685

**training data R-square**
0.741196286274

**test data R-square**
0.662066604108

**Comparing Lasso with Linear Regression**

As can be seen from the results below, the lasso performs better on the held-out dataset; the model is more generalizable.
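The comparison can be sketched by fitting both models on the same split and scoring them on the held-out data. This is a hedged illustration on synthetic data; the OLS model here is scikit-learn's `LinearRegression`, which keeps all 14 coefficients, whereas the lasso shrinks some exactly to zero.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LassoLarsCV, LinearRegression
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(42)

# Only a few predictors carry signal, as in the lasso result above
X = rng.randn(191, 14)
y = 70 + 4 * X[:, 0] - 3 * X[:, 1] + rng.randn(191)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)

lasso = LassoLarsCV(cv=10).fit(X_train, y_train)
ols = LinearRegression().fit(X_train, y_train)

# OLS fits every predictor, including the pure-noise ones; the lasso's
# shrinkage trades a little training fit for lower test-set variance
lasso_mse = mean_squared_error(y_test, lasso.predict(X_test))
ols_mse = mean_squared_error(y_test, ols.predict(X_test))
print(lasso_mse, ols_mse)
```

With many noise predictors, the lasso's test MSE is typically at or below the OLS test MSE, mirroring the pattern in the numbers below.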

**Coefficients with Linear Regression**

```
{'hivrate': -3.8344571892352577,
 'co2emissions': 0.09678711461134426,
 'oilperperson': 0.035106524584148299,
 'urbanrate': 0.86155708172804846,
 'internetuserate': 5.0152935091192052,
 'armedforcesrate': 0.6126261035223185,
 'incomeperperson': 0.46216405180319375,
 'polityscore': 2.0123854459607222,
 'femaleemployrate': 1.1202101789322687,
 'suicideper100th': -0.68776256824454451,
 'breastcancerper100th': -1.0993691668696637,
 'alcconsumption': -0.83113290898076309,
 'employrate': -2.6051934133371044,
 'relectricperperson': 0.15236659683560905}
```

**training data MSE**
21.7615307335

**test data MSE**
38.2548762373

**training data R-square**
0.753117946973

**test data R-square**
0.639742447577

A similar analysis can be done with **R**; the results obtained are as follows:

### Lasso Coefficients (for minimum lambda)

```
## 15 x 1 sparse Matrix of class "dgCMatrix"
## 1
## (Intercept) 7.312000e+01
## incomeperperson 2.205352e-05
## alcconsumption .
## armedforcesrate 1.254725e-01
## breastcancerper100th .
## co2emissions 6.141528e-11
## femaleemployrate .
## hivrate -8.231187e-01
## internetuserate 1.710453e-01
## oilperperson .
## polityscore 1.502006e-01
## relectricperperson .
## suicideper100th -2.410269e-01
## employrate -1.703687e-01
## urbanrate 5.964472e-02
```

`## [1] "lambda for which the cross validation error is minimum= 0.175158431057511"`

`## [1] "training data MSE = 25.3246340499158"`

`## [1] "test data MSE = 36.3629769025651"`

## Comparing Lasso with Linear Regression

As can be seen from the results below, the lasso performs better on the held-out dataset; the model is more generalizable.

```
## (Intercept) incomeperperson alcconsumption
## 7.758254e+01 6.157751e-05 3.639136e-02
## armedforcesrate breastcancerper100th co2emissions
## 2.823504e-01 -4.576937e-02 7.636361e-11
## femaleemployrate hivrate internetuserate
## 1.010807e-01 -9.106386e-01 1.689103e-01
## oilperperson polityscore relectricperperson
## 1.180588e-01 2.061751e-01 1.563416e-04
## suicideper100th employrate urbanrate
## -3.063516e-01 -3.117576e-01 6.248395e-02
```

`## [1] "training data MSE 24.6063107607239"`

`## [1] "test data MSE 43.5461958509038"`