# Using PCA to represent digits in the eigen-digits space

In this article, the handwritten digits dataset (mnist_train) is going to be used to visualize and demonstrate how Principal Component Analysis can be used to represent the digits in the low dimensional feature space as a linear combination of the principal components as orthonormal basis vectors.

• Here only the digit 8 is considered for the analysis.
• The data contains 477 handwritten images of the digit 8 and each digit is a 28x28 image which is stored as a row of length 784.
• Below are the original images of 8 in the dataset.

• Next PCA is done on the dataset and the variance explained by the first few principal components are shown below.

Percentage variance explained by the first few principal components

##  [1] 57.87 63.76 66.76 69.38 71.32 72.95 74.32 75.55 76.69 77.79 78.87
## [12] 79.87 80.72 81.50 82.25 82.95 83.60 84.18 84.74 85.26 85.77 86.26
## [23] 86.73 87.16 87.55 87.92 88.29 88.64 88.97 89.31 89.62 89.92 90.20
## [34] 90.47 90.72 90.97 91.20 91.43 91.64 91.84 92.03 92.22 92.41 92.59
## [45] 92.76 92.93 93.08 93.24 93.39 93.54 93.69 93.83 93.97 94.10 94.22
## [56] 94.35 94.47 94.58 94.69 94.81 94.91 95.02 95.12 95.22 95.31 95.41
## [67] 95.50 95.59 95.67 95.76 95.84 95.92 96.00 96.08 96.15
• The Principal components (the eigenvectors) are then visualized. Since they (particularly the dominant ones) look like 8 we can name this orthonormal basis vectors in the rotated space as eigen digits. They typically capture different features of the handwritten digit 88, depending on different writing styles.
• The first 210 eigen-digits are visualized (stacked in 15 rows and 14 columns), in the column-major order, with the first eigenvector is the top left one, the second one being the next one below the first one in the same column and so on. As can be seen, the first few dominant eigenvectors clearly represents the digit 8 and hence the name.

• Next let’s just focus on one single data point (one particular handwritten digit image) and try to see how it can be represented in the eigen-digits space, as linear combination of the eigen-digits. The next figure shows the one we shall now try to represent in the feature space.

## [1] "Weights for the first 20 basis vectors for the digit"
##  [1] -1905.408055  1355.563894   564.485519   -78.098594   747.401692
##  [6]   116.314375  -293.701813   249.330045   158.920772  -258.179455
## [11]  -659.294321  -322.247876   215.188713  -341.871715     4.276662
## [16]   279.882382   301.395575   123.411382    67.083178  -285.374166

• The below figures show how the digit image looks like when it’s represented as linear comibination of only the first k eigen vectors (eigen digits) only, where
k=1,2,…,75 in the same column major order.
• As expected, the approximate representation of the digit looks like the actual representation shown above when it’s expressed as linear combination of first few eigenvectors with the weights computed.
• The approximate representation of the digit resembles the original digit as more and more basis vectors are included in its representation.

• The representation of the original digit in the eigen-space using first k eigenvectors (k=1,2,..,784) is shown in the following animation.
• The next figure shows how the error (computed as Frobenius norm of the difference of the original and the approximated digit image, expressed as the linear combination of the first few eigenvectors in the PCA space). As expected again, the figure shows that the error decreases very fast.

# GapMinder Dataset: Exploratory Analysis

Three numeric variables were selected: lifeexpectancy, employrate,internetuserate. First a subset of the dataset was created, since we were interested primarily in the countries with lifeexpectancy less than equal to 70. All the exploratory analysis were done on this subset.

Since the variables selected are continuous, each variable was binned (grouped) into a few bins (e.g., employrate and lifexpectancy variables were binned into 4 equal-depth groups whereas the internetuserate variable was binned into 3 equal-width groups). For each of the variables, the frequency tables were computed and the missing values were coded out. We tried to understand how the lifexpectancy varied across the countries in different groups of employrate and internetuserate.

The frequency distributions of the managed (grouped) variables are shown below. As can be seen, number of countries with (low) employrate less than 56 is 19, whereas number of countries with (high) employrate greater than 70 is around 18. Also, it shows that there were 3 countries with NAN values for employ rate in the subset.

The below heatmap of the crostab shows that there are relatively high number of countries with low intermetuserate but high employrate!

The scatter plot of the variables employrate vs. internetuserate above shows that for the countries with lifeexpectancy > 60 we have higher internetuserate in general. The box plots below show that the lifeexpectancy is highest on average for the countries with second lowest employrate but for those with the highest interntuserate.

# Finding the most important predictors for life expectancy and making predictions with the GapMinder dataset with Random Forest and ExtraTree Forest Ensemble Classifiers using Python Scikit Learn and Pandas

Ensemble learning using Random Forest and ExtraTree Forest were performed to evaluate the importance of a series of explanatory variables in predicting a binary, categorical response variable with the GamMinder dataset.

The following explanatory variables were included as possible contributors to a classification tree model evaluating life expectancy (my response variable, which is a continuous numeric variable but was binned into 2 categories: (0-60] and (60-100]), income per person, alcohol consumption, armed forces rate, breast cancer per 100th, co2 emissions, female employment rate, hiv rate, internet use rate, oil per person, polity score, relectric per person, suicide per 100th, employment rate, urbanization rate.

The following figure shows relations in between some of the predictors used and some exploratory visualizations:

After removal of the NA values in the life expectancy variable, the predictor variables in the original dataset were imputed, the missing values in the numeric columns were replaced with median values. Then the dataset was divided (by taking a random sample of size 60% of the entire dataset) into training dataset with 114 data tuples and test dataset with 77 data tuples. Then the Random Forest and Extra Tree classifiers were trained on the trainign dataset and the models were used to predict on the test dataset. Ans ensemble of 25 decision trees were used to build the random forest predictor and gini index measure was used for the best feature selection at each round for the decision trees.

As can be seen from the 3 most important predictors selected by the ExtraTree Forest model were: hiv rate, income per person and internet user rate.

The model learnt from the training dataset was used to predict the life expectancy for the countries in the test dataset. The confusion matrix (contingency table) on the test dataset is shown below, which shows that we obtained ~90.9% accuracy on the held-out unseen dataset.

# Using One-way Analysis of Variance with R and Python to find the Association between quantitative response variable Life expectancy and the converted categorical explanatory variable Income per person / Alcohol consumption in the GapMinder Dataset

Model Interpretation for ANOVA:

When examining the association between the life expectancy in number of years (quantitative response) and the variable income per person (which is the GDP per capita in constant 2000 US\$) categorized into 2 ordered categories (if income per person is in between (0, 2385], it’s low, otherwise it’shigh, where 2385 is approximately the median value of the variable, splitting around which we got categorical explanatory variable) for different countries from the Gapminder dataset, a (one-way) Analysis of Variance (ANOVA) revealed that among the countries with high (2385-52302] income per person, reported to have significantly more life expectancy (Mean=75.74 s.d. ±6.08) compared to the countries with low (0-2385] income per person (Mean=63.57, s.d. ±8.86), F(1, 174)=113.0, p = 1.8 x 10^(-20) .

Note that the degrees of freedom that I report in parentheses) following ‘F’ can be found in the OLS table as the DF model and DF residuals. In this example 113.0 is the actual F value from the OLS table and we commonly report a very very small p value as simply = 1.8 x 10^(-20).

The results from python are shown below.

The following are the same results with R.

Model Interpretation for post hoc ANOVA results:

When examining the association between the life expectancy in number of years (quantitative response) and another explanatory variable alcohol consumption (avg in litres) categorized into 4 ordered categories (splitting around the quartiles we got categorical explanatory variable with 4 levels (0,3], (3-6], (6-10], (10-25]) for different countries from the same dataset, (one-way) ANOVA revealed that among daily, the life expectancy (quantitative response variable) and alcohol consumption were significantly associated, F (3, 172)=8.927, p=1.57×10^-5.

Post hoc comparisons of the alcohol consumption by pairs of categories revealed that the countries with alcohol consumption level (10,25] (group 1) reported significantly more life expectancy compared to those with level (0,3] (group 0). Similarly, the countries with alcohol consumption level (10,25] reported significantly more life expectancy compared to those with level (3,6]. And the countries with alcohol consumption level (6,10] reported significantly more life expectancy compared to those with level (3,6].   All other comparisons were statistically similar.

The results from python are shown below.

The following are the same results with R.

# Predicting life expectancy for different countries with the GapMinder dataset using Lasso Regression with Python Scikit Learn and R

A lasso regression analysis (with L1 penalty) was conducted to identify a subset of variables from a pool of 14 quantitative predictor variables that best predicted a quantitative response variable measuring the life expectancy in different countries. Quantitative predictor variables include income per person, alcohol consumption, armed forces rate, breast cancer per 100th, co2 emissions, female employment rate, hiv rate, internet use rate, oil per person, polity score, relectric per person, suicide per 100th, employment rate, urbanization rate. All predictor variables were standardized (z-score normalized) to have a mean of zero and a standard deviation of one.

After removal of the NA values in the life expectancy variable, the predictor variables in the original dataset were imputed, the missing values in the numeric columns were replaced with median values. Then the data were randomly split into a training set that included 70% of the observations (N=133) and a test set that included 30% of the observations (N=58). The least angle regression algorithm with k=10 fold cross validation was used to estimate the lasso regression model in the training set, and the model was validated using the test set. The change in the cross validation average (mean) squared error at each step was used to identify the best subset of predictor variables.

Of the 14 predictor variables, 9 were retained in the selected model. During the estimation process, internetuserate and hivrate were most strongly associated with life expectancy, followed by polityscore and employrate. The variables hivrate and employrate were negatively associated with life expectancy and polityscore and internetuserate were positively associated with life expectancy. Other predictors associated with life expectancy included alcconsumption, incomeperperson, suicideper100th, urbanrate and armedforcesrate. These 9 variables accounted for 74.12% of the variance in the life expectancy response variable.

The model learnt from the training dataset was used to predict the life expectancy for the countries in the test dataset. The mean square error on the train and test dataset are shown below, the model could explain ~66.12% variance on the held-out unseen dataset. The figures below show the results (the coefficients, the Change in the validation mean square error at each step, how the alpha is chosen with cross validation etc.) with Lasso Regression.

Here are the coefficients learnt, notice that some of the coefficients are shrunk to zero by Lasso.

Coefficients with Lasso
{‘alcconsumption’: -0.14130650117737306, ‘armedforcesrate’: 0.35471292244623748, ‘breastcancerper100th’: 0.0, ‘co2emissions’: 0.0, ’employrate’: -1.2972472147653313, ‘femaleemployrate’: 0.0, ‘hivrate’: -3.4897386781551298, ‘incomeperperson’: 0.077045133952797745, ‘internetuserate’: 4.4754404806196204, ‘oilperperson’: 0.0, ‘polityscore’: 1.3628309963989624, ‘relectricperperson’: 0.0, ‘suicideper100th’: -0.49459920657916612, ‘urbanrate’: 0.83363420775836838}

training data MSE
22.8123709324

test data MSE
35.8843281685

training data R-square
0.741196286274

test data R-square
0.662066604108

Comparing Lasso with Linear Regression

As can be seen from the results below, is performing better on the held-out dataset, the model is more generalizable.

Coefficients with Linear Regression

{‘hivrate’: -3.8344571892352577, ‘co2emissions’: 0.09678711461134426, ‘oilperperson’: 0.035106524584148299, ‘urbanrate’: 0.86155708172804846, ‘internetuserate’: 5.0152935091192052, ‘armedforcesrate’: 0.6126261035223185, ‘incomeperperson’: 0.46216405180319375, ‘polityscore’: 2.0123854459607222, ‘femaleemployrate’: 1.1202101789322687, ‘suicideper100th’: -0.68776256824454451, ‘breastcancerper100th’: -1.0993691668696637, ‘alcconsumption’: -0.83113290898076309, ’employrate’: -2.6051934133371044, ‘relectricperperson’: 0.15236659683560905}

training data MSE
21.7615307335

test data MSE
38.2548762373

training data R-square
0.753117946973

test data R-square
0.639742447577

Similar analysis can be done with R, the results obtained are as follows:

### Lasso Coefficients (for minimum lambda)

## 15 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)           7.312000e+01
## incomeperperson       2.205352e-05
## alcconsumption        .
## armedforcesrate       1.254725e-01
## breastcancerper100th  .
## co2emissions          6.141528e-11
## femaleemployrate      .
## hivrate              -8.231187e-01
## internetuserate       1.710453e-01
## oilperperson          .
## polityscore           1.502006e-01
## relectricperperson    .
## suicideper100th      -2.410269e-01
## employrate           -1.703687e-01
## urbanrate             5.964472e-02

## [1] "lambda for which the cross validation error is minimum= 0.175158431057511"
## [1] "training data MSE = 25.3246340499158"
## [1] "test data MSE = 36.3629769025651"

## Comparing Lasso with Linear Regression

As can be seen from the results below, is performing better on the held-out dataset, the model is more generalizable.

##          (Intercept)      incomeperperson       alcconsumption
##         7.758254e+01         6.157751e-05         3.639136e-02
##      armedforcesrate breastcancerper100th         co2emissions
##         2.823504e-01        -4.576937e-02         7.636361e-11
##     femaleemployrate              hivrate      internetuserate
##         1.010807e-01        -9.106386e-01         1.689103e-01
##         oilperperson          polityscore   relectricperperson
##         1.180588e-01         2.061751e-01         1.563416e-04
##      suicideper100th           employrate            urbanrate
##        -3.063516e-01        -3.117576e-01         6.248395e-02
## [1] "training data MSE 24.6063107607239"
## [1] "test data MSE 43.5461958509038"

# Using PCA to Detect Outliers in Images

In this article, the Principal Component Analysis will be used to find the outliers in images. PCA can be interpreted in the following ways:

1. The principal components found in PCA captures the directions with highest variance in data (maximize the variance of projection along each component).
2. The principal components minimize the reconstruction error (i.e., the squared distance between the original data and its estimate, by projecting the data on the first few principal compnents).

Since most of the time, the first few principal components explain almost all of the variance in the data, the above interpretations lead to the intuition that the data points that are not explained well by the first few principal components are probably the ones that are noisy.

• Colored images can be viewed as a collection of three dimensional data points (e.g., in RGB space), where each data point (pixel) has 3 dimensions.
• PCA will be used on an image dataset II and depending on how much variance the first two principal components explain (usingscreeplot) only the first or the first two PC vector(s) are chosen to represent the image data in lower dimension.
• The projection is computed in the usual manner: I^=I.V1:k.VT1:kI^=I.V1:k.V1:kT, where V1:kV1:k are the first kk orthonormal principal component vectors, where k{1,2}k∈{1,2}.
• The reconstruction error ||II^||2||I−I^||2will be used to score each of the data points.
• The ones with the higher scores (e.g., > 90 percentile) will be marked as the outliers.
• Pixmap library is used to convert extract the color channels from an image.
• As can be seen from the following results, for each of the images the data points (pixels) with score >90>90 quantile value of the scores are marked as outliers.
• The outliers are marked with black patches.
• The 8585, 9090 and 9595 quantiles are shown with red dashed lines.
• In the first image the orange is detected as an outlier among the apples, in the second image the magenta patterns are detected as outliers in the space and in the third image some features of my face (eyebrows, hair, nostrill) are detected as outliers.

## [1] "------------------------------------------"
## [1] "% Variance explained by upto k (1,2,3) PCs"
## [1] "------------------------------------------"
## [1]  94.79  98.85 100.00

## [1] "----------------------------------------------"
## [1] "85%, 90% and 95% quantile values of the scores"
## [1] "----------------------------------------------"
##       85%       90%       95%
## 0.1311154 0.1864878 0.2558950

## [1] "------------------------------------------"
## [1] "% Variance explained by upto k (1,2,3) PCs"
## [1] "------------------------------------------"
## [1]  99.47  99.99 100.00

## [1] "----------------------------------------------"
## [1] "85%, 90% and 95% quantile values of the scores"
## [1] "----------------------------------------------"
##         85%         90%         95%
## 0.006053905 0.010789753 0.017195377

## [1] "------------------------------------------"
## [1] "% Variance explained by upto k (1,2,3) PCs"
## [1] "------------------------------------------"
## [1]  99.48  99.90 100.00
## [1] "----------------------------------------------"
## [1] "85%, 90% and 95% quantile values of the scores"
## [1] "----------------------------------------------"
##        85%        90%        95%
## 0.03542360 0.04108505 0.05129503

• The following animation shows the pixels detected as outliers with different threshold outlier scores.
• Yet another animation for outlier detection with PCA based score.

# Crime Analytics: Visualization of Crime Incident Reports for Summar 2014 in San Francisco and Seattle

1. In this assignment, some exploratory analysis is done on the criminal incident data from Seattle and San Francisco to visualize patterns and contrast and compare patterns across the two cities.
2. Data used: The real crime dataset from Summer (June-Aug) 2014 for both of two US cities Seattle and San Francisco has been used for the analysis. The datasets used for analysis are reduced forms of publicly-available datasets.
3. Each figure (supporting the analysis) has one / few of the descriptions / takeaways / insights on top of it (as header), although the purpose of each analysis is self-explanatory.
4. Please Zoom (Ctrl+ in a browser) if required, to view the figures (e.g., read the axis labels) properly.
5. All of the analysis / visualization was done using R and with the library graphic grammar plot and should be easily reproducible.