Some Machine Learning with Python (Contd.)

In this article, scikit-learn (Python) implementations of a few machine learning problems will be discussed. All of them appeared as lab exercises in the edX course Microsoft: DAT210x Programming with Python for Data Science, and the problem descriptions are taken straight from the course itself.

SVM vs. KNN

In this exercise, a support vector machine classifier will be used to classify UCI's wheat-seeds dataset.

  1. First, let's benchmark how long the SVM takes to train and predict (using SVC) relative to how long K-Neighbors takes to train and test.
  2. Then, compare the decision boundary plots produced by the two classifiers on the wheat dataset.

The following table shows the first few rows of the entire dataset.

     area  perimeter  compactness  length  width  asymmetry  groove  wheat_type
0   15.26      14.84       0.8710   5.763  3.312      2.221   5.220        kama
1   14.88      14.57       0.8811   5.554  3.333      1.018   4.956        kama
2   14.29      14.09       0.9050   5.291  3.337      2.699   4.825        kama
3   13.84      13.94       0.8955   5.324  3.379      2.259   4.805        kama
4   16.14      14.99       0.9034   5.658  3.562      1.355   5.175        kama

As usual, the entire dataset is divided into two parts: 70% for training and 30% for testing. The first few rows of the training dataset are shown below, followed by a minimal sketch of the split.

      area  perimeter  compactness  length  width  asymmetry  groove
61   11.23      12.63       0.8840   4.902  2.879      2.269   4.703
116  18.96      16.20       0.9077   6.051  3.897      4.334   5.750
154  11.36      13.05       0.8382   5.175  2.755      4.048   5.263
38   14.80      14.52       0.8823   5.656  3.288      3.112   5.309
194  12.11      13.27       0.8639   5.236  2.975      4.132   5.012
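
A minimal sketch of the loading and splitting step (the filename wheat.data, the index column, and the random seed are assumptions, not necessarily the course's exact settings):

import pandas as pd
from sklearn.model_selection import train_test_split

# Load the wheat-seeds dataset; drop rows with missing values
df = pd.read_csv('wheat.data', index_col=0).dropna()
X = df.drop(columns=['wheat_type'])   # the 7 numeric features
y = df['wheat_type']                  # kama / canadian / rosa labels

# 70% training / 30% test, as described above
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=7)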

The next figure and the results show the performance of the k-nearest-neighbors classifier in terms of training / prediction time and prediction accuracy (the benchmarking loop is sketched after the list), when:

  1. All the features are used for training and prediction.
  2. Only 2 features at a time are used for training and prediction.
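
A sketch of the benchmarking loop (the replication count matches the results below; n_neighbors=5 is an assumption):

import time
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=5)

# Time 5000 repeated fits on the training set
start = time.time()
for _ in range(5000):
    knn.fit(X_train, y_train)
print('5000 Replications Training Time: ', time.time() - start)

# Time 5000 repeated scorings on the test set
start = time.time()
for _ in range(5000):
    score = knn.score(X_test, y_test)
print('5000 Replications Scoring Time: ', time.time() - start)

print('High-Dimensionality Score: ', round(score * 100, 3))
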
KNeighbors Results
5000 Replications Training Time:  1.61899995804
5000 Replications Scoring Time:  2.78100013733
High-Dimensionality Score:  83.607
Max 2D Score:  90.164

knn.png

Again, the next figure and the results show the performance of the SVM classifier (with linear kernel and slack C=1) in terms of training / prediction time and prediction accuracy, when:

  1. All the features are used for training and prediction.
  2. Only 2 features at a time are used for training and prediction.

As can be seen, both accuracies (with all the features, and the maximum accuracy obtained with any 2 features) are higher in the case of SVM. Also, as expected, training is slower for SVM than for KNN, but prediction is faster. The benchmark, sketched below, differs from the KNN one only in the classifier used.
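
import time
from sklearn.svm import SVC

# Linear-kernel SVC with slack C=1, timed exactly like KNN above
svc = SVC(kernel='linear', C=1)

start = time.time()
for _ in range(5000):
    svc.fit(X_train, y_train)
print('5000 Replications Training Time: ', time.time() - start)

start = time.time()
for _ in range(5000):
    score = svc.score(X_test, y_test)
print('5000 Replications Scoring Time: ', time.time() - start)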

SVC Results
5000 Replications Training Time:  3.09300017357
5000 Replications Scoring Time:  1.27099990845
High-Dimensionality Score:  86.885
Max 2D Score:  93.443

svm.png

The following heatmaps show the accuracies of the two classifiers using different pairs of features; a sketch of the underlying per-pair computation follows.
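
A sketch of how the per-pair accuracies behind these heatmaps might be computed (shown for KNN; the heatmap plotting itself is omitted):

from itertools import combinations
from sklearn.neighbors import KNeighborsClassifier

# Evaluate every pair of the 7 features
scores = {}
for f1, f2 in combinations(X_train.columns, 2):
    model = KNeighborsClassifier(n_neighbors=5)
    model.fit(X_train[[f1, f2]], y_train)
    scores[(f1, f2)] = model.score(X_test[[f1, f2]], y_test) * 100

best = max(scores, key=scores.get)
print('Max 2D Score: ', round(scores[best], 3), 'using features', best)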

Accuracy (%) of KNN with 2D features

knn1.png

Accuracy (%) of SVM (with linear kernel and slack C=1) with 2D features

svm1.png

Accuracy (%) of SVM (with different kernels and slack variable values) with all features

C         0.001    0.01     0.1     1.0    10.0   100.0  1000.0
kernel
linear   57.377  86.885  85.246  86.885  91.803  95.082  93.443
poly     88.525  93.443  90.164  90.164  93.443  93.443  93.443
rbf      29.508  29.508  86.885  86.885  85.246  88.525  86.885
svm2.png

Training time (with replications) for SVM in seconds (with different kernels and slack variable values) with all features

C         0.001    0.01     0.1      1.0     10.0    100.0   1000.0
kernel
linear    3.179   2.636   1.942    2.855    6.327   20.649   28.705
poly      3.828   9.994  31.531   65.456  131.638  132.218  130.544
rbf       6.322   5.804   4.543    2.992    3.257    3.212    3.724

svm3.png

Test time (with replications) for SVM in seconds (with different kernels and slack variable values) with all features

C         0.001    0.01     0.1     1.0    10.0   100.0  1000.0
kernel
linear    1.764   1.465   1.376   1.264   1.245   1.359   1.192
poly      1.285   1.246   1.255   1.254   1.233   1.226   1.236
rbf       2.336   2.641   2.084   1.683   1.769   1.457   1.454

svm3.png
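
The three tables above could be produced with a nested grid like the following sketch (only the accuracy grid is shown; the training and scoring times come from the same 5000-replication loops used earlier, and gamma is left at its default, an assumption):

import pandas as pd
from sklearn.svm import SVC

kernels = ['linear', 'poly', 'rbf']
Cs = [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0]

accuracy = pd.DataFrame(index=kernels, columns=Cs, dtype=float)
accuracy.index.name, accuracy.columns.name = 'kernel', 'C'

# Fit one SVC per (kernel, C) combination and record test accuracy
for k in kernels:
    for C in Cs:
        model = SVC(kernel=k, C=C)
        model.fit(X_train, y_train)
        accuracy.loc[k, C] = model.score(X_test, y_test) * 100

print(accuracy.round(3))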

Handwritten-Digits Classification with SVM

Even though the United States Postal Service, as an organization, was formed in 1971, it traces its roots back to the Post Office Department, formed in 1792 (Benjamin Franklin had served as the young nation's first Postmaster General). The department was elevated to cabinet level in 1872, before finally being transformed into the USPS we know today in 1971, as an agency of the U.S. government.

Back in the day, all mail was read and delivered by hand. Even up to the turn of the 20th century, antiquated techniques such as the pigeonhole method from colonial times were used for mail handling. During the 1950s, the post office started intense research on the coding systems used in many other countries and started down the path of automation. In 1982, the first computer-driven OCR machine was installed in Los Angeles, and by the end of 1984, over 250 OCR machines were installed in 118 major mail processing centers across the country, processing an average of 6,200 pieces of mail per hour.

im1.png

Nowadays, the Postal Service is one of the world leaders in optical character recognition technology, with machines reading nearly 98 percent of all hand-addressed letter mail and 99.5 percent of machine-printed mail, and with single tray-sorting machines capable of sorting more than 18 million trays of mail per day.

Let's train a support vector classifier in a few seconds using machine learning, compute the classification accuracy, and compare it with the advertised USPS stats. For this lab, we shall use the Optical Recognition of Handwritten Digits dataset, provided courtesy of UCI's Machine Learning Repository.

Train your SVC classifier with the parameters provided, and keep testing until you’re able to beat the classification abilities of the USPS.

Remember how important having a lot of samples is for machine learning? Try tossing out 96% of your samples and see how it affects the accuracy of your highest-accuracy support vector classifier.

Here are a few digits extracted from the training dataset.

im2.png

The best hyper-parameters for the SVM model, learnt with grid-search cross-validation:

Training SVC Classifier...
Best parameters: {'kernel': 'linear', 'C': 0.01, 'gamma': 1e-06}
Best (mean) score: 0.915032679739
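
A minimal sketch of a grid search that could produce such output (the exact parameter grid and cv=5 are assumptions; only the best values shown above are confirmed by the results):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Candidate values for kernel, slack C, and gamma (assumed grid)
param_grid = {
    'kernel': ['linear', 'poly', 'rbf'],
    'C': [0.01, 0.1, 1.0, 10.0],
    'gamma': [1e-6, 1e-4, 1e-2],
}

search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_train, y_train)

print('Best parameters:', search.best_params_)
print('Best (mean) score:', search.best_score_)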

The following heatmaps show the grid-search cross-validation accuracy on the validation dataset for the SVM classifier with different kernels:

iml.png
imp.png
imr.png
On the test dataset, however, the model does not achieve a very high score:

Scoring SVC Classifier... Score: 0.854120267261

The following figure shows a few digits picked from the test dataset and classified with the best-tuned SVM model. The ground-truth labels are drawn on top of each digit; the labels of correctly predicted digits are colored green, while those of wrongly predicted ones are colored red.

im3.png

Finally, the 1000th test image is predicted with the model (sketched below); the figure that follows shows the image along with the prediction result.
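
A sketch of the single-image prediction (assuming X_test and y_test are NumPy arrays and search is the fitted grid search from above):

# Predict the 1000th test sample and compare against its label
model = search.best_estimator_
print('1000th test label: ', y_test[1000])
print('1000th test prediction: ', model.predict(X_test[1000].reshape(1, -1)))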

1000th test label:  4
1000th test prediction:  [4]

im4.png
