Comparing Spectral partitioning / clustering (with Normalized Graph Laplacian) with KMeans Clustering in R

In this article, the clustering output results of Spectral clustering (with normalized Laplacian) is going to be compared with KMeans on a few shape datasets. The following figures taken from the slides from the Coursera Course: Mining Massive Datasets by Stanford University describe the basic concept of spectral clustering and the spectral partitioning algorithm.

1. The spectral clustering algorithm has many variants. In this article, The following algorithm (by Ng. et al) is going to be used for spectral clustering (with normalized Laplacian).
2. The following figures / animations show the spectral clustering results with different Gaussian similarity kernel bandwidth parameter values on different shape datasets and also the comparison with the results obtained using the KMeans counterpart.

3. Finally, both KMeans and Spectral clustering (with bandwidth=0.1) algorithms are applied on an apples and oranges image to segment the image into 2 clusters. As can be seen from the next figure, the spectral clustering (by clustering the first two dominant eigenvectors of the Normalized Laplacian) could separate out the orange from the apples, while the simple KMeans on the RGB channels could not.

4. The following simpler spectral partitioning approach (thresholding on the second dominant eigenvector) can also be applied for automatic separation of the foreground from the background.

The following figure shows the result of spectral partitioning for automatic
separation of foreground from the background on another image.

Image clustering with GMM-EM soft clustering in R

In this article, the GMM-EM soft clustering technique will be used to cluster a bunch of images. The average RGB value will be used to represent an image (the input images consist of a few sky-cloud images, sunset images, forest images or sea images, each category has a different range of average RGB values, so this technique is supposed to work).

This problem has appeared as a python assignment problem in the Coursera Course ML Clustering and Retrieval (by UW).

1. The following images shown are used for the GMM EM clustering.
2. The EM algorithm shown in the following figure is used for soft clustering.
3. The next animation / figure shows how the GMM-EM gradually converged with iteration along with the increase in log-likelihood per iteration. The x and y axes represent the R channel and the G channel corresponding to a pixel’s RGB value and each point represents an image’s average RGB value.
4. The next figures show the 4 image clusters obtained. Each cluster is marked by the convex hull containing the images belonging to the cluster.
5. As expected, the sky-cloud images are all assigned to one cluster with high probability, whereas the green forests, sea images and sunset images form the remaining 3 clusters respectively.

Using Bayesian Kalman Filter to predict positions of moving particles / objects in 2D (in R)

In this article, we shall see how the Bayesian Kalman Filter can be used to predict positions of some moving particles / objects in 2D.

This article is inspired by a programming assignment from the coursera course Robotics Learning by University of Pennsylvania, where the goal was to implement a Kalman filter for ball tracking in 2D space.  Some part of the problem description is taken from the assignment.

• The following equations / algorithms are going to be used to compute the Bayesian state updates for the Kalman Filter.

• For the first set of experiments, a few 2D Brownian Motion like movements are simulated for a particle.
• The noisy position measurements of the particle are captured at different time instants (by adding random Gaussian noise).
• Next, Kalman Filter is used to predict the particle’s position at different time instants, assuming different position, velocity and measurement uncertainty parameter values.
• Both the actual trajectory and KF-predicted trajectory of the particle are shown in the following figures / animations.
• The positional uncertainty (as 2D-Gaussian distribution) assumed by the Kalman Filter is also shown as gray / black contour (for different values of uncertainties).

• The next set of figures / animations show how the position of a moving bug is tracked using Kalman Filter.
• First the noisy measurements of the positions of the bug are obtained at different time instants.
• Next the correction steps and then the prediction steps are applied, after assuming some uncertainties in position, velocity and the measurements with some Gaussian distributions.

• The position of the bug as shown in the figure above is moving in the x and y direction randomly in a grid defined by the rectangle [-100,100]x[-100,100].
• The next figure shows how different iterations for the Kalman Filter predicts and corrects the noisy position observations. The uncertain motion model p(x_t|x_{t-1}) increases the spread of the contour.  We observe a noisy position estimate z_t. The contour of the corrected position p(x_t) has less spread than both the observation p(z_t|x_t) and the motion p(x_t|x_{t-1})  adjusted state.

• Next the GPS dataset from the UCI Machine Learning Repository is used to get the geospatial positions of some vehicles at different times.
• Again some noisy measurement is simulated by adding random noise to the original data.
• Then the Kalman Filter is again used to predict the vehicle’s position at different time instants, assuming different position, velocity and measurement uncertainties.
• The position and measurement uncertainties (σ_p,  σ_m) are in terms of latitude / longitude values, where uncertainty in the motion model is σ_v.
• Both the actual trajectory and KF-predicted trajectory of the vehicle are shown in the following figures / animations.
• As expected, the more the uncertainties in the position / motion model, the more the actual trajectory differs from the KF-predicted one.

Using Linear PCA, Kernel PCA (with Gaussian Kernel) and Isomap for dimensionality reduction in Python

In this article, the linear PCA, the kernel PCA and the Isomap algorithms will be applied on a few datasets, to show whether the structure of the data in higher dimensions are preserved in the lower dimension or not for the methods. The scikit learn implementations will be used.

For Isomap, the original dataset from Joshua Tenenbaum, the primary creator of the isometric feature mapping algorithm, will be used (as given in one of the assignments of the Edx Course Microsoft: DAT210x Programming with Python for Data Science, by replicating his canonical, dimensionality reduction research experiment for visual perception). It consists of 698 samples of 4096-dimensional vectors. These vectors are the coded brightness values of 64×64-pixel heads that have been rendered facing various directions and lighted from many angles.

For this assignment, Dr. Tenenbaum’s experiment will be replicated by:

1. Applying both PCA and Isomap to the 698 raw images to derive 2D principal components and a 2D embedding of the data’s intrinsic geometric structure.
2. Projecting both onto a 2D scatter plot, with a few superimposed face images on the associated samples.
3. Increasing n_components to three and plotting.

The following figure shows first 100 images from the dataset.

Applying Principal Component Analysis (PCA)

Let’s first apply PCA on the dataset and choose the first 3 principal components. If we project the data onto these components, to obtain the following plots. Notice that with only 3 principal components chosen, the projected images approximate the original images quite well.

Applying Kernel Principal Component Analysis (KPCA)

Now let’s apply KPCA on the dataset (with **Gaussian Kernel) and choose only the first 3 principal components. If we project the data onto these components, to obtain the following plots. Notice that with only 3 principal components chosen, the projected images approximate the original images quite well.

Applying Isomap

Finally let’s apply Isomap on the same dataset with 8 nearest neighbors and choose only the first 3 manifolds. If we project the data onto these components, to obtain the following plots. Notice that with only 3 manifolds chosen, the projected images approximate the original images quite well.

As can be seen from the above results, between linear PCA, Kernel PCA and the non-linear Isomap, the Isomap algorithm is better able to capture the true nature of the faces dataset when reduced to two component.

Each coordinate axis of the 3D manifold correlates highly with one degree of freedom from the original, underlying data (e.g., Left, Right, Down, Up Head Positions).

Applying Linear PCA vs. Kernel PCA (with Gaussian Kernel) for dimensionality reduction on a few datasets in R

In this article, both the linear PCA and the kernel PCA will be applied on a few shape datasets, to show whether the structure of the data (in terms of different clusters) in higher dimensions are preserved in the lower dimension or not for both the methods.

1. For the linear PCA, as usual, the dataset is first z-score normalized and then the eigen-analysis of the covariance matrix is done. Then to reduce the dimension, the dataset is projected onto the first few principal components (dominant eigenvectors of the covariance matrix).
2. For the kernel PCA, Gaussian Kernel is used to compute the distances between the datapoints and the Kernel matrix is computed (with the kernel trick), then normalized. Next the eigen-analysis of the Kernel matrix is done. Then to reduce the dimension, the first few dominant eigenvectors of the kernel matrix are chosen, which implicitly represent the data already projected on the principal components of the infinite dimensional space. The next figure shows the algorithms for the two methods.

Training Backpropagation Neural Nets on the handwritten digits dataset

1. There are 5000 examples of handwritten digits in the dataset. Some of the examples are shown below. We have 10 class labels here: the digits 0:9. Given an image we want to classify it as one of the 10 classes.

2. The next set of equations show how the feed-forward and back-propagation learning happens in a neural net with a single hidden layer.

3. The next figure shows how the cost function decreases with #iterations of the numerical optimization algorithm used (conjugate gradient) with 25 hidden units in the hidden layer.

4. The next figures show the hidden features learnt for different number of hidden units in the neural net.

5. The next figure shows the accuracy on the training dataset obtained with different number of hidden units in the neural net. Ideally we should expect a strict increase in accuracy on the training set (leading to overfitting),  but there is a slight zig-zag pattern that can be attributed to the random initialization of the weights and the regularization.

Decision boundaries learnt by training with some R library classifiers

1. A 2-D non-separable dataset (generated uniformly and randomly) is used to train and generate the decision boundaries with a few of the following R library classifiers.
2. The following are the decision boundaries learnt by different ML models from a few R packages, when trained on the above dataset.
3. The following are the decision boundaries learnt a few linear classifiers such as (logistic regression, with / with L1/L2 penalties, linear discriminant analysis) and also with Naive Bayes.
4. The following are the decision boundaries learnt by k-nearest neighbor algorithm with different values of k.
5. The following are the decision boundaries learnt by a backpropagation neural net with single hidden layer with different numbers of hidden units.
6. The following are the decision boundaries learnt by the CART decision tree with different values of the pruning parameter.
7. The following are the decision boundaries learnt by the random forest ensemble classifier with different numbers of trees.
8. The following are the decision boundaries learnt by the support vector machine classifier with different values of the C parameter.
9. As can be seen, the models overfit / underfit the training data with different values of the parameters used for learning.

Using Multiclass Softmax Multinomial Regularized Logit Classifier and One vs. all Binary Regularized Logistic Regression Classifier with Gradient Descent

In this article, the gradient-descent-based implementations of two different techniques softmax multinomial logit classifier vs. one-vs-all binary logistic regression classifier (both of them with L2 regularization) are going to be compared for multi-class classification on the handwritten digits dataset.

1. There are 5000 examples of handwritten digits in the dataset. Some of the examples are shown below. We have 10 class labels here: the digits 0:9. Given an image of a digit, we want to classify it as one of the 10 classes.

2. The next set of equations show the cross-entropy sigmoid cost function and the gradient for the binary sigmoid logistic regression classifier. In one-vs-all technique we need to learn 10 such logistic regression classifiers and then combine the parameter vectors learnt (using batch gradient descent) to a 10-column matrix and then classify an image.

3. The next figure shows the cross-entropy softmax cost function and the gradient for the multiclass softmax multinomial logit classifier. Here we learn the 10-column parameter matrix directly at once (using batch gradient descent) and then classfiy an image.

4. The entire dataset is divided into two parts: 4000 samples were taken for for training and 1000 samples were taken for test. The following figure shows the performance of the classifiers in terms of accuracy and also time taken to train and predict. Both the classifiers were trained with same set of parameters for gradient descent (for one-vs-all logistic regression each of the 10 classifiers were trained using gradient descent upto 1500 iterations, where for the softmax multinomial logit just one classifier was trained with gradient descent upto 1500 iterations) and regularization. As can be seen in the results, the performance of the classifiers on the test dataset is almost same (with accuracy 0.9), but the softmax classifier took 1/7 th fraction of time the one-vs-all logistic regression classifier took.

5. The next figure shows how the gradient descent reduces the cost function (error) when run for 5000 iterations, both for the 10 one-vs-all binary logistic regression classifiers and softmax multinomial logit classifier.
6. The next figure shows how the training and test dataset error changes w.r.t. number of iterations in gradient descent with the softmax multinomial logit classifier.

Dual Percptron and Kernels: Learning non-linear decision boundaries

In this article, the dual perceptron implementation and some non-linear kernels are used to separate out some linearly inseparable datasets.

1. The following figure shows the dual perceptron algorithm used. The implementation is comparative fast since we don’t need to generate the non-linear features by ourselves,  the kernel trick does that for us.
2. Few 2-D training datasets are generated uniformly randomly and the corresponding binary class labels are also generated randomly. Then the dual implementation of the perceptron with a few non-linear kernels, such as Gaussian (with different bandwidth), Polynomial (with different degrees) were used to separate out the positive from the negative class data points. The following figures show the decision boundaries learnt by the perceptron upon convergence.
3. The next animation shows the iteration steps how the dual perceptron implementation (with Gaussian Kernel with bandwidth 0.1) converges on a given training dataset.
4. The next figure shows how the dual perceptron with polynomial kernel overfits with the increase in the degree of the polynomial. Also, with the Gaussian kernel it learns a different decision boundary altogether.

Convergence of the gradient descent algorithms (stochastic and batch) for the linear / logistic regression and perceptron

In this article, the convergence of the optimization algorithms for the linear regression and the logistic regression is going to be shown using online (stochastic) and batch gradient descent on a few datasets. Also, the online and batch version of the perceptron learning algorithm convergence will be shown on a synthetically generated dataset.

1. The following simple dataset (obtained from Andrew Ng’s Coursera ML course) is used for running the gradient descent algorithms.

## [1] "Theta found by running batch gradient descent (with 600 iterations): "
##                [,1]
## rep(1, m) -3.112591
## data[, 1]  1.114354
## [1] "Theta found by Stochastic gradient descent (with 10 repeats): "
##           [,1]
## [1,] -1.947365
## [2,]  1.251294
## [1] "Theta found with normal equation: "
##                [,1]
## rep(1, m) -3.895781
## data[, 1]  1.193034
2. The next figure shows how the stochastic (noisy and slower convergence) and batch gradient descent (steady and much faster convergence) algorithms converge to the theta parameter value that minimizes the cost function (computed using normal equation, shown as a red dot in the figures).
3. The following figure shows the stochastic and batch versions of the algorithms.
4. Next the gradient descent algorithms are run on the two variables (wt and mpg) of the mtcars datatset in R, wt being the predictor and mpg being the response variable. The following figures / animations show the results.
## [1] "Theta found by running batch gradient descent (with 600 iterations): "
##                [,1]
## rep(1, m) 37.248255
## data[, 1] -5.333882
## [1] "Theta found by Stochastic gradient descent (with 10 repeats): "
##           [,1]
## [1,] 25.954195
## [2,] -2.342769
## [1] "Theta found with normal equation: "
##                [,1]
## rep(1, m) 37.285126
## data[, 1] -5.344472

5. Now the output variable mpg (miles per galon) is converted to a binary variable by binning it around the median value of the variable (if > 19 then mpg is high or 1, otherwise it’s 0). Now the gradient descent algorithms are run on the two variables (wt and mpg) of the mtcars datatset in R, wt being the predictor and mpg being the response variable, but this time using the following cost function for the logistic regression.
6. The following figures / animations show the results of running the gradient descent algorithms and their convergence. The true minimum is computed using a more efficient optimization algorithm (such as conjugate gradient) and shown as a red dot in the figures.
7. Next let’s compare the convergence rates of a couple of uniformly randomly generated 2-D linearly separable datasets with batch and the online versions of perceptron (pocket) algorithm. The following shows the batch and the stochastic versions of the perceptron learning algorithm.
8. As can be seen from the following animations, in the first case the batch version converges much faster than the online counterpart in the synthetic dataset generated, whereas in the second case, the online version converges early.