# Kernel K-Means, K-Means++ and Cluster Evaluation

In this article, the R / Python implementations of KMeans Clustering and Kernel KMeans Clustering algorithms will be used to cluster a few datasets. Then the cluster assignments are compared with the ground truths w.r.t. (R implementations of) the following supervised clustering evaluation metrics: purity and NMI.

## KMeans and Kernel KMeans Clustering

1. The following two animations show the partitions induced in by KMeans clustering in the following two datasets. On the first dataset, KMeans does a good job, but on the second dataset it performs poorly.
2. The next animation shows that the Kernel KMeans Clustering (implemented in R with Gaussian Kernel) performs considerably better on the second dataset, starting with the Gram matrix ϕ, with ϕ(i,j)=exp((XiXj)^2/(2σ^2)), with σ=5.
3. The next couple of animations show the clusters obtained at diffetent iterations with the algorithms KMean and Kernel KMeans clustering (with σ=4) respectively, on the same dataset, but this time implemented in python.

4. The next couple of animations show the clusters obtained at diffetent iterations with the algorithms KMeans and Kernel KMeans clustering (with σ=4) respectively, on another dataset.

5. The next couple of animations show the clusters obtained at diffetent iterations with the algorithms KMean and Kernel KMeans clustering (with σ=4) respectively, on yet another dataset.

## KMeans++ Smart Initialization

1. Instead of random initialization for the KMeans clustering to start with the initial centroids, KMeans++ uses a smart initialization to probabilisitically assure that the centroids are far from each other. The following figure describes the smart initialization.
2. The following figures / animations show the comparative results in between KMeans with and without smart KMeans++ initialization, in terms of initialization time taken, the iterative convergence of the algorithm to a local optimum and cluster quality (in terms of heterogeneity) on a dataset with k=7 clusters. As can be seen, in general KMeans++ smart initialization takes more time, but generally tends to converge faster and to better quality clusters.

3. The next figures / animations show the comparative results in between KMeans with and without smart KMeans++ initialization, in terms of initialization time taken, the iterative convergence of the algorithm to a local optimum and cluster quality (in terms of heterogenity) on another dataset with k=5 clusters.

## Cluster Evaluations

1. The following figure shows the mathematical formulations of two supervised clustering evaluation metrics, purity and NMI that will be used to evaluate the clusters obtained.
2. As can be seen from the following evaluation results by running the algorithms on the following dataset, Kernel KMeans with K=2performs the best (keeping σσ of the kernel fixed at 4) in terms of the purity and NMI, when compared to the ground truth shown next.
3. Again, as can be seen from the following evaluation results by running the algorithms on another dataset, Kernel KMeans with K=2 performs the best (keeping σσ of the kernel fixed at 4) in terms of the purity and NMI, when compared to the ground truth shown next.
4. Now, keeping the number of clusters fixed at K=2, the σσ for the kernel is varied for both the datasets. As shown in the next figures / animations and results, for the kernel higher σσ values, Kernel KMeans performs better in terms of the purity and NMI.

# Implementing Low-Rank Matrix Factorization with Alternating Least Squares Optimization for Collaborative Filtering Recommender System in R

In this article, the low rank matrix facotrization will be used to predict the unrated movie ratings for the users from MovieLense (100k) dataset (given that each user has rated some of the movies). The data was collected through the MovieLens web site (movielens.umn.edu) during the seven-month period from September 19th, 1997 through April 22nd, 1998. The data set contains about 100,000 ratings (1-5) from 943 users on 1664 movies.

1. Each cell R{i,j} of the movie rating matrix contains the Rating given to the i_th movie by the j_th user (if any, otherwise NA), so that the rows represent the movies and the columns represents the users.
2. The intuition behind using matrix factorization is that there must be some latent features (much smaller than the number of users and the number of movies) that determine how a user rates a movie. We have a set n_m of movies and a set n_u of users and a rating matrix R of size n_m×n_u be the matrix that contains all the ratings that the movies are assigned by the users. We are interested to discover n latent features (n=10 latent features are used). Then our task, then, is to find two matrices θ (a n_m×n matrix) and X (a n_u×n matrix) such that their product approximates R, such that the error for the users/movie pairs where we know the correct ratings is minimized.
3. The couple of optimization problems are to be solved that can be done in two different ways:
• Solve with Alternate Least Squares Method (appears as an assigment in the BerkeleyX edX Course on Big Data Analysis with Spark): The ALS algorithm first randomly fills the users matrix with values and then optimizes the value of the movies such that the error is minimized. Then, it holds the movies matrix constant and optimizes the value of the user’s matrix. This alternation between which matrix to optimize is the reason for the “alternating” in the name. Given a fixed set of user factors (i.e., values in the users matrix), we use the known ratings to find the best values for the movie factors using the optimization written at the bottom of the figure. Then we “alternate” and pick the best user factors given fixed movie factors. The next figure shows the outline of this algorithm.
• Simultaneous optimization Method (from the Coursera ML Course by Stanford Prof. Andrew Ng.): By combining the two optimization problems and solving them simultaneously. Both the methods are show in the next figure.
4. The next figures show the results from both the algorithms: Conjugate Gradient method (with 100 iterations) is used in both cases and both the methods were iteratively called for 20 times, with regularization parameter λ=10. Then the RMSE value between the original rating matrix and the estimated rating matrix is computed for both the cases. The following figures show that the Simultaneous Optimization method achieved lower RMSE value immediately, while ALS took some iterations to get to the same RMSE value.
5. The next figures show how the RMSE changed with the regularization parameter λ, for both the methods. As expected, with lower values of λ the RMSE is much lower (due to possible overfitting).
6. The next figures show the original rating matrix and the estimated rating matrix with ALS matrix factorization (the rating matrix is shown as an adjacency matrix of a neighborhood graph).

# Implementing Radial Basis Function Classifier in R

• In this article, implementation of an RBF (Radial Basis Function) Classifier will be described for binary classification.
• The algorithm consists of two different parts:
1. Unsupervised: Find k << N centers (N = # data points) for the RBF network.
2. Supervised: Compute the weights w of the classifier hypothesis h(x) from the (training) data using the labels.
• k centroids will be used for the RBF network and those centroids will first be found with Lloyd’s algorithm.
• Then parameter vector w will be computed by solving the least square equation which is linear in ϕ.
• The following figure shows the outline of the algorithm to be used.

• Number of centers k and the the γ parameters will be varied to obtain different decision boundaries for classification.
• As can be seen from the experiments on a few 2 dimensional (labeled) datasets, γ works as regularization parameter for the classifier, at lower value of the parameter, the classifier in-sample accuracy is higher.
• The next two animations show the decision boundary contours learnt by the RBF classifier as a dark blue line on the same dataset for different γ values. The centers of the RBF network is represented by the blue stars surrounded by the black circles.
• The first animation shows the results of the RBF classifier starting with 3 centers, while the second one starting with 5 centers.

• The next two animations show the decision boundary contours learnt by the RBF classifier as a solid black line on another two different datasets, for different γ values. The centers of the RBF network is represented by the black stars surrounded by the black circles.
• The first animation shows the results of the RBF classifier starting with 3 centers, while the second one starting with 5 centers.

# KMeans for Image Compression, PCA / MDS / SVD for Visualization in the reduced dimension

## Image Compression and Color-based segmentation with KMeans

• This problem appears in an exercise of the Coursera ML course by Andrew Ng.
• Couple of colored images are used as input 3D dataset and the Kmeans clustring is run with k arbitray cluster centroids. After the algorithm converges, each pixel x is assigned to the color of the nearest cluster $$argmin_{i}d(x,c_i)$$.
• If the input image contains n different colors, the output image can be represented using only k << n colors, achieving high level of compression, by grouping regions with similar colors in a single cluster.
• The following animations show the result of the clustering with two different images and for a few diffeent k values.
• Next, a bird image is segmented with k=16.

• Next, a face image is segmented with k=16.

• Next, the same face image is segmented with k=8. As can be noticed, the image quality becomes poorer, as lesser and lesser number of colors are used.
• Next, the same face image is segmented with k=2 to obtain a binary image. Also, the convergence is achieved much faster (within first two iterations) in this case.

## Using PCA and MDS for Visualization in the reduced dimension

• This problem also appears in an exercise of the Coursera ML course by Andrew Ng.
• First, the 3-D RGB reprentation of the bird image is compressed with kmean clustering.
• First, PCA and then MDS are used to reduce the dimension (to n=2) of the compressed image and then it is visualized in reduced dimension.
• As can be seen, although the spatial neighborhood is not preserved in the very representation of the image, still some structure of the bird is preserved even in the reduced dimensions.

• Next the face dataset is used (with 5000 images of size 32×32) and PCA is used to find the top k dominant eigenvectors for different k. The eigenvectors look like faces, so they are called eigenfaces. The following image shows 100 of those faces and a few of the principal component eigenfaces.

• The next figure / animation shows 100 of those normalized eigenfaces and the faces reconstructed with top k eigenfaces using X_{approx} ≈ X.e[1:k].(e[1:k])’. As can be seen that the original faces can be closely reconstructed, when only k=100 principal components are used, instead of using all of 1024 dimensions.

## Using SVD, PCA and FFT for dimension reduction and compression

• First, the following two different implementations of the PCA will be used to reduce the dimensions of a 453×378 image and reconstruction of the image in the reduced dimension.
1. Implemented with the SVD (numerically stable, as done by R prcomp)
2. Implemented with the Eigen-decomposition of covariance matrix computed with the z-score-normalized image matrix (numerically less stable, as done by R princomp)
• In both the above cases, only the first k orthonormal principal components will be used to reconstruct the image, as shown in the following animation (the first one being the SVD implementation of PCA).

• As can be seen from above, the SVD implementation of PCA is much more robust and less susceptible to numerical errors, only first few principal components suffice to have a very close representation of the image.
• Next, FFT will be used to transform the image to frequency domain and then only first k orthonormal Fourier basis vectors in the frequency domain will be used to reconstruct the image, as shown in the following animation.
• As can be seen from above, PCA works better than DFT in terms of quality of image reconstructed in the reduced dimensions.
• The following figures show the decrease in errors (in between the original and the approximated image using the Frobenius norm / MSE).

# Testing Bayesian Concepts in R: using the Gaussian Conjugate Priors to compute the Posterior Distribution

• In this article, the Gaussian Conjugate Priors will be used to compute the Posterior distribution for some online dataset (1D and 2D) following Gaussian Distribution.
• First, a N(μ0,σ0) prior is chosen to model the unknown mean μ variable of the data, while assuming the data variance σ2 as fixed.
• Then a few i.i.d. samples are drawn from a Gaussian distribution the prior belief about the mean μ is updated, X_iN(μ,σ2) with prior μN(μ0,σ0).
• The posterior probability distribution is also a Gaussian distribution as shown in the figure below from the videos of professor Herbert Lee from the coursera course Bayesian Statistics Concepts by UCSC.

• Then the recursive Bayesian updates and the prior and posterior hyper-parameters and the means are updated as and when a new datapoint is received. Also, the frequentist’s MLE and 95% confidence interval are computed, along with the Bayesian 95% credible interval.
• The following animation shows the results of simulation of 50 such data samples, starting with the prior N(0,10).

• The left bottom plot visualizes the histogram of the data generated.
• Every time a new datapoint is received, the prior belief is updated.
• The right bottom table represents the summary statistics. Prior and Posterior means (of the arrival rate) respectively correspond to the previous and updated beliefs about the mean of the data.
• The next animation shows the same results (with contours) modeling a set of 2D Gaussian samples.

# Gibbs Sampling to find the Best K-mer Motifs from a collection of Dna strings in R: BioInformatics Concepts

• In this article, Gibbs Sampling is used to find the Best K-mer Motifs from a collection of Dna strings. This problem appears in the Rosalindand the Stepic Bioinformatics problem-solving sites and also as an assignment in a Coursera BioInformatics Algorithms course provided by UCSD.
• A Profile-randomly generated k-mer in a string is defined as follows: For each k-mer Pattern in Text, compute the probability Pr(Pattern | Profile), resulting in n = |Text| – k + 1 probabilities (p1, ., pn). These probabilities do not necessarily sum to 1, but can still form the random number generator Random(p1, ., pn) based on them.
• GIBBSSAMPLER uses this random number generator to select a Profile-randomly generated k-mer at each step: if the n-sided (biased) die with probability of face-showups (p1, ., pn) rolls the number i, then the Profile-randomly generated k-mer is defined as the i-th k-mer in Text.
• The algorithm is then repeated a few times with a few random starts and the best motifs are chosen in the end.
• The following shows the algorithm used by to implement the Gibbs Sampler and also couple of sample input datasets used and the output best motifs computed.
• The animations show the Profile matrix (with Information Content Scaled / unscaled) computed at each iteration of the Repeated GibsSampling algorithm and how it changes.

# Using Expectation Maximization for Soft-Clustering in Python

• In this article, Expectation Maximization will be used for soft clustering of a dataset in k clusters. Given n data points in an m dimensional space Data = (Data_1, … , Data_n), let’s first represent their soft assignments to k clusters as a k x n dimensional column-stochastic matrix HiddenMatrix = (HiddenVector_1, … , HiddenVector_n), where each HiddenVector_j itself is a dimensional vector representing the probabilities of assigning the Data_j to each of the k clusters, where \$j ∈ {1,..,k}. Then the k centers can be represented as k points in m dimensional space, Parameters = (θ1, …, θk) (Reference: Stepic, Rosalind and Coursera Course on Bioinformatics Algorithms by UCSD).
• The expectation maximization algorithm starts with a random choice of Parameters. It then alternates between the Estep, in which a responsibility matrix HiddenMatrix for Data given Parameters is computed and the Mstep, in which the Parameters are re-estimated (with MLE) using the HiddenMatrix, till convergence. The iterative algorithm outline is shown below.

• For this assignment, a 2 dimensional dataset with ~1000 data points will be used and EM algorithm will be run to cluster the dataset into 6 segments, starting with 6 random centers. The following figures / animations show the EM algorithm steps and the responsibility matrix and the clusters obtained with different values of the stiffness parameter β.
• As can be seen, the soft assignments come closer to hard assignments as the β parameter value is increased.
• For β = 1.1
• For  β = 10
• For  β = 0.1

# Image Projection with Homography Estimation in R: Computer Vision Concepts

(Sandipan Dey, 17 August 2016)

• In this article, the concepts of projective geometry and homographies will be used to project an image onto a scene in a natural way that respects perspective.
• To demonstrate this, first the UMBC logo will be projected onto the goal during a football match (this appeared as an assignment in the Coursera Course Robotics Perception by the University of Pennsyvania). The images were extracted from a video sequence of a football match, as well as the corners of the goal in each image and an image of the UMBC logo are provided.
• For each image in the video sequence, the homography is computed between the UMBC logo and the goal, and the goal points are then warpped onto the ones in the UMBC logo to generate a projection of the logo onto the video frame.
• Homography Estimation: To project one image patch onto another, the homography between the two image patches are needed to be calculated. This homography is a 3×3 matrix that satisfies the equation x_logoH_xvideo.
• Inverse Warping: first the logo image is projected onto the video frames and then every point in the video frame x_video is replaced with the corresponding point in the logo x_logo. The corresponding math equations needed are shown below:
• The original movie for the football match is shown below.
• The next movie shows the projected image of the UMBC logo on the goal post of the original movie using Homography.
• The next animations show the projected logo / face image of the image of the UMBC buildings, as the proportion of projection varies from 0 to 1.

# Modeling the growth of a sunflower with golden angle and Fibonacci numbers in R

• In this article, a mathematical model for the growth of a sunflower (shown below) will be described (reference: the video lectures of Prof. Jeffrey R Chesnov from Coursera Course on Fibonacci numbers).

• New florets are created close to center.
• Florets move radially out with constant speed as the sunflower grows.
• Each new floret is rotated through a constant angle before moving radially.
• Denote the rotation angle by 2πα, with 0<α<1.
• With ψ=(√51)/2, the most irrational of the irrational numbers and using α=1ψ, the following model of the sunflower growth is obtained, as can be seen from the following animation in R.

• In our model 2πα is chosen to be the golden angle, since α is very difficult to be approximated by a rational number.
• The model contains 34 anti-clockwise and 21 clockwise spirals, which are Fibonacci numbers, since the golden angle α=1ψ can be represented by the continued fraction [02,1,1,1,1,1,1,…].
• Let g / 2π = 1ψ = ψ^2 = 1 / Ø^2 = 1 / (1+ Ø) = [02,1,1,1,1,1,1,…]
• Then we can prove that g(n)/2π F(n)/F(n+2)where g(n) is the n-th rational
approximation
of the golden angle and F(n) is the n-th Fibonacci number
.

• Proof by induction (on n)

# Testing Bayesian Concepts in R: using the Exponential-Gamma Conjugate Priors to compute the Posterior Distribution

• In this article, the Exponential-Gamma Conjugate Priors will be used to compute the Posterior values for the customer arrival rate in a retail shop (Inter-arrival times can be best modeled by an exponential distribution).
• First, a Γ(α,β) prior is chosen with α=4β=0.4 (consistent with our belief that the mean customer arrival rate in the store is 1/3, so that mean arrival time is 3 mins, with a standard deviation of arrival rate of 1/9) to model the unknown mean customer arrival rate λ variable, so that λ ∼ Γ(9,27).
• Then a few trials of a random experiment simulating the customer arrival process are conducted to collect the data and update the prior belief about λ from the likelihood, which (the ) can be modeled as an exponential random variable, ∼ exp(λ),  λ ∼ Γ(α,β).
• The posterior probability distribution is also a Gamma distribution as shown in the figure below from the videos of professor Herbert Lee.

• Then the recursive Bayesian updates and the prior and posterior hyper-parameters and the means are updated as and when a new datapoint is received. Also, the frequentist’s MLE and 95% confidence interval are computed, along with the Bayesian 95% credible interval.
• The following animation shows the results of simulation of 20 customer arrivals and the inter-arrival times, starting with the prior Γ(9,27).

• The left bottom plot visualizes the customers arrivals.
• Every time a new datapoint is received (the next customers arrives at the shop), the prior belief is updated.
• The right bottom table represents the summary statistics. Prior and Posterior means (of the arrival rate) respectively correspond to the previous and updated beliefs about the customers arrival rate at the shop.
• The next animation shows the same results starting with a vague prior Γ(ϵ,ϵ).