In this article a few machine learning problems from a few online courses will be described.
1. Fitting the distribution of heights data
This problem appeared as an assignment problem in the coursera course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself. This video explains this problem and the solution in details.
In this assessment the steepest descent will be used to fit a Gaussian model to the distribution of heights data that was first introduced in Mathematics for Machine Learning: Linear Algebra.
The algorithm is the same as Gradient descent but this time instead of descending a
pre-defined function, we shall descend the χ2 (chi squared) function which is both a function of the parameters that we are to optimize, but also the data that the model is
to fit to.
Background
We are given a dataset with 100 data-points for the heights of people in a population, with x as heights (in cm.) and y as the probability that there is a person with that height, first few datapoints are shown in the following table:
x | y | |
---|---|---|
0 | 51.25 | 0.0 |
1 | 53.75 | 0.0 |
2 | 56.25 | 0.0 |
3 | 58.75 | 0.0 |
4 | 61.25 | 0.0 |
5 | 63.75 | 0.0 |
6 | 66.25 | 0.0 |
7 | 68.75 | 0.0 |
8 | 71.25 | 0.0 |
9 | 73.75 | 0.0 |
The dataset can be plotted as a histogram, i.e., a bar chart where each bar has a width representing a range of heights, and an area which is the probability of finding a person with a height in that range, using the following code.
import matplotlib.pylab as plt plt.figure(figsize=(15,5)) plt.bar(x, y, width=3, color=greenTrans, edgecolor=green) plt.xlabel('x') plt.ylabel('y') plt.show()
We can model that data with a function, such as a Gaussian, which we can specify with two parameters, rather than holding all the data in the histogram. The Gaussian function is given as,
By definition χ2 as the squared difference of the data and model, i.e., χ2 = |y − f (x; μ, σ)|^2.
x an y are represented as vectors here, as these are lists of all of the data points, the
|abs-squared|^2 encodes squaring and summing of the residuals on each bar.
To improve the fit, we shall want to alter the parameters μ and σ, and ask how that changes the χ2. That is, we will need to calculate the Jacobian,
Let’s look at the first term, ∂(χ2)/∂μ, using the multi-variate chain rule, this can be written as,
A similar expression for ∂(χ2)/∂σ can be obtained as follows:
The Jacobians rely on the derivatives ∂f/∂μ and ∂f/∂σ. It’s pretty straightforward to implement the python functions dfdmu() and dfdsig() to compute the derivatives.
Next recall that steepest descent shall move around in parameter space proportional to the negative of the Jacobian, i.e.,
with the constant of proportionality being the aggression of the algorithm.
The following function computes the expression for the Jacobian.
def steepest_step (x, y, mu, sig, aggression) : J = np.array([ -2*(y - f(x,mu,sig)) @ dfdmu(x,mu,sig), -2*(y - f(x,mu,sig)) @ dfdsig(x,mu,sig) ]) step = -J * aggression return step
We need to run a few rounds of steepest descent to fit the model. The next piece of code builds the model with steepest descent to fit the heights data:
# Do a few rounds of steepest descent. for i in range(50) : dmu, dsig = steepest_step(x, y, mu, sig, 2000) mu += dmu sig += dsig p = np.append(p, [[mu,sig]], axis=0)
The following animations show the steepest descent path for the parameters and the model fitted to the data, respectively.
The data is shown in orange, the model in magenta, and where they overlap it’s shown in green.
χ2 is represented in the figure as the sum of the squares of the pink and orange bars.
This particular model has not been fit well with the initial guess – since there is not a strong overlap.
But gradually the model fits the data better as more and more iterations of steepest descent are run.
Note that the path taken through parameter space is not necessarily the most direct path, as with steepest descent we always move perpendicular to the contours.
2. Back-propagation
This problem also appeared as an assignment problem in the coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.
In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.
The below table shows the first few rows of the dataset. Here x is the input variable and y1, y2 are the output variables.
x | y1 | y2 | |
---|---|---|---|
0 | 0.00 | 0.500000 | 0.625000 |
1 | 0.01 | 0.500099 | 0.627015 |
2 | 0.02 | 0.500788 | 0.632968 |
3 | 0.03 | 0.502632 | 0.642581 |
4 | 0.04 | 0.506152 | 0.655407 |
5 | 0.05 | 0.511803 | 0.670852 |
6 | 0.06 | 0.519955 | 0.688198 |
7 | 0.07 | 0.530875 | 0.706641 |
8 | 0.08 | 0.544723 | 0.725326 |
9 | 0.09 | 0.561537 | 0.743386 |
The next figures show how the data looks:
To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.
We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.
Feed forward
The following figure shows the feed-forward equations,
In the following python code (taken from the same assignment) defines functions to set up our neural network. Namely an activation function, σ(z), it’s derivative, σ′(z), a function to initialize weights and biases, and a function that calculates each activation of the network using feed-forward.
In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or relu.
sigma = lambda z : 1 / (1 + np.exp(-z)) d_sigma = lambda z : np.cosh(z/2)**(-2) / 4 # This function initialises the network with it's structure, it also resets any training already done. def reset_network (n1 = 6, n2 = 7, random=np.random) : global W1, W2, W3, b1, b2, b3 W1 = random.randn(n1, 1) / 2 W2 = random.randn(n2, n1) / 2 W3 = random.randn(2, n2) / 2 b1 = random.randn(n1, 1) / 2 b2 = random.randn(n2, 1) / 2 b3 = random.randn(2, 1) / 2 # This function feeds forward each activation to the next layer. It returns all weighted sums and activations. def network_function(a0) : z1 = W1 @ a0 + b1 a1 = sigma(z1) z2 = W2 @ a1 + b2 a2 = sigma(z2) z3 = W3 @ a2 + b3 a3 = sigma(z3) return a0, z1, a1, z2, a2, z3, a3 # This is the cost function of a neural network with respect to a training set. def cost(x, y) : return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size
Backpropagation
Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.
The cost function C is the sum (or average) of the squared losses over all training examples:
The following python code shows how the J_W3 function can be implemented.
# Jacobian for the third layer weights. def J_W3 (x, y) : # First get all the activations and weighted sums at each layer of the network. a0, z1, a1, z2, a2, z3, a3 = network_function(x) # We'll use the variable J to store parts of our result as we go along, updating it in each line. # Firstly, we calculate dC/da3, using the expressions above. J = 2 * (a3 - y) # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3. J = J * d_sigma(z3) # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative, # i.e. dz3/dW3 = a2 # and divide by the number of training examples, for the average over all training examples. J = J @ a2.T / x.size # Finally return the result out of the function. return J
The following python code snippet implements the Gradient Descent algorithm (where the parameter aggression represents the learning rate and noise acts as a regularization parameter here):
while iterations < max_iteration: j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise) j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise) j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise) j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise) j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise) j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise) W1 = W1 - j_W1 * aggression W2 = W2 - j_W2 * aggression W3 = W3 - j_W3 * aggression b1 = b1 - j_b1 * aggression b2 = b2 - j_b2 * aggression b3 = b3 - j_b3 * aggression
The next figures and animations show how the prediction curve (pink / orange) with the neural net approaches the original curve (green) as it’s trained longer.
With learning rate = 7
The next figures and animations visualize all the curves learnt at different iterations.
With learning rate = 1
We can change parameters of the steepest descent algorithm, e.g., how aggressive the learning step is, and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the paramaters.
We can compute the model error (sum of square deviation in between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.
3. The Kernel Perceptron
This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta).
We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.
The next figure describes the theory and the algorithm (in dual form) for Kernel Perceptron. As can be seen, the kernel trick can be used both at the training and the prediction time to avoid basis expansion (by replacing the dot products of the expanded feature vectors with a Mercer Kernel).
The datasets on which we are going to classify with the dual perceptron algorithm are 2-dimensional datasets, each datapoint with a label +1 or -1, the first few datapoints of a dataset is shown below:
X1 | X2 | y | |
---|---|---|---|
0 | 1.0 | 1.0 | 1.0 |
1 | 2.0 | 1.0 | 1.0 |
2 | 3.0 | 1.0 | 1.0 |
3 | 4.0 | 1.0 | 1.0 |
4 | 5.0 | 1.0 | 1.0 |
The ground-truth of the data-points are represented by their color (red and black) and marker type (circle and triangle respectively).
The data-point that is mis-classified in a particular iteration is shown in blue.
When a mis-classified point is selected, the corresponding alpha value is up-voted, this is indicated by increase in the size of the data-point.
The decision boundary for the two classes are shown with green and magenta colors, respectively.
The following figures and animations show the classification of the datasets using kernel perceptron with RBF and quadratic kernels. The next python code snippet implements the kernel functions.
import numpy as np def kernel(x, z, type, s): if type == 'rbf': return np.exp(-np.dot(x-z, x-z)/s**2) if type == 'quadratic': return (1 + np.dot(x, z))**2 return np.dot(x, z)
Results
The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.
Dataset 1
Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.
Dataset 2
Results with RBF Kernel
Results with quadratic Kernel
Dataset 3
Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.
C
is the setting of the soft-margin parameterC
(default: 1.0)s
(for the RBF kernel) is the scaling parameters
(default: 1.0)
Dataset 4
Results with RBF Kernel
Results with quadratic Kernel
Dataset 5
Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.
Results with Kernel SVM Classifier (sklearn)
The following code and the figures show the decision boundaries and the support vectors (datapoints with larger size) learnt with sklearn SVC.
from sklearn.svm import SVC x = data[:,0:2] y = data[:,2] clf = SVC(kernel=kernel, C=C, kernel = 'rbf', gamma=1.0/(s*s)) clf.fit(x,y) clf.support_
Dataset 1
With polynomial kernel (degree=2, C=1)
With RBF kernel (C=10, σ = 10)
Dataset 2
With polynomial kernel (degree=2, C=1)
With RBF kernel (C=10, σ = 10)
Dataset 3
With RBF kernel (C=10, σ = 10)
4. Models for handwritten digit classification
This problem is taken from a few assignments from the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta). The problem description is taken from the course itself.
In this assignment we will build a few classifiers that take an image of a handwritten digit and outputs a label 0-9. We will start with a particularly simple strategy for this problem known as the nearest neighbor classifier, then a Gaussian generative model for classification will be built and finally an SVM model will be used for classification.
The MNIST dataset
MNIST
is a classic dataset in machine learning, consisting of 28×28 gray-scale images handwritten digits. The original training set contains 60,000 examples and the test set contains 10,000 examples. Here we will be working with a subset of this data: a training set of 7,500 examples and a test set of 1,000 examples.
The following figure shows the first 25 digits from the training dataset along with the labels.
Similarly, the following figure shows the first 25 digits from the test dataset along with the ground truth labels.
Nearest neighbor for handwritten digit recognition
Squared Euclidean distance
To compute nearest neighbors in our data set, we need to first be able to compute distances between data points. A natural distance function is Euclidean distance: for two vectors x, y ∈ ℝ^d, their Euclidean distance is defined as
Often we omit the square root, and simply compute squared Euclidean distance.
For the purposes of nearest neighbor computations, the two are equivalent: for three vectors x, y, z ∈ ℝ^d, we have ∥x−y∥≤∥x−z∥ if and only if ∥x−y∥^2≤∥x−z∥^2.
The following python function squared Euclidean distance.
## Computes squared Euclidean distance between two vectors. def squared_dist(x,y): return np.sum(np.square(x-y)
Computing nearest neighbors
Now that we have a distance function defined, we can now turn to (1-) nearest neighbor classification, with the following naive implementation with 0 training / pre-processing time.
## Takes a vector x and returns the index of its nearest neighbor in train_data def find_NN(x): # Compute distances from x to every row in train_data distances = [squared_dist(x,train_data[i,]) for i in range(len(train_labels))] # Get the index of the smallest distance return np.argmin(distances) ## Takes a vector x and returns the class of its nearest neighbor in train_data def NN_classifier(x): # Get the index of the the nearest neighbor index = find_NN(x) # Return its class return train_labels[index]
The following figure shows a test example correctly classified by finding the nearest training example and another incorrectly classified.
Processing the full test set
Now let’s apply our nearest neighbor classifier over the full data set.
Note that to classify each test point, our code takes a full pass over each of the 7500 training examples. Thus we should not expect testing to be very fast.
## Predict on each test data point (and time it!) t_before = time.time() test_predictions = [NN_classifier(test_data[i,]) for i in range(len(test_labels))] t_after = time.time() ## Compute the error err_positions = np.not_equal(test_predictions, test_labels) error = float(np.sum(err_positions))/len(test_labels) print("Error of nearest neighbor classifier: ", error) print("Classification time (seconds): ", t_after - t_before)
(‘Error of nearest neighbor classifier: ‘, 0.046)
(‘Classification time (seconds): ‘, 41.04900002479553)
The next figure shows the confusion matrix for classification
Faster nearest neighbor methods
Performing nearest neighbor classification in the way we have presented requires a full pass through the training set in order to classify a single point. If there are N training points in ℝ^d, this takes O(Nd) time.
Fortunately, there are faster methods to perform nearest neighbor look up if we are willing to spend some time pre-processing the training set. scikit-learn
has fast implementations of two useful nearest neighbor data structures: the ball tree and the k-d tree.
from sklearn.neighbors import BallTree ## Build nearest neighbor structure on training data t_before = time.time() ball_tree = BallTree(train_data) t_after = time.time() ## Compute training time t_training = t_after - t_before print("Time to build data structure (seconds): ", t_training) ## Get nearest neighbor predictions on testing data t_before = time.time() test_neighbors = np.squeeze(ball_tree.query(test_data, k=1, return_distance=False)) ball_tree_predictions = train_labels[test_neighbors] t_after = time.time() ## Compute testing time t_testing = t_after - t_before print("Time to classify test set (seconds): ", t_testing)
(‘Time to build data structure (seconds): ‘, 0.3269999027252197)
(‘Time to classify test set (seconds): ‘, 6.457000017166138)
similarly, with the KdTree data structure we have the following runtime:
(‘Time to build data structure (seconds): ‘, 0.2889997959136963)
(‘Time to classify test set (seconds): ‘, 7.982000112533569)
Next let’s use sklearn’s KNeighborsClassifier to compare with the runtimes.
from sklearn.neighbors import KNeighborsClassifier neigh = KNeighborsClassifier(n_neighbors=1) neigh.fit(train_data, train_labels) predictions = neigh.predict(test_data)
(‘Training Time (seconds): ‘, 0.2999999523162842)
(‘Time to classify test set (seconds): ‘, 8.273000001907349)
The next figure shows the error rate on the test dataset with k-NearestNeighbor classifier with different values of k.
Training the 1-NN classifier on the entire training dataset with 60k images and testing on the entire testset with 10k images yields the the following results:
(‘Training Time (seconds): ‘, 19.694000005722046)
(‘Time to classify test set (seconds): ‘, 707.7590000629425)
with the following accuracy on the test dataset and the confusion matrix:
accuracy: 0.9691 (error 3.09%)
Gaussian generative models for handwritten digit classification
Recall that the 1-NN classifier yielded a 3.09% test error rate on the MNIST data set of handwritten digits. We will now see that a Gaussian generative model does almost as well, while being significantly faster and more compact.
For this assignment we shall be using the entire MNIST
dataset, the training dataset contains 60k images and the test dataset contains 10k images.
Fit a Gaussian generative model to the training data
The following figure taken from the lecture videos from the same course describes the basic theory.
Let’s Define a function, fit_generative_model, that takes as input a training set (data x
and labels y
) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9
, we have:
pi[j]
: the frequency of that labelmu[j]
: the 784-dimensional mean vectorsigma[j]
: the 784×784 covariance matrix
This means that pi
is 10×1, mu
is 10×784, and sigma
is 10x784x784.
We need to fit a Gaussian generative model. The parameters pi, mu and sigma are computed with corresponding maximum likelihood estimates (MLE) values: empirical count, mean and covariance matrix for each of the class labels from the data. However, now there is an added ingredient.
The empirical covariances are very likely to be singular (or close to singular), which means that we won’t be able to do calculations with them. Thus it is important to regularize these matrices. The standard way of doing this is to add cI
to them, where c
is some constant and I
is the 784-dimensional identity matrix. (To put it another way, we compute the empirical covariances and then increase their diagonal entries by some constant c
).
This modification is guaranteed to yield covariance matrices that are non-singular, for any c > 0
, no matter how small. But this doesn’t mean that we should make c
as small as possible. Indeed, c
is now a parameter, and by setting it appropriately, we can improve the performance of the model. We will study regularization in greater detail over the coming weeks.
The following python code snippet shows the function:
def fit_generative_model(x,y): k = 10 # labels 0,1,...,k-1 d = (x.shape)[1] # number of features mu = np.zeros((k,d)) sigma = np.zeros((k,d,d)) pi = np.zeros(k) c = 3500 # regularizer for label in range(k): indices = (y == label) pi[label] = ... # empirical count mu[label] = ... # empirical mean sigma[label] = ... # empirical regularized covariance matrix return mu, sigma, pi
Now let”s visualize the means of the Gaussians for the digits.
SVM for handwritten digit classification
The entire training dataset from the MNIST dataset is used to train the SVM model, the training dataset contains 60k images and the test dataset contains 10k images.
First let’s try linear SVM, the following python code:
from sklearn.svm import LinearSVC clf = LinearSVC(C=C, loss='hinge') clf.fit(train_data,train_labels) score = clf.score(test_data,test_labels)
The following figure shows the training and test accuracies of LinearSVC with different values of the hyper-parameter C.
Next let’s try SVM with quadratic kernel, as can be seen it gives 98.06% accuracy on the test dataset with C=1.
from sklearn.svm import SVC clf = SVC(C=1., kernel='poly', degree=2) clf.fit(train_data,train_labels) print clf.score(train_data,train_labels) print clf.score(test_data,test_labels)
training accuracy: 1.0
test accuracy: 0.9806 (error: 1.94%)
The following figure shows the confusion matrix: