# Few Machine Learning Problems (with Python implementations)

In this article a few machine learning problems from a few online courses will be described.

# 1. Fitting the distribution of heights data

This problem appeared as an assignment problem in the coursera course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assessment the steepest descent will be used to fit a Gaussian model to the distribution of heights data that was first introduced in Mathematics for Machine Learning: Linear Algebra.

The algorithm is the same as Gradient descent but this time instead of descending a
pre-defined function, we shall descend the χ2 (chi squared) function which is both a function of the parameters that we are to optimize, but also the data that the model is
to fit to.

## Background

We are given a dataset with 100 data-points for the heights of people in a population, with x as heights (in cm.) and y as the probability that there is a person with that height, first few datapoints are shown in the following table:

x y
0 51.25 0.0
1 53.75 0.0
2 56.25 0.0
3 58.75 0.0
4 61.25 0.0
5 63.75 0.0
6 66.25 0.0
7 68.75 0.0
8 71.25 0.0
9 73.75 0.0

The dataset can be plotted as a histogram, i.e., a bar chart where each bar has a width representing a range of heights, and an area which is the probability of finding a person with a height in that range, using the following code.

import matplotlib.pylab as plt
plt.figure(figsize=(15,5))
plt.bar(x, y, width=3, color=greenTrans, edgecolor=green)
plt.xlabel('x')
plt.ylabel('y')
plt.show()


We can model that data with a function, such as a Gaussian, which we can specify with two parameters, rather than holding all the data in the histogram.  The Gaussian function is given as,

By definition χ2 as the squared difference of the data and model, i.e.,  χ|− (xμσ)|^2.

x an y are represented as vectors here, as these are lists of all of the data points, the
|abs-squared|^2 encodes squaring and summing of the residuals on each bar.

To improve the fit, we shall want to alter the parameters μ and σ, and ask how that changes the χ2. That is, we will need to calculate the Jacobian,

Let’s look at the first term, (χ2)/μ, using the multi-variate chain rule, this can be written as,

A similar expression for (χ2)/σ can be obtained as follows:

The Jacobians rely on the derivatives f/μ and f/σ. It’s pretty straightforward to implement the python functions dfdmu() and dfdsig() to compute the derivatives.

Next recall that steepest descent shall move around in parameter space proportional to the negative of the Jacobian, i.e.,

with the constant of proportionality being the aggression of the algorithm.

The following function computes the expression for the Jacobian.


def steepest_step (x, y, mu, sig, aggression) :
J = np.array([
-2*(y - f(x,mu,sig)) @ dfdmu(x,mu,sig),
-2*(y - f(x,mu,sig)) @ dfdsig(x,mu,sig)
])
step = -J * aggression
return step


We need to run a few rounds of steepest descent to fit the model. The next piece of code builds the model with steepest descent to fit the heights data:


# Do a few rounds of steepest descent.
for i in range(50) :
dmu, dsig = steepest_step(x, y, mu, sig, 2000)
mu += dmu
sig += dsig
p = np.append(p, [[mu,sig]], axis=0)



The following animations show the steepest descent path for the parameters and the model fitted to the data, respectively.

The data is shown in orange, the model in magenta, and where they overlap it’s shown in green.

χ2 is represented in the figure as the sum of the squares of the pink and orange bars.

This particular model has not been fit well with the initial guess – since there is not a strong overlap.

But gradually the model fits the data better as more and more iterations of steepest descent are run.

Note that the path taken through parameter space is not necessarily the most direct path, as with steepest descent we always move perpendicular to the contours.

# 2. Back-propagation

This problem also appeared as an assignment problem in the coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.

The below table shows the first few rows of the dataset. Here x is the input variable and y1, y2 are the output variables.

x y1 y2
0 0.00 0.500000 0.625000
1 0.01 0.500099 0.627015
2 0.02 0.500788 0.632968
3 0.03 0.502632 0.642581
4 0.04 0.506152 0.655407
5 0.05 0.511803 0.670852
6 0.06 0.519955 0.688198
7 0.07 0.530875 0.706641
8 0.08 0.544723 0.725326
9 0.09 0.561537 0.743386

The next figures show how the data looks:

To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.

We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.

## Feed forward

The following figure shows the feed-forward equations,

In the following python code (taken from the same assignment) defines functions to set up our neural network. Namely an activation function, σ(z), it’s derivative, σ(z), a function to initialize weights and biases, and a function that calculates each activation of the network using feed-forward.

In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or relu.


sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4

# This function initialises the network with it's structure, it also resets any training already done.
def reset_network (n1 = 6, n2 = 7, random=np.random) :
global W1, W2, W3, b1, b2, b3
W1 = random.randn(n1, 1) / 2
W2 = random.randn(n2, n1) / 2
W3 = random.randn(2, n2) / 2
b1 = random.randn(n1, 1) / 2
b2 = random.randn(n2, 1) / 2
b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0) :
z1 = W1 @ a0 + b1
a1 = sigma(z1)
z2 = W2 @ a1 + b2
a2 = sigma(z2)
z3 = W3 @ a2 + b3
a3 = sigma(z3)
return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y) :
return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size



## Backpropagation

Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.

The following python code shows how the J_W3 function can be implemented.


# Jacobian for the third layer weights.
def J_W3 (x, y) :
# First get all the activations and weighted sums at each layer of the network.
a0, z1, a1, z2, a2, z3, a3 = network_function(x)
# We'll use the variable J to store parts of our result as we go along, updating it in each line.
# Firstly, we calculate dC/da3, using the expressions above.
J = 2 * (a3 - y)
# Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
J = J * d_sigma(z3)
# Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
# i.e. dz3/dW3 = a2
# and divide by the number of training examples, for the average over all training examples.
J = J @ a2.T / x.size
# Finally return the result out of the function.
return J



The following python code snippet implements the Gradient Descent algorithm (where the parameter aggression represents the learning rate and noise acts as a regularization parameter here):


while iterations < max_iteration:
j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)

W1 = W1 - j_W1 * aggression
W2 = W2 - j_W2 * aggression
W3 = W3 - j_W3 * aggression
b1 = b1 - j_b1 * aggression
b2 = b2 - j_b2 * aggression
b3 = b3 - j_b3 * aggression



The next figures and animations show how the prediction curve (pink / orange) with the neural net approaches the original curve (green) as it's trained longer.

With learning rate = 7

The next figures and animations visualize all the curves learnt at different iterations.

With learning rate = 1

We can change parameters of the steepest descent algorithm, e.g., how aggressive the learning step is, and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the paramaters.

We can compute the model error (sum of square deviation in between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.

## 3. The Kernel Perceptron

This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta).

We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.

The next figure describes the theory and the algorithm (in dual form) for Kernel Perceptron. As can be seen, the kernel trick can be used both at the training and the prediction time to avoid basis expansion (by replacing the dot products of the expanded feature vectors with a Mercer Kernel).

The datasets on which we are going to classify with the dual perceptron algorithm are 2-dimensional datasets, each datapoint with a label +1 or -1, the first few datapoints of a dataset is shown below:

X1 X2 y
0 1.0 1.0 1.0
1 2.0 1.0 1.0
2 3.0 1.0 1.0
3 4.0 1.0 1.0
4 5.0 1.0 1.0

The ground-truth of the data-points are represented by their color (red and black) and marker type (circle and triangle respectively).

The data-point that is mis-classified in a particular iteration is shown in blue.

When a mis-classified point is selected, the corresponding alpha value is up-voted, this is indicated by increase in the size of the data-point.

The decision boundary for the two classes are shown with green and magenta colors, respectively.

The following figures and animations show the classification of the datasets using kernel perceptron with RBF and quadratic kernels. The next python code snippet implements the kernel functions.


import numpy as np
def kernel(x, z, type, s):
if type == 'rbf':
return np.exp(-np.dot(x-z, x-z)/s**2)
if type == 'quadratic':
return (1 + np.dot(x, z))**2
return np.dot(x, z)



## Results

The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.

Dataset 1

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

Dataset 2

Results with RBF Kernel

Results with quadratic Kernel

Dataset 3

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

• C is the setting of the soft-margin parameter C (default: 1.0)
• s (for the RBF kernel) is the scaling parameter s (default: 1.0)

Dataset 4

Results with RBF Kernel

Results with quadratic Kernel

Dataset 5

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

## Results with Kernel SVM Classifier (sklearn)

The following code and the figures show the decision boundaries and the support vectors (datapoints with larger size) learnt with sklearn SVC.


from sklearn.svm import SVC
x = data[:,0:2]
y = data[:,2]
clf = SVC(kernel=kernel, C=C, kernel = 'rbf', gamma=1.0/(s*s))
clf.fit(x,y)
clf.support_



Dataset 1

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 2

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 3

With RBF kernel (C=10, σ = 10)

## 4. Models for handwritten digit classification

This problem is taken from a few assignments from the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta). The problem description is taken from the course itself.

In this assignment we will build a few classifiers that take an image of a handwritten digit and outputs a label 0-9. We will start with a particularly simple strategy for this problem known as the nearest neighbor classifier, then a Gaussian generative model for classification will be built and finally an SVM model will be used for classification.

### The MNIST dataset

MNIST is a classic dataset in machine learning, consisting of 28×28 gray-scale images handwritten digits. The original training set contains 60,000 examples and the test set contains 10,000 examples. Here we will be working with a subset of this data: a training set of 7,500 examples and a test set of 1,000 examples.

The following figure shows the first 25 digits from the training dataset along with the labels.

Similarly, the following figure shows the first 25 digits from the test dataset along with the ground truth labels.

## Nearest neighbor for handwritten digit recognition

### Squared Euclidean distance

To compute nearest neighbors in our data set, we need to first be able to compute distances between data points. A natural distance function is Euclidean distance: for two vectors x∈ ℝ^d, their Euclidean distance is defined as

Often we omit the square root, and simply compute squared Euclidean distance.

For the purposes of nearest neighbor computations, the two are equivalent: for three vectors xy∈ ℝ^d, we have xyxz if and only if xy∥^2xz∥^2.

The following python function squared Euclidean distance.


## Computes squared Euclidean distance between two vectors.
def squared_dist(x,y):
return np.sum(np.square(x-y)



### Computing nearest neighbors

Now that we have a distance function defined, we can now turn to (1-) nearest neighbor classification, with the following naive implementation with 0 training / pre-processing time.

## Takes a vector x and returns the index of its nearest neighbor in train_data
def find_NN(x):
# Compute distances from x to every row in train_data
distances = [squared_dist(x,train_data[i,]) for i in range(len(train_labels))]
# Get the index of the smallest distance
return np.argmin(distances)

## Takes a vector x and returns the class of its nearest neighbor in train_data
def NN_classifier(x):
# Get the index of the the nearest neighbor
index = find_NN(x)
# Return its class
return train_labels[index]


The following figure shows a test example correctly classified by finding the nearest training example and another incorrectly classified.

### Processing the full test set

Now let’s apply our nearest neighbor classifier over the full data set.

Note that to classify each test point, our code takes a full pass over each of the 7500 training examples. Thus we should not expect testing to be very fast.


## Predict on each test data point (and time it!)
t_before = time.time()
test_predictions = [NN_classifier(test_data[i,]) for i in range(len(test_labels))]
t_after = time.time()

## Compute the error
err_positions = np.not_equal(test_predictions, test_labels)
error = float(np.sum(err_positions))/len(test_labels)

print("Error of nearest neighbor classifier: ", error)
print("Classification time (seconds): ", t_after - t_before)



(‘Error of nearest neighbor classifier: ‘, 0.046)
(‘Classification time (seconds): ‘, 41.04900002479553)

The next figure shows the confusion matrix for classification

### Faster nearest neighbor methods

Performing nearest neighbor classification in the way we have presented requires a full pass through the training set in order to classify a single point. If there are N training points in ℝ^d, this takes O(Nd) time.

Fortunately, there are faster methods to perform nearest neighbor look up if we are willing to spend some time pre-processing the training set. scikit-learnhas fast implementations of two useful nearest neighbor data structures: the ball tree and the k-d tree.


from sklearn.neighbors import BallTree

## Build nearest neighbor structure on training data
t_before = time.time()
ball_tree = BallTree(train_data)
t_after = time.time()

## Compute training time
t_training = t_after - t_before
print("Time to build data structure (seconds): ", t_training)

## Get nearest neighbor predictions on testing data
t_before = time.time()
test_neighbors = np.squeeze(ball_tree.query(test_data, k=1, return_distance=False))
ball_tree_predictions = train_labels[test_neighbors]
t_after = time.time()

## Compute testing time
t_testing = t_after - t_before
print("Time to classify test set (seconds): ", t_testing)



(‘Time to build data structure (seconds): ‘, 0.3269999027252197)
(‘Time to classify test set (seconds): ‘, 6.457000017166138)

similarly, with the KdTree data structure we have the following runtime:

(‘Time to build data structure (seconds): ‘, 0.2889997959136963)
(‘Time to classify test set (seconds): ‘, 7.982000112533569)

Next let’s use sklearn’s KNeighborsClassifier to compare with the runtimes.


from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(train_data, train_labels)
predictions = neigh.predict(test_data)



(‘Training Time (seconds): ‘, 0.2999999523162842)
(‘Time to classify test set (seconds): ‘, 8.273000001907349)

The next figure shows the error rate on the test dataset with k-NearestNeighbor classifier with different values of k.

Training the 1-NN classifier on the entire training dataset with 60k images and testing on the entire testset with 10k images yields the the following results:

(‘Training Time (seconds): ‘, 19.694000005722046)
(‘Time to classify test set (seconds): ‘, 707.7590000629425)

with the following accuracy on the test dataset and the confusion matrix:

accuracy: 0.9691 (error 3.09%)

## Gaussian generative models for handwritten digit classification

Recall that the 1-NN classifier yielded a 3.09% test error rate on the MNIST data set of handwritten digits. We will now see that a Gaussian generative model does almost as well, while being significantly faster and more compact.

For this assignment we shall be using the entire MNIST dataset, the training dataset contains 60k images and the test dataset contains 10k images.

### Fit a Gaussian generative model to the training data

The following figure taken from the lecture videos from the same course describes the basic theory.

Let’s Define a function, fit_generative_model, that takes as input a training set (data x and labels y) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, we have:

• pi[j]: the frequency of that label
• mu[j]: the 784-dimensional mean vector
• sigma[j]: the 784×784 covariance matrix

This means that pi is 10×1, mu is 10×784, and sigma is 10x784x784.

We need to fit a Gaussian generative model. The parameters pi, mu and sigma are computed with corresponding maximum likelihood estimates (MLE) values: empirical count, mean and covariance matrix for each of the class labels from the data. However, now there is an added ingredient.

The empirical covariances are very likely to be singular (or close to singular), which means that we won’t be able to do calculations with them. Thus it is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. (To put it another way, we compute the empirical covariances and then increase their diagonal entries by some constant c).

This modification is guaranteed to yield covariance matrices that are non-singular, for any c > 0, no matter how small. But this doesn’t mean that we should make c as small as possible. Indeed, c is now a parameter, and by setting it appropriately, we can improve the performance of the model. We will study regularization in greater detail over the coming weeks.

The following python code snippet shows the function:


def fit_generative_model(x,y):
k = 10 # labels 0,1,...,k-1
d = (x.shape)[1] # number of features
mu = np.zeros((k,d))
sigma = np.zeros((k,d,d))
pi = np.zeros(k)
c = 3500 # regularizer
for label in range(k):
indices = (y == label)
pi[label] = ... # empirical count
mu[label] = ... # empirical mean
sigma[label] = ... # empirical regularized covariance matrix
return mu, sigma, pi



Now let”s visualize the means of the Gaussians for the digits.

Time taken to fit the generative model (in seconds) : 2.60100007057

### Make predictions on test data

Now let’s see how many errors the generative model makes on the test set.
The model makes 438 errors out of 10000 (test accuracy: 95.562%)
Time taken to classify the test data (in seconds): 19.5959999561

The following figure shows the confusion matrix.

## SVM for handwritten digit classification

The entire training dataset from the MNIST dataset  is used to train the SVM model, the training dataset contains 60k images and the test dataset contains 10k images.

First let’s try linear SVM, the following python code:


from sklearn.svm import LinearSVC
clf = LinearSVC(C=C, loss='hinge')
clf.fit(train_data,train_labels)
score = clf.score(test_data,test_labels)



The following figure shows the training and test accuracies of LinearSVC with different values of the hyper-parameter C.

Next let’s try SVM with quadratic kernel, as can be seen it gives 98.06% accuracy on the test dataset with C=1.


from sklearn.svm import SVC
clf = SVC(C=1., kernel='poly', degree=2)
clf.fit(train_data,train_labels)
print clf.score(train_data,train_labels)
print clf.score(test_data,test_labels)



training accuracy: 1.0
test accuracy: 0.9806 (error: 1.94%)

The following figure shows the confusion matrix:

Advertisements

# Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

• First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
• Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
• Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
• The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

• Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
• The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
• Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
• The learning rate η is gradually decreased with iteration.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:


class SVMPegasos(LinearClassifier):
"""Implementation of SVM with SGD with PEGASOS Algorithm"""

def __init__(self, n_iter=10, lambda1=1):
self.n_iter = n_iter
self.lambda1 = lambda1

def fit(self, X, Y):
Y = list(Y)
self.find_classes(Y)
# convert all output values to +1 or -1
Yn = [sign(y, self.positive_class) for y in Y]
X = X.toarray()
m, n_features = X.shape[0], X.shape[1]
self.w = numpy.zeros( n_features )
for i in range(self.n_iter):
eta = 1. / (self.lambda1*(i+1))
j = numpy.random.choice(m, 1)[0]
x, y = X[j], Yn[j]
score = self.w.dot(x)
if y*score < 1:
self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
else:
self.w = (1 - eta*self.lambda1)*self.w



### Some Notes

• The optional projection step has been left out (the line in square brackets in the paper).
•  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
•  The number η is the step length in gradient descent.
• The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
• Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
• Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
• Classification accuracy obtained on the test split was 95%.




## 2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.

### Dataset 1

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
• Classification accuracy obtained on the test split was 95.8%.




### Dataset 2

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
• Classification accuracy obtained on the test split was 100%.




## 3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
• Classification accuracy obtained on the test split was 97.5%.




## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

• Create sklearn Pipeline to
• Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
• Use sklearn SelectKBest to choose top 5000 features
• Train / test the model
• Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
• Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
• Also Record the time taken to train the classifier.
• Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.


def classify_with_svm():
# read all the documents
X, Y = read_corpus('all_sentiment_shuffled.txt')
# split into training and test parts
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
random_state=0)
classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
tokenizer = lambda x: x)),
('fs', SelectKBest(k=5000)),
('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
t0 = time.time()
classifier.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time:', t1-t0, 'seconds.')
Yguess = classifier.predict(Xtest)
print('accuracy:', accuracy_score(Ytest, Yguess))



### Some Notes

• The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
• This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
• The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
• For Perceptron the entire training data was passed 10 times to the model while training.
• For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.

## Results

As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.

# Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and python respectively) and application on a few datasets are going to be discussed.  The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

## SVM in a nutshell

• Given a (training) dataset consisting of positive and negative class instances.
• Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
• If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
• In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
• A slack variable per training point is introduced to include the classification errors (for the miss-classified points in the training dataset) in the objective, this can also be thought of adding regularization.
• The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
• It is easier to solve the optimization problem in the dual rather than the primal space, since there are less number of variables.
• Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
• If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensions only by computing the (kernel) gram matrix /  dot-product of the datapoints (the matrix needs to positive semi-definite as per Mercer’s theorem).
• Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
• The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-liear decision boundary, SVs correspond to the non-zero values of the dual variable / the primal Lagrange multipler), that’s why the name SVM.
• Once the dual optimization problem is solved ,  the values of the primal variables are computed to construct the hyper-plane / decision surface.
• Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
• The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.

The following figure describes the soft-margin SVM in a more formal way.

The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).

The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).

The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

library(quadprog)
library(Matrix)
linear.kernel <- function(x1, x2) {
return (x1%*%x2)
}
svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
n.samples <- nrow(X)
n.features <- ncol(X)
# Gram matrix
K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
for (i in 1:n.samples){
for (j in 1:n.samples){
K[i,j] <- FUN(X[i,], X[j,])
}
}
Dmat <- outer(y,y) * K
Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to nearest pd matrix dvec <- rep(1, n.samples) if (!is.null(C)) { # soft-margin Amat <- rbind(y, diag(n.samples), -1*diag(n.samples)) bvec <- c(0, rep(0, n.samples), rep(-C, n.samples)) } else { # hard-margin Amat <- rbind(y, diag(n.samples)) bvec <- c(0, rep(0, n.samples)) } res <- solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1) a = res$solution # Lagrange multipliers
# Support vectors have non-zero Lagrange multipliers
# ...
}


The following python code snippet adapted from here and from  Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

import numpy as np
import cvxopt
def fit(X, y, kernel, C):
n_samples, n_features = X.shape
# Compute the Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
for j in range(n_samples):
K[i,j] = kernel(X[i], X[j])
# construct P, q, A, b, G, h matrices for CVXOPT
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
if C is None:      # hard-margin SVM
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
else:              # soft-margin SVM
G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1), np.identity(n_samples))))
h = cvxopt.matrix(np.hstack((np.zeros(n_samples), np.ones(n_samples) * C)))
# solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
# Lagrange multipliers
a = np.ravel(solution['x'])
# Support vectors have non zero lagrange multipliers
sv = a > 1e-5 # some small threshold
# ...


## Notes

1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
3. Although python CVXOPT will allow very small numerical errors in P matrix with a warning message, R quardprog will strictly require that the Dmat matrix is strictly positive definite, otherwise it will fail.
4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
6. Cases corresponding to the hard and soft-margin SVMs must be handled separately, otherwise it will lead to inconsistent system of solutions.

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, the 80-20 Validation on the dataset is used to

• First fit (train) the model on randomly selected 80% samples of the dataset.
• Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
• Different values of the hyper-parameter C and different kernels are used.
• For the polynomial kernel, polynomial of degree 3 is used and the RBF kernel with the standard deviation of 5 is used, although these hyper-parameters can be tuned too.

### Results

As can be seen from the results below,

• The points with blue circles are the support vectors.
• When the C value is low (close to hard-margin SVM), the model learnt tends to overfit the training data.
• When the C value is high (close to soft-margin SVM), the model learnt tends to be more generalizable (C acts as a regularizer).
• There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
• The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
• The decision boundaries are also shown.

# Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

## The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

• The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
• The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
• The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.

The following code from the github repository of the same course shows how the basic bandit Framework can be defined:


import numpy as np
import sys

### Interface
class Environment(object):

def reset(self):
raise NotImplementedError('Inheriting classes must override reset.')

def actions(self):
raise NotImplementedError('Inheriting classes must override actions.')

def step(self):
raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):

def __init__(self, actions):
self.actions = actions
self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"):
super(BanditEnv, self).__init__()

self.action_space = ActionSpace(range(num_actions))
self.distribution = distribution

np.random.seed(evaluation_seed)

self.reward_parameters = None
if distribution == "bernoulli":
self.reward_parameters = np.random.rand(num_actions)
elif distribution == "normal":
self.reward_parameters = (np.random.randn(num_actions),
np.random.rand(num_actions))
elif distribution == "heavy-tail":
self.reward_parameters = np.random.rand(num_actions)
else:
print("Please use a supported reward distribution") #, flush = True)
sys.exit(0)

if distribution != "normal":
self.optimal_arm = np.argmax(self.reward_parameters)
else:
self.optimal_arm = np.argmax(self.reward_parameters[0])

def reset(self):
self.is_reset = True
return None

def compute_gap(self, action):
if self.distribution != "normal":
gap = np.absolute(self.reward_parameters[self.optimal_arm] -
self.reward_parameters[action])
else:
gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
self.reward_parameters[0][action])
return gap

def step(self, action):
self.is_reset = False

valid_action = True
if (action is None or action = self.action_space.n):
print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True)
reward = float("-inf")
gap = float("inf")
valid_action = False

if self.distribution == "bernoulli":
if valid_action:
reward = np.random.binomial(1, self.reward_parameters[action])
gap = self.reward_parameters[self.optimal_arm] -
self.reward_parameters[action]
elif self.distribution == "normal":
if valid_action:
reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
elif self.distribution == "heavy-tail":
if valid_action:
reward = self.reward_parameters[action] + np.random.standard_cauchy()
gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap
else:
print("Please use a supported reward distribution")#, flush = True)
sys.exit(0)

return(None, reward, self.is_reset, '')

#Policy interface
class Policy:
#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
def __init__(self, num_actions):
self.num_actions = num_actions

def act(self):
pass

def feedback(self, action, reward):
pass



Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.

• The next figure shows the reward distributions for a 5-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.

## Round-Robin

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets.
3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).

ε = 0

ε = 0.05

ε = 0.1

ε = 0.15

ε = 1.0

## Optimistic Greedy

R = 0

R = 1

R = 3

R = 5

R = 10000

The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.

# Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.

## Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:

• The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
• The 4-grams are chosen so that all the words involved come from a small
vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
• Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
data.vocab
ans =
{
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
.
.
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left
}

• The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
• Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

• The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
• Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
train_x(:,13,14)
%ans =
%46
%58
%32
data.vocab{train_x(:,13,14)}
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
train_t(:,13,14)
%ans = 91
data.vocab{train_t(:,13,14)}
%ans = we


• The validation and test data are also similar. They contain 46,568 4-grams each.
• Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
• The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
• First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
• At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
• The training method will output a ‘model’ (weight matrices, biases for each layer in the network).

## Description of the Network

• As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
• The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
• This layer is connected to the hidden layer, which in turn is connected to the output layer.
• The output layer is a softmax over the 250 words.
• The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.

## Forward Propagation

• The forward propagation is pretty straight-forward and can be implemented as shown in the following code:

function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
% input_batch: The input data as a matrix of size numwords X batchsize where,
% numwords is the number of words, batchsize is the number of data points.
% So, if input_batch(i, j) = k then the ith word in data point j is word
% index k of the vocabulary.
%
% word_embedding_weights: Word embedding as a matrix of size
% vocab_size X numhid1, where vocab_size is the size of the vocabulary
% numhid1 is the dimensionality of the embedding space.
%
% embed_to_hid_weights: Weights between the word embedding layer and hidden
% layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
% number of hidden units.
%
% hid_to_output_weights: Weights between the hidden layer and output softmax
% unit as a matrix of size numhid2 X vocab_size
%
% hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
% output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
% embedding_layer_state: State of units in the embedding layer as a matrix of
% size numhid1*numwords X batchsize
%
% hidden_layer_state: State of units in the hidden layer as a matrix of size
% numhid2 X batchsize
%
% output_layer_state: State of units in the output layer as a matrix of size
% vocab_size X batchsize
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the inputs word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
word_embedding_weights(reshape(input_batch, 1, []),:)',...
numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting maximum to
% make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
% exponents.
inputs_to_softmax = inputs_to_softmax...
- repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
sum(output_layer_state, 1), vocab_size, 1);



## Back-Propagation

•  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.

• As the model trains it prints out some numbers that tell how well the training is going.
• The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
• After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
• The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
• At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.

# Some Applications

## 1. Predict next word

• Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
• The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
• It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.

## 2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

1. Given 3 words to start from, initialize the text with those 3 words.
2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:


function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
end
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');

end



Starting with the words  'i was going‘, here are some texts that were generated using the model:

Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:

## 3. Find nearest words

•  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
• As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
• For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.

## 4. Visualization in 2-dimension with t-SNE

•  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
• As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
• Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).

## 5. Solving Word-Analogy Problem

•  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
• The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.

• The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.

analogy('year', 'years', 'day', model); % singular-plural relation
%year:years::day:days
%dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186

analogy('on', 'off', 'new', model) % antonyms relation
%on:off::new:old
%dist_E('on','off')=2.013958, dist_E('new','old')=2.265665

analogy('use', 'used', 'do', model) % present-past relation
%use:used::do:did
%dist_E('use','used')=2.556175, dist_E('do','did')=2.456098

analogy('he', 'his', 'they', model) % pronoun-relations
%he:his::they:their
%dist_E('he','his')=3.824808, dist_E('they','their')=3.825453

analogy('today', 'yesterday', 'now', model)
%today:yesterday::now:then
%dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


## Model Selection

• Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
• The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
• The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.

# Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the coursera course Convolution Networks which is a part of the Deep Learning Specialization (taught by Prof. Andrew Ng.,  from Stanford and deeplearning.ai, the lecture videos corresponding to the YOLO algorithm can be found here).  The problem description is taken straightaway from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

## Some Theory

Let’s first clear the concepts regarding classification, localization, detection and how the object detection problem can be transformed to supervised machine learning problem and subsequently can be solved using a deep convolution neural network. As can be seen from the next figure,

• Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether there is a car in an image or not), but also finding a bounding box around the object, if one found.
• Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
• The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
• A typical output data vector will contain 8 entries for a 4-class classification, as shown in the next figure, the first entry will correspond to whether or not an object of any from the 3 classes of objects. In case one is present in an image, the next 4 entries will define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. In case none of the objects are present, the first entry will be 0 and the others will be ignored.

• Now moving from localization to detection, one can proceed in two steps as shown below in the next figure: first use small tightly cropped images to train a convolution neural net for image classification and then use sliding windows of different window sizes (smaller to larger) to classify a test image within that window using the convnet learnt and run the windows sequentially through the entire image, but it’s infeasibly slow computationally.
• However, as shown in the next figure, the convolutional implementation of the sliding windows by replacing the fully-connected layers by 1×1 filters makes it possible to simultaneously classify the image-subset inside all possible sliding windows parallelly, making it much more efficient computationally.

• The convolutional sliding windows, although computationally much more efficient, still has the problem of detecting the accurate bounding boxes, since the boxes don’t align with the sliding windows and the object shapes also tend to be different.
• YOLO algorithm overcomes this limitation by dividing a training image into grids and assigning an object to a grid if and only if the center of the object falls inside the grid, that way each object in a training image can get assigned to exactly one grid and then the corresponding bounding box is represented by the coordinates relative to the grid. The next figure described the details of the algorithm.
• In the test images, multiple adjacent grids may think that an object actually belongs to them, in order to resolve the iou (intersection of union) measure is used to find the maximum overlap and the non-maximum-suppression algorithm is used to discard all the other bounding boxes with low-confidence of containing an object, keeping the one with the highest confidence among the competing ones and discard the others.
• Still there is a problem of multiple objects falling in the same grid. Multiple anchor boxes (of different shapes) are used to resolve the problem, each anchor box of a particular shape being likely to eventually detect  an object of a particular shape.

The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection in the CVPR 2016 summarizing the algorithm:

## Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank drive.ai for providing this dataset! Drive.ai is a company building the brains of self-driving vehicles.

We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

### Definition of a box

If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.

## YOLO

YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

### Model details

First things to know:

• The input is a batch of images of shape (m, 608, 608, 3).
• The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

### Encoding architecture for YOLO

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two last dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

### Flattening the last two last dimensions

Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

### Find the class detected by each box

Here’s one way to visualize what YOLO is predicting on an image:

• For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
• Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

• Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
• Select only one box when several boxes overlap with each other and detect the same object.

## Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

• box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
• boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
• box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

• Compute box scores by doing the element-wise product as described in the above figure.
• For each box, find:
• the index of the class with the maximum box score.
• the corresponding box score.
• Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
• Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body




def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
"""Filters YOLO boxes by thresholding on object and class confidence.

Arguments:
box_confidence -- tensor of shape (19, 19, 5, 1)
boxes -- tensor of shape (19, 19, 5, 4)
box_class_probs -- tensor of shape (19, 19, 5, 80)
threshold -- real value, if [ highest class probability score = threshold)

# Step 4: Apply the mask to scores, boxes and classes

return scores, boxes, classes



### Non-max suppression

Even after filtering by thresholding over the classes scores, we still end up a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

n this example, the model has predicted 3 cars, but it’s actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

### Definition of “Intersection over Union”

Exercise: Implement iou(). Some hints:

• In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
• To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
• We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
xi1 = maximum of the x1 coordinates of the two boxes
yi1 = maximum of the y1 coordinates of the two boxes
xi2 = minimum of the x2 coordinates of the two boxes
yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.


def iou(box1, box2):
"""Implement the intersection over union (IoU) between box1 and box2

Arguments:
box1 -- first box, list object with coordinates (x1, y1, x2, y2)
box2 -- second box, list object with coordinates (x1, y1, x2, y2)
"""

# Calculate the (y1, x1, y2, x2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou



We are now ready to implement non-max suppression. The key steps are:

• Select the box that has the highest score.
• Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
• Go back to step 1 and iterate until there’s no more boxes with a lower score than the current selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so we don’t actually need to use your iou() implementation):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
"""
Applies Non-max suppression (NMS) to set of boxes

Arguments:
scores -- tensor of shape (None,), output of yolo_filter_boxes()
boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
classes -- tensor of shape (None,), output of yolo_filter_boxes()
max_boxes -- integer, maximum number of predicted boxes you'd like
iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

Returns:
scores -- tensor of shape (, None), predicted score for each box
boxes -- tensor of shape (4, None), predicted box coordinates
classes -- tensor of shape (, None), predicted class for each box

Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
function will transpose the shapes of scores, boxes, classes. This is made for convenience.
"""

max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

# Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

# Use K.gather() to select only nms_indices from scores, boxes and classes

return scores, boxes, classes



### Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing this data on a different size image – for example, the car detection dataset had 720×1280 images – his step rescales the boxes so that they can be plotted on top of the original 720×1280 image.


def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
"""
Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

Arguments:
yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
box_confidence: tensor of shape (None, 19, 19, 5, 1)
box_xy: tensor of shape (None, 19, 19, 5, 2)
box_wh: tensor of shape (None, 19, 19, 5, 2)
box_class_probs: tensor of shape (None, 19, 19, 5, 80)
image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
max_boxes -- integer, maximum number of predicted boxes you'd like
score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

Returns:
scores -- tensor of shape (None, ), predicted score for each box
boxes -- tensor of shape (None, 4), predicted box coordinates
classes -- tensor of shape (None,), predicted class for each box
"""

# Retrieve outputs of the YOLO model

# Convert boxes to be ready for filtering functions

# Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

# Scale boxes back to original image shape.

# Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold

return scores, boxes, classes



### Summary for YOLO:

• Input image (608, 608, 3)
• The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
• After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
• Each cell in a 19×19 grid over the input image gives 425 numbers.
• 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
• 85 = 5 + 80 where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and and 80 is the number of classes we’d like to detect.
• We then select only few boxes based on:
• Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
• Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
• This gives us YOLO’s final output.

### Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to start your graph. Run the following cell.

sess = K.get_session()

### Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes(“coco_classes.txt”)
anchors = read_anchors(“yolo_anchors.txt”)
image_shape = (720., 1280.)

### Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model(“yolo.h5”)

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.

yolo_model.summary()

____________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
===========================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
____________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
____________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
____________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
____________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
____________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
____________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
____________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
____________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
____________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
____________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
____________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
____________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
____________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
____________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
____________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
____________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
____________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
____________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
____________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
____________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
____________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
____________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
____________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
____________________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
____________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
____________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
____________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
____________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
____________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
____________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
____________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
____________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
____________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
____________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
____________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
____________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
____________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
____________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
____________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
____________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
____________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
____________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
____________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
____________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
____________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
____________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
____________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
____________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
____________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
____________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
____________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
____________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
____________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
____________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
____________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
____________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
____________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
____________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
____________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
____________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
____________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
____________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
____________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
____________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
____________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
____________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
____________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
____________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
____________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
____________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
____________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
===========================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
____________________________________________________________________________________________

Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

### Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We added yolo_outputs to your graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

### Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Lets now call yolo_eval, which you had previously implemented, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

### Run the graph on an image

Let the fun begin. We have created a (sess) graph that can be summarized as follows:

1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image(“images/” + image_file, model_image_size = (608, 608))

which outputs:

• image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
• image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.


def predict(sess, image_file):
"""
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the preditions.

Arguments:
sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.

Returns:
out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes

Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
"""

# Preprocess your image

# Run the session with the correct tensors and choose the correct placeholders in the
# feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

# Print predictions info
print('Found {} boxes for {}'.format(len(out_boxes), image_file))
# Generate colors for drawing bounding boxes.
colors = generate_colors(class_names)
# Draw bounding boxes on the image file
draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
# Save the predicted bounding box on the image
image.save(os.path.join("out", image_file), quality=90)
# Display the results
output_image = scipy.misc.imread(os.path.join("out", image_file))
imshow(output_image)

return out_scores, out_boxes, out_classes



Let’s Run the following cell on the following “test.jpg” image to verify that our function is correct.

Input

out_scores, out_boxes, out_classes = predict(sess, “test.jpg”)

The following figure shows the output after car detection. Each of the bounding boxes have the name of the object detected on the top left along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)

The following animation shows the output Images with detected objects (cars) using YOLO for a set of input images.

What we should remember:

• YOLO is a state-of-the-art object detection model that is fast and accurate.
• It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
• The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
• You filter through all the boxes using non-max suppression. Specifically:
Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
• Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
• Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise.

References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
4. The official YOLO website .

Car detection dataset: Creative Commons License.

The Drive.ai Sample Dataset (provided by drive.ai) is licensed under a Creative Commons Attribution 4.0 International License.

# Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.

## Problem Statement

#### Single-Scale Optical Flow

• Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

def optical_flow(I1, I2, window_size, tau) # returns (u, v)

• Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
• In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

• We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
• The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.

• The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

kernel_x = np.array([[-1., 1.], [-1., 1.]])
kernel_y = np.array([[-1., -1.], [1., 1.]])
kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
I1g = I1g / 255. # normalize pixels
I2g = I2g / 255. # normalize pixels
# Implement Lucas Kanade
# for each point, calculate I_x, I_y, I_t
mode = 'same'
fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
u = np.zeros(I1g.shape)
v = np.zeros(I1g.shape)
# within window window_size * window_size
for i in range(w, I1g.shape[0]-w):
for j in range(w, I1g.shape[1]-w):
Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
#b = ... # get b here
#A = ... # get A here
# if threshold τ is larger than the smallest eigenvalue of A'A:
nu = ... # get velocity here
u[i,j]=nu[0]
v[i,j]=nu[1]

return (u,v)



## Some Results

• The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
• As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
• Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences

Output Optical Flow with different window sizes

window size = 15

window size = 21

Input Sequences

Output Optical Flow

Input Sequences (hamburg taxi)

Output Optical Flow

Input Sequences

Output Optical Flow

Input Sequences

Output Optical Flow

Input Sequences

Output Optical Flow

Input Sequences

Output Optical Flow

Input Sequences

Output Optical Flow

Input Sequences
Output Optical Flow

Input Sequences

Output Optical Flow

Output Optical Flow

Input Sequences

Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25

Output Optical Flow with window size 45

# Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described, this algorithm was proposed by Felzenszwalb et. al. from MIT in this paper.  The slides on this paper can be found from this link from the Stanford Vision Lab too. The algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, as stated by the author and hence can  be
implemented to run in O(m log m) time, where m is the number of edges in the graph.

## Problem Definition and the basic idea (from the paper)

• Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

• In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).

• Particularly for the implementation described here, an edge weight function based on the absolute intensity difference (in the yiq space) between the pixels connected by an edge, w((vi, vj )) = |I(pi) − I(pj )|.

• In the graph-based approach, a segmentation S is a partition of V into components
such that each component (or region) C ∈ S corresponds to a connected component
in a graph G0 = (V, E0), where E0 ⊆ E.

• In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.

• This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.

• The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.

• The threshold function τ controls the degree to which the difference between two
components must be greater than their internal differences in order for there to be
evidence of a boundary between them.

• For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ (C) = k/|C| is needed to be usedwhere |C| denotes the size of C, and k is some constant parameter.

• That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.

• In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.

• The following python code shows how to create the graph.

import numpy as np
from scipy import signal
import matplotlib.image as mpimg

def gaussian_kernel(k, s = 0.5):
# generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
# create the pixel graph with edge weights as dissimilarities
rgb = mpimg.imread(imfile)[:,:,:3]
gauss_kernel = gaussian_kernel(sz, sigma)
for i in range(3):
rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
yuv = rgb2yiq(rgb)
(w, h) = yuv.shape[:2]
edges = {}
for i in range(yuv.shape[0]):
for j in range(yuv.shape[1]):
#compute edge weight for nbd pixel nodes for the node i,j
for i1 in range(i-1, i+2):
for j1 in range(j-1, j+2):
if i1 == i and j1 == j: continue

if i1 >= 0 and i1 = 0 and j1 < h:
wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
edges[n1, n2] = edges[n2, n1] = wt
return edges


## Some Results

• The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

• Although in the paper the author described the best value of the parameter k to be around 300, but  since in this implementation the pixel RGB values are normalized (to have values in between 0 – 1) and then converted to YIQ values and the YIQ intensities are used for computing the weights (which are typically very small), the value of k that works best in this scenario is 0.001-0.01.

• As we can see from the below results, higher the value of the parameter k, larger the size of the final component and lesser the number of components in the result.

• The minimum spanning tree creation is also shown, the red edges shown in the figures are the edges chosen by the algorithm to merge the components.

## Input Image

Output Images for two different values of the parameter k

## Input Image

Output Images for two different values of the parameter k

## Input Image

Output Segmented Images

## Input Image

Output Images for two different values of the parameter k

## Input Image

Segmented Output Image

# Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed. and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boycov and Jolli in this paper. This problem appeared as a homework assignment here., and also in this lecture video from the Coursera image processing course by Duke university.

## Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can be difficult sometimes to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines using which our algorithm is going to identify the foreground and the background, after that it will be the algorithms job to obtain a complete segmentation of the foreground from the background image.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image

## The Graph-Cut Algorithm

The following describes how the segmentation problem is transformed into a graph-cut problem:

1. Let’s first define the Directed Graph G = (V, E) as follows:

1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

2. Next, let’s defines the edges of the graph. As obvious, there is going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).

3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.

4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.

2. Now let’s describe how to compute the edge weights in this graph.

1. In order to compute the terminal edge weights, we need to estimate the feature distributions first, i.e., starting with the assumption that each of the nodes corresponding to the scribbled pixels have the probability 1.0 (since we want the solution to respect the regional hard constraints marked by the user-seeds / scribbles) to be in foreground or background object in the image (distinguished by the scribble color, e.g.), we have to compute the probability that a node belongs to the foreground (or background) for all the other non-terminal nodes.

2. The simplest way to compute $P_F$ and $P_B$ is to first fit a couple of  Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
with MLE from the scribbled pixel intensities and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels as shown in the next figures. The following code fragment show how the pdfs are being computed.

import numpy as np
from collections import defaultdict

def compute_pdfs(imfile, imfile_scrib):
'''
# Compute foreground and background pdfs
# input image and the image with user scribbles
'''
rgb = mpimg.imread(imfile)[:,:,:3]
yuv = rgb2yiq(rgb)
rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
yuv_s = rgb2yiq(rgb_s)
# find the scribble pixels
scribbles = find_marked_locations(rgb, rgb_s)
imageo = np.zeros(yuv.shape)
# separately store background and foreground scribble pixels in the dictionary comps
comps = defaultdict(lambda:np.array([]).reshape(0,3))
for (i, j) in scribbles:
imageo[i,j,:] = rgbs[i,j,:]
# scribble color as key of comps
comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
mu, Sigma = {}, {}
# compute MLE parameters for Gaussians
for c in comps:
mu[c] = np.mean(comps[c], axis=0)
Sigma[c] = np.cov(comps[c].T)
return (mu, Sigma)


3. In order to compute the non-terminal edge weights, we need to compute the similarities in between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (e.g., how similar neighborhood pixels are in RGB / YIQ space).

3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.

4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).

5. There are several standard algorithms, e.g., Ford-Fulkerson (by finding an augmenting path with O(E max| f |) time complexity) or Edmonds-Karp (by using bfs to find the shortest augmenting path, with O(VE2) time complexity) to solve the max-flow problem, typical implementations of these algorithms run pretty fast, in polynomial time in V, E.  Here we are going to use a different implementation (with pymaxflow) based on Vladimir Kolmogorov, which is shown to run faster on many images empirically in this paper.

## Results

The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

A Tiny Sub-graph with Min-Cut

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

A Tiny Sub-graph with Min-Cut

Input Image (liver)

Input Image with Scribbles

Fitted Densities from Color Scribbles

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image                                                         Input Image with Scribbles

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

Input Image

Input Image with Scribbles

Fitted Densities from Color Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image (UMBC Campus Map)

Input Image with Scribbles

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)

## Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste

The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Image

New Background

# Image Colorization Using Optimization in Python

This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.

## The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.

## The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote $\left(x,y,t \right)$ triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different.

When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes $x^{T}Ax$ and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is $x^{T}(D-W)x =J(x)$. Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles

## Implementation of the Algorithm

Here are the the steps for the algorithm:

1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
4. The optimization problem finally boils down to solving the system of linear equations of the form $WU = b$ that has a closed form least-square solution
$U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b$. Same thing is to be repeated for the V channel too.
5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

(w, h) = Y.shape
W = defaultdict()
for i in range(w):
for j in range(h):
if not (i, j) in colored: # pixel i,j in color scribble
(N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
Z = 0.
id1 = ij2id(i,j,w,h) # linearized id
for (i1, j1) in N: #s = (i1, j1)
id2 = ij2id(i1,j1,w,h)
W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
Z += W[id1,id2]
if Z > 0:
for (i1, j1) in N: #s = (i1, j1)
id2 = ij2id(i1,j1,w,h)
W[id1,id2] /= -Z
for i in range(w):
for j in range(h):
id = ij2id(i,j,w,h)
W[id,id] = 1.

rows, cols = zip(*(W.keys())) #keys
data = W.values() #[W[k] for k in keys]
return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



## Results

The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input

As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked

Color image Output