Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

  • First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
  • Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
  • Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
  • The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:


or as the following if the explicit bias term is discarded:


where the 0-1 loss is approximated by the hinge-loss.


  • Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
  • The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
  • Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
  • The learning rate η is gradually decreased with iteration.


The following figure shows a simplified version of the algorithm:




The following python code implements the algorithm:

class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
       self.n_iter = n_iter
       self.lambda1 = lambda1

    def fit(self, X, Y):
       Y = list(Y)
       # convert all output values to +1 or -1
       Yn = [sign(y, self.positive_class) for y in Y]
       X = X.toarray()
       m, n_features = X.shape[0], X.shape[1]
       self.w = numpy.zeros( n_features )
       for i in range(self.n_iter):
           eta = 1. / (self.lambda1*(i+1))
           j = numpy.random.choice(m, 1)[0]
           x, y = X[j], Yn[j]
           score =
           if y*score < 1:
              self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
              self.w = (1 - eta*self.lambda1)*self.w

Some Notes

  • The optional projection step has been left out (the line in square brackets in the paper).
  •  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
  •  The number η is the step length in gradient descent.
  • The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
  • Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
  • Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
  • Classification accuracy obtained on the test split was 95%.


2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.


Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1 flame

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
  • Classification accuracy obtained on the test split was 95.8%.


Dataset 2

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
  • Classification accuracy obtained on the test split was 100%.


3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
  • Classification accuracy obtained on the test split was 97.5%.


4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.


The next figures show the distribution of the word lengths for the texts for different sentiments.



  • Create sklearn Pipeline to
    • Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
    • Use sklearn SelectKBest to choose top 5000 features
    • Train / test the model
  • Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
  • Also Record the time taken to train the classifier.
  • Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.


The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
    classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
    tokenizer = lambda x: x)),
    ('fs', SelectKBest(k=5000)),
    ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
    t0 = time.time(), Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))


Some Notes

  • The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
  • This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
  • The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
  • For Perceptron the entire training data was passed 10 times to the model while training.
  • For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.


As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.





Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and python respectively) and application on a few datasets are going to be discussed.  The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

SVM in a nutshell

  • Given a (training) dataset consisting of positive and negative class instances.
  • Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
  • If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
  • In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
  • A slack variable per training point is introduced to include the classification errors (for the miss-classified points in the training dataset) in the objective, this can also be thought of adding regularization.
  • The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
  • It is easier to solve the optimization problem in the dual rather than the primal space, since there are less number of variables.
  • Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
  • If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensions only by computing the (kernel) gram matrix /  dot-product of the datapoints (the matrix needs to positive semi-definite as per Mercer’s theorem).
  • Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
  • The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-liear decision boundary, SVs correspond to the non-zero values of the dual variable / the primal Lagrange multipler), that’s why the name SVM.
  • Once the dual optimization problem is solved ,  the values of the primal variables are computed to construct the hyper-plane / decision surface.
  • Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
  • The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.


The following figure describes the soft-margin SVM in a more formal way.


The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).


The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).


The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

linear.kernel <- function(x1, x2) {
 return (x1%*%x2)
} <- function(X, y, FUN=linear.kernel, C=NULL) {
 n.samples <- nrow(X)
 n.features <- ncol(X)
 # Gram matrix
 K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
 for (i in 1:n.samples){
  for (j in 1:n.samples){
   K[i,j] <- FUN(X[i,], X[j,])
 Dmat <- outer(y,y) * K
 Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to nearest pd matrix
 dvec <- rep(1, n.samples)
 if (!is.null(C)) { # soft-margin
  Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
  bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
 } else {           # hard-margin
  Amat <- rbind(y, diag(n.samples))
  bvec <- c(0, rep(0, n.samples))
 res <- solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1)
 a = res$solution # Lagrange multipliers
 # Support vectors have non-zero Lagrange multipliers
 # ...




The following python code snippet adapted from here and from  Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

import numpy as np
import cvxopt
def fit(X, y, kernel, C):
    n_samples, n_features = X.shape
    # Compute the Gram matrix
    K = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
       for j in range(n_samples):
           K[i,j] = kernel(X[i], X[j])
    # construct P, q, A, b, G, h matrices for CVXOPT
    P = cvxopt.matrix(np.outer(y,y) * K)
    q = cvxopt.matrix(np.ones(n_samples) * -1)
    A = cvxopt.matrix(y, (1,n_samples))
    b = cvxopt.matrix(0.0)
    if C is None:      # hard-margin SVM
       G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
       h = cvxopt.matrix(np.zeros(n_samples))
    else:              # soft-margin SVM
       G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1), np.identity(n_samples))))
       h = cvxopt.matrix(np.hstack((np.zeros(n_samples), np.ones(n_samples) * C)))
    # solve QP problem
    solution = cvxopt.solvers.qp(P, q, G, h, A, b)
    # Lagrange multipliers
    a = np.ravel(solution['x'])
    # Support vectors have non zero lagrange multipliers
    sv = a > 1e-5 # some small threshold
    # ...



  1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
  2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
  3. Although python CVXOPT will allow very small numerical errors in P matrix with a warning message, R quardprog will strictly require that the Dmat matrix is strictly positive definite, otherwise it will fail.
  4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
  5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
  6. Cases corresponding to the hard and soft-margin SVMs must be handled separately, otherwise it will lead to inconsistent system of solutions.


Using the SVM implementations for classification on some datasets

The datasets



For each dataset, the 80-20 Validation on the dataset is used to

  • First fit (train) the model on randomly selected 80% samples of the dataset.
  • Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
  • Different values of the hyper-parameter C and different kernels are used.
  • For the polynomial kernel, polynomial of degree 3 is used and the RBF kernel with the standard deviation of 5 is used, although these hyper-parameters can be tuned too.



As can be seen from the results below,

  • The points with blue circles are the support vectors.
  • When the C value is low (close to hard-margin SVM), the model learnt tends to overfit the training data.
  • When the C value is high (close to soft-margin SVM), the model learnt tends to be more generalizable (C acts as a regularizer).
  • There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
  • The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
  • The decision boundaries are also shown.


With the Python (CVXOPT) implementation






With the R (quadprog) implementation












Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

  • The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
  • The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
  • The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
    the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.


The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

import numpy as np
import sys

### Interface
class Environment(object):

def reset(self):
raise NotImplementedError('Inheriting classes must override reset.')

def actions(self):
raise NotImplementedError('Inheriting classes must override actions.')

def step(self):
raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):

def __init__(self, actions):
self.actions = actions
self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"):
super(BanditEnv, self).__init__()

self.action_space = ActionSpace(range(num_actions))
self.distribution = distribution


self.reward_parameters = None
if distribution == "bernoulli":
self.reward_parameters = np.random.rand(num_actions)
elif distribution == "normal":
self.reward_parameters = (np.random.randn(num_actions),
elif distribution == "heavy-tail":
self.reward_parameters = np.random.rand(num_actions)
print("Please use a supported reward distribution") #, flush = True)

if distribution != "normal":
self.optimal_arm = np.argmax(self.reward_parameters)
self.optimal_arm = np.argmax(self.reward_parameters[0])

def reset(self):
self.is_reset = True
return None

def compute_gap(self, action):
if self.distribution != "normal":
gap = np.absolute(self.reward_parameters[self.optimal_arm] -
gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
return gap

def step(self, action):
self.is_reset = False

valid_action = True
if (action is None or action = self.action_space.n):
print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True)
reward = float("-inf")
gap = float("inf")
valid_action = False

if self.distribution == "bernoulli":
if valid_action:
reward = np.random.binomial(1, self.reward_parameters[action])
gap = self.reward_parameters[self.optimal_arm] -
elif self.distribution == "normal":
if valid_action:
reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
elif self.distribution == "heavy-tail":
if valid_action:
reward = self.reward_parameters[action] + np.random.standard_cauchy()
gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap
print("Please use a supported reward distribution")#, flush = True)

return(None, reward, self.is_reset, '')

#Policy interface
class Policy:
#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
def __init__(self, num_actions):
self.num_actions = num_actions

def act(self):

def feedback(self, action, reward):

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.



  • The next figure shows the reward distributions for a 5-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
  2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
  3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
  4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.






  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
  2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets. 
  3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
  4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).


ε = 0



ε = 0.05



ε = 0.1


ε = 0.15


ε = 1.0



Optimistic Greedy


R = 0


R = 1




R = 3



R = 5



R = 10000



The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.


  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.


Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
  2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
  3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
  4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
  5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.




Thompson Beta


Thompson Normal


Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.


Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:


  • The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
  • The 4-grams are chosen so that all the words involved come from a small
    vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
  • Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
ans =
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left

  • The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
  • Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

  • The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
  • Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
%ans =
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
%ans = 91
%ans = we

  • The validation and test data are also similar. They contain 46,568 4-grams each.
  • Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
  • The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
  • First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
  • At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
  • The training method will output a ‘model’ (weight matrices, biases for each layer in the network).


Description of the Network


  • As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
  • The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
  • This layer is connected to the hidden layer, which in turn is connected to the output layer.
  • The output layer is a softmax over the 250 words.
  • The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.


Forward Propagation

  • The forward propagation is pretty straight-forward and can be implemented as shown in the following code:
    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
     fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
     hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    % input_batch: The input data as a matrix of size numwords X batchsize where,
    % numwords is the number of words, batchsize is the number of data points.
    % So, if input_batch(i, j) = k then the ith word in data point j is word
    % index k of the vocabulary.
    % word_embedding_weights: Word embedding as a matrix of size
    % vocab_size X numhid1, where vocab_size is the size of the vocabulary
    % numhid1 is the dimensionality of the embedding space.
    % embed_to_hid_weights: Weights between the word embedding layer and hidden
    % layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
    % number of hidden units.
    % hid_to_output_weights: Weights between the hidden layer and output softmax
    % unit as a matrix of size numhid2 X vocab_size
    % hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    % output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    % Outputs:
    % embedding_layer_state: State of units in the embedding layer as a matrix of
    % size numhid1*numwords X batchsize
    % hidden_layer_state: State of units in the hidden layer as a matrix of size
    % numhid2 X batchsize
    % output_layer_state: State of units in the output layer as a matrix of size
    % vocab_size X batchsize
    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
     word_embedding_weights(reshape(input_batch, 1, []),:)',...
     numhid1 * numwords, []);
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
     repmat(hid_bias, 1, batchsize);
    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);
    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
     - repmat(max(inputs_to_softmax), vocab_size, 1);
    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);
    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
     sum(output_layer_state, 1), vocab_size, 1);




  •  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.



  • As the model trains it prints out some numbers that tell how well the training is going.
  • The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
  • After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
  • The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
  • At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.


Some Applications

1. Predict next word

  • Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
  • The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
  • It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.




2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

  1. Given 3 words to start from, initialize the text with those 3 words.
  2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
  3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
  4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:

function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');


Starting with the words  'i was going‘, here are some texts that were generated using the model:


Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:


3. Find nearest words

  •  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
  • As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
  • For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.




4. Visualization in 2-dimension with t-SNE

  •  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
  • As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
  • Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).



5. Solving Word-Analogy Problem

  •  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
  • The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.


  • The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.
    analogy('year', 'years', 'day', model); % singular-plural relation
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
    analogy('on', 'off', 'new', model) % antonyms relation
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
    analogy('use', 'used', 'do', model) % present-past relation
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
    analogy('he', 'his', 'they', model) % pronoun-relations
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
    analogy('today', 'yesterday', 'now', model)
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


Model Selection

  • Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
  • The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
  • The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.



Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the coursera course Convolution Networks which is a part of the Deep Learning Specialization (taught by Prof. Andrew Ng.,  from Stanford and, the lecture videos corresponding to the YOLO algorithm can be found here).  The problem description is taken straightaway from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

Some Theory

Let’s first clear the concepts regarding classification, localization, detection and how the object detection problem can be transformed to supervised machine learning problem and subsequently can be solved using a deep convolution neural network. As can be seen from the next figure,

  • Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether there is a car in an image or not), but also finding a bounding box around the object, if one found.
  • Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
  • The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
  • A typical output data vector will contain 8 entries for a 4-class classification, as shown in the next figure, the first entry will correspond to whether or not an object of any from the 3 classes of objects. In case one is present in an image, the next 4 entries will define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. In case none of the objects are present, the first entry will be 0 and the others will be ignored.



  • Now moving from localization to detection, one can proceed in two steps as shown below in the next figure: first use small tightly cropped images to train a convolution neural net for image classification and then use sliding windows of different window sizes (smaller to larger) to classify a test image within that window using the convnet learnt and run the windows sequentially through the entire image, but it’s infeasibly slow computationally.
  • However, as shown in the next figure, the convolutional implementation of the sliding windows by replacing the fully-connected layers by 1×1 filters makes it possible to simultaneously classify the image-subset inside all possible sliding windows parallelly, making it much more efficient computationally.



  • The convolutional sliding windows, although computationally much more efficient, still has the problem of detecting the accurate bounding boxes, since the boxes don’t align with the sliding windows and the object shapes also tend to be different.
  • YOLO algorithm overcomes this limitation by dividing a training image into grids and assigning an object to a grid if and only if the center of the object falls inside the grid, that way each object in a training image can get assigned to exactly one grid and then the corresponding bounding box is represented by the coordinates relative to the grid. The next figure described the details of the algorithm.
  • In the test images, multiple adjacent grids may think that an object actually belongs to them, in order to resolve the iou (intersection of union) measure is used to find the maximum overlap and the non-maximum-suppression algorithm is used to discard all the other bounding boxes with low-confidence of containing an object, keeping the one with the highest confidence among the competing ones and discard the others.
  • Still there is a problem of multiple objects falling in the same grid. Multiple anchor boxes (of different shapes) are used to resolve the problem, each anchor box of a particular shape being likely to eventually detect  an object of a particular shape.



The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection in the CVPR 2016 summarizing the algorithm:


Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank for providing this dataset! is a company building the brains of self-driving vehicles.


We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

Definition of a box


If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.



YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

Model details

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3).
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO


If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two last dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two last dimensions



Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

Find the class detected by each box


Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:


Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:


Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
  • Select only one box when several boxes overlap with each other and detect the same object.


Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

  • box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
  • boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
  • box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

  • Compute box scores by doing the element-wise product as described in the above figure.
  • For each box, find:
    • the index of the class with the maximum box score.
    • the corresponding box score.
  • Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
  • Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
    We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
 """Filters YOLO boxes by thresholding on object and class confidence.

 box_confidence -- tensor of shape (19, 19, 5, 1)
 boxes -- tensor of shape (19, 19, 5, 4)
 box_class_probs -- tensor of shape (19, 19, 5, 80)
 threshold -- real value, if [ highest class probability score = threshold)

 # Step 4: Apply the mask to scores, boxes and classes

return scores, boxes, classes


Non-max suppression

Even after filtering by thresholding over the classes scores, we still end up a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).


n this example, the model has predicted 3 cars, but it’s actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”



Exercise: Implement iou(). Some hints:

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
  • We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    xi1 = maximum of the x1 coordinates of the two boxes
    yi1 = maximum of the y1 coordinates of the two boxes
    xi2 = minimum of the x2 coordinates of the two boxes
    yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.

def iou(box1, box2):
 """Implement the intersection over union (IoU) between box1 and box2

 box1 -- first box, list object with coordinates (x1, y1, x2, y2)
 box2 -- second box, list object with coordinates (x1, y1, x2, y2)

# Calculate the (y1, x1, y2, x2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou


We are now ready to implement non-max suppression. The key steps are:

  • Select the box that has the highest score.
  • Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  • Go back to step 1 and iterate until there’s no more boxes with a lower score than the current selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so we don’t actually need to use your iou() implementation):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
 Applies Non-max suppression (NMS) to set of boxes

 scores -- tensor of shape (None,), output of yolo_filter_boxes()
 boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
 classes -- tensor of shape (None,), output of yolo_filter_boxes()
 max_boxes -- integer, maximum number of predicted boxes you'd like
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (, None), predicted score for each box
 boxes -- tensor of shape (4, None), predicted box coordinates
 classes -- tensor of shape (, None), predicted class for each box

 Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
 function will transpose the shapes of scores, boxes, classes. This is made for convenience.

 max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
 K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

 # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

 # Use K.gather() to select only nms_indices from scores, boxes and classes

 return scores, boxes, classes


Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing this data on a different size image – for example, the car detection dataset had 720×1280 images – his step rescales the boxes so that they can be plotted on top of the original 720×1280 image.

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
 Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

 yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
 box_confidence: tensor of shape (None, 19, 19, 5, 1)
 box_xy: tensor of shape (None, 19, 19, 5, 2)
 box_wh: tensor of shape (None, 19, 19, 5, 2)
 box_class_probs: tensor of shape (None, 19, 19, 5, 80)
 image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
 max_boxes -- integer, maximum number of predicted boxes you'd like
 score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (None, ), predicted score for each box
 boxes -- tensor of shape (None, 4), predicted box coordinates
 classes -- tensor of shape (None,), predicted class for each box

 # Retrieve outputs of the YOLO model

 # Convert boxes to be ready for filtering functions 

 # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

 # Scale boxes back to original image shape.

 # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold 

 return scores, boxes, classes


Summary for YOLO:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    • Each cell in a 19×19 grid over the input image gives 425 numbers.
    • 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
    • 85 = 5 + 80 where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and and 80 is the number of classes we’d like to detect.
  • We then select only few boxes based on:
    • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
    • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
  • This gives us YOLO’s final output.


Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to start your graph. Run the following cell.

sess = K.get_session()

Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes(“coco_classes.txt”)
anchors = read_anchors(“yolo_anchors.txt”)
image_shape = (720., 1280.)


Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model(“yolo.h5”)

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.


Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 608, 608, 3) 0
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672


Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We added yolo_outputs to your graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Lets now call yolo_eval, which you had previously implemented, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Run the graph on an image

Let the fun begin. We have created a (sess) graph that can be summarized as follows:

  1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
  2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
  3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image(“images/” + image_file, model_image_size = (608, 608))

which outputs:

  • image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
  • image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.

def predict(sess, image_file):
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the preditions.

sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.

out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes

Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.

 # Preprocess your image

 # Run the session with the correct tensors and choose the correct placeholders in the
 # feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

 # Print predictions info
print('Found {} boxes for {}'.format(len(out_boxes), image_file))
# Generate colors for drawing bounding boxes.
colors = generate_colors(class_names)
# Draw bounding boxes on the image file
draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
# Save the predicted bounding box on the image"out", image_file), quality=90)
# Display the results
output_image = scipy.misc.imread(os.path.join("out", image_file))

 return out_scores, out_boxes, out_classes

Let’s Run the following cell on the following “test.jpg” image to verify that our function is correct.


out_scores, out_boxes, out_classes = predict(sess, “test.jpg”)

The following figure shows the output after car detection. Each of the bounding boxes have the name of the object detected on the top left along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)


The following animation shows the output Images with detected objects (cars) using YOLO for a set of input images.


What we should remember:

  • YOLO is a state-of-the-art object detection model that is fast and accurate.
  • It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
  • The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
  • You filter through all the boxes using non-max suppression. Specifically:
    Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
  • Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
  • Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise.


References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

  1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
  2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
  3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
  4. The official YOLO website .

Car detection dataset: Creative Commons License.

The Sample Dataset (provided by is licensed under a Creative Commons Attribution 4.0 International License.

Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.


Problem Statement

Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.


  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
    w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
    I1g = I1g / 255. # normalize pixels
    I2g = I2g / 255. # normalize pixels
    # Implement Lucas Kanade
    # for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            #b = ... # get b here
            #A = ... # get A here
            # if threshold τ is larger than the smallest eigenvalue of A'A:
            nu = ... # get velocity here

    return (u,v)


Some Results

  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
  • Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences



Output Optical Flow with different window sizes

window size = 15


window size = 21



Input Sequences

Output Optical Flow


Input Sequences (hamburg taxi)


Output Optical Flowtaxi_opt


Input Sequences


Output Optical Flow

Input Sequences


Output Optical Flowseq_opt

Input Sequences    fount3.gif


Output Optical Flowfount_opt

Input Sequences

Output Optical Flow

Input Sequencessynth

Output Optical Flowsynth_opt

Input Sequencescars1
Output Optical Flowcars1_optcars1_cmap

Input Sequencescars2

Output Optical Flowcars2_opt

Output Optical Flowcars2_opt2cars2_cmap

Input Sequences



Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25
Output Optical Flow with window size 45cars3_opt2_45


Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described, this algorithm was proposed by Felzenszwalb et. al. from MIT in this paper.  The slides on this paper can be found from this link from the Stanford Vision Lab too. The algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, as stated by the author and hence can  be
implemented to run in O(m log m) time, where m is the number of edges in the graph.

Problem Definition and the basic idea (from the paper)

  • Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

  • In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).

  • Particularly for the implementation described here, an edge weight function based on the absolute intensity difference (in the yiq space) between the pixels connected by an edge, w((vi, vj )) = |I(pi) − I(pj )|.

  • In the graph-based approach, a segmentation S is a partition of V into components
    such that each component (or region) C ∈ S corresponds to a connected component
    in a graph G0 = (V, E0), where E0 ⊆ E.

  • In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.

  • This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.

  • The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.


  • The threshold function τ controls the degree to which the difference between two
    components must be greater than their internal differences in order for there to be
    evidence of a boundary between them.

  • For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ (C) = k/|C| is needed to be usedwhere |C| denotes the size of C, and k is some constant parameter.

  • That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.

  • In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.

  • The following python code shows how to create the graph.

import numpy as np
from scipy import signal
import matplotlib.image as mpimg

def gaussian_kernel(k, s = 0.5):
    # generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
    probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
    # create the pixel graph with edge weights as dissimilarities
     rgb = mpimg.imread(imfile)[:,:,:3]
     gauss_kernel = gaussian_kernel(sz, sigma)
     for i in range(3):
         rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
     yuv = rgb2yiq(rgb)
     (w, h) = yuv.shape[:2]
     edges = {}
     for i in range(yuv.shape[0]):
         for j in range(yuv.shape[1]):
             #compute edge weight for nbd pixel nodes for the node i,j
             for i1 in range(i-1, i+2):
                 for j1 in range(j-1, j+2):
                     if i1 == i and j1 == j: continue

                     if i1 >= 0 and i1 = 0 and j1 < h:
                        wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
                        n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
                        edges[n1, n2] = edges[n2, n1] = wt
     return edges


Some Results

  • The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

  • Although in the paper the author described the best value of the parameter k to be around 300, but  since in this implementation the pixel RGB values are normalized (to have values in between 0 – 1) and then converted to YIQ values and the YIQ intensities are used for computing the weights (which are typically very small), the value of k that works best in this scenario is 0.001-0.01.

  • As we can see from the below results, higher the value of the parameter k, larger the size of the final component and lesser the number of components in the result.

  • The minimum spanning tree creation is also shown, the red edges shown in the figures are the edges chosen by the algorithm to merge the components.

Input Image


Output Images for two different values of the parameter k



out_ 0.001player.png

out_ 0.010player.png

The forest created after a few iterations


Input Image


Output Images for two different values of the parameter k


out_ 0.010hill.png


out_ 0.001hill.png

The forest created after a few iterations


Input Image


Output Segmented Images

parrot_k_0.001out_ 0.001parrot_Dark2.pngout_ 0.001parrot_hot.png

Input Image


Segmented Output Images

out_ 0.001road2_Set1.png

Input Image


Output Images for two different values of the parameter k


out_ 0.001road.png
road_k_0.01out_ 0.010road.png

The forest created after a few iterations



  Input Image (UMBC)


Segmented Output Images

umbc_k_0.001out_ 0.001umbc.png

    Input Image


Segmented Output Image with k=0.001

out_ 0.001nj.png

    Input Image


Segmented Output Image with k=0.001


    Input Image


Segmented Output Image with k=0.001

out_ 0.001flowers2.png

    Input Image (liver)


Segmented Output Images 

out_ 0.001liver.pngout_ 0.010liver.png

    Input Image


Segmented Output Images with different values of k


out_ 0.001building.pngout_ 0.010building.png

    Input Image


Segmented Output Image

out_ 0.001frame056.png

    Input Image


Segmented Output Images for different k



Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed. and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boycov and Jolli in this paper. This problem appeared as a homework assignment here., and also in this lecture video from the Coursera image processing course by Duke university.

Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can be difficult sometimes to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines using which our algorithm is going to identify the foreground and the background, after that it will be the algorithms job to obtain a complete segmentation of the foreground from the background image.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image       bj


The Graph-Cut Algorithm


The following describes how the segmentation problem is transformed into a graph-cut problem:

  1. Let’s first define the Directed Graph G = (V, E) as follows:

    1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

    2. Next, let’s defines the edges of the graph. As obvious, there is going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).

    3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.

    4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.

  2. Now let’s describe how to compute the edge weights in this graph.

    1. In order to compute the terminal edge weights, we need to estimate the feature distributions first, i.e., starting with the assumption that each of the nodes corresponding to the scribbled pixels have the probability 1.0 (since we want the solution to respect the regional hard constraints marked by the user-seeds / scribbles) to be in foreground or background object in the image (distinguished by the scribble color, e.g.), we have to compute the probability that a node belongs to the foreground (or background) for all the other non-terminal nodes.

    2. The simplest way to compute P_F and P_B is to first fit a couple of  Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
      with MLE from the scribbled pixel intensities and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels as shown in the next figures. The following code fragment show how the pdfs are being computed.

      import numpy as np
      from collections import defaultdict
      def compute_pdfs(imfile, imfile_scrib):
          # Compute foreground and background pdfs
          # input image and the image with user scribbles
           rgb = mpimg.imread(imfile)[:,:,:3]
           yuv = rgb2yiq(rgb)
           rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
           yuv_s = rgb2yiq(rgb_s)
           # find the scribble pixels
           scribbles = find_marked_locations(rgb, rgb_s)
           imageo = np.zeros(yuv.shape)
           # separately store background and foreground scribble pixels in the dictionary comps
           comps = defaultdict(lambda:np.array([]).reshape(0,3))
           for (i, j) in scribbles:
                imageo[i,j,:] = rgbs[i,j,:]
                # scribble color as key of comps
                comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
                mu, Sigma = {}, {}
           # compute MLE parameters for Gaussians
           for c in comps:
                mu[c] = np.mean(comps[c], axis=0)
                Sigma[c] = np.cov(comps[c].T)
           return (mu, Sigma)

      3. In order to compute the non-terminal edge weights, we need to compute the similarities in between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (e.g., how similar neighborhood pixels are in RGB / YIQ space).

  3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.

  4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).

  5. There are several standard algorithms, e.g., Ford-Fulkerson (by finding an augmenting path with O(E max| f |) time complexity) or Edmonds-Karp (by using bfs to find the shortest augmenting path, with O(VE2) time complexity) to solve the max-flow problem, typical implementations of these algorithms run pretty fast, in polynomial time in V, E.  Here we are going to use a different implementation (with pymaxflow) based on Vladimir Kolmogorov, which is shown to run faster on many images empirically in this paper.







The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.

Input Imagedeer

Input Image with Scribblesdeer_scribed

Fitted Densities from Color Scribblesgaussian_deer.png


A Tiny Sub-graph with Min-Cut


Input Image

Input Image with Scribbleseagle_scribed

Fitted Densities from Color Scribblesgaussian_eagle



A Tiny Sub-graph with Min-Cut flow_eagle

Input Image (liver)                                             

Input Image with Scribblesliver_scribed

Fitted Densities from Color Scribbles


Input Imagefishes2

Input Image with Scribbles

Fitted Densities from Color Scribblesgaussian_fishes2


A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribblespanda_scribed


Input Imagebunny

Input Image with Scribblesbunny_scribed

Fitted Densities from Color Scribbles



A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image                                                         Input Image with Scribbles
me3    me3_scribed


Input Imageinput1

Input Image with Scribblesinput1_scribed


A Tiny Sub-graph of the flow-graph with Min-Cut


Input Imagerose

Input Image with Scribbles

Fitted Densities from Color Scribbles


Input Image

Input Image with Scribbleswhale_scribed

Fitted Densities from Color Scribblesgaussian_whale

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image (UMBC Campus Map)umbc_campus

Input Image with Scribbles


Input Imagehorses

Input Image with Scribbles


A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)


Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste

The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Imagehorses

New Backgroundsky.png




Image Colorization Using Optimization in Python


This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.


The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.


The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote \left(x,y,t \right) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different. 


When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles 

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving the system of linear equations of the form WU = b that has a closed form least-square solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. Same thing is to be repeated for the V channel too.
  5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.rgbyiq.png

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j in color scribble
            (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
            Z = 0.
            id1 = ij2id(i,j,w,h) # linearized id
            for (i1, j1) in N: #s = (i1, j1)
                id2 = ij2id(i1,j1,w,h)
                W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                Z += W[id1,id2]
            if Z > 0:
               for (i1, j1) in N: #s = (i1, j1)
                   id2 = ij2id(i1,j1,w,h)
                   W[id1,id2] /= -Z
     for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1.

    rows, cols = zip(*(W.keys())) #keys
    data = W.values() #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 


As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output


Original Gray-scale Image Input 


Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked
Color image Output

Original Gray-scale Image Input


Gray-scale image Input Marked


Color image Output



Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126 . The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive patterns of your own design.


Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.


Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

class Sierpinski:
#Height of an equilateral triangle whose sides are of the specified length.
def height (self, length):
return sqrt(3) * length / 2.
#Draws a filled equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def filledTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
#Draws an empty equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def emptyTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
# Draws a Sierpinski triangle of order n, such that the largest filled
# triangle has bottom vertex (x, y) and sides of the specified length.
def sierpinski(self, n, x, y, length):
self.filledTriangle(x, y, length)
if n > 1:
self.sierpinski(n-1, x-length/2., y, length/2.)
self.sierpinski(n-1, x+length/2., y, length/2.)
self.sierpinski(n-1, x, y+self.height(length), length/2.)

The following animation shows how such a triangle of order 5 is drawn recursively.

The following animation shows how such a triangle of order 6 is drawn recursively.


A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.


Part 2.

Drawing a tree recursively, as described here:


The following shows how a tree of order 10 is drawn:



The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolation, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.


In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:


Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:


The following figure shows a few such transformation matrices:


To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.


To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping


2. Scaling by a factor of 0.5

3. Rotation by 45 degrees around the center of the image


The next animations show rotation and sheer transformations with the Lena image:



Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.


non-linear transformations

The next figure shows the transform functions from here:


The next figures and animations show the application of the above non-linear transforms on the Lena image.









Some more non-linear transforms:




There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if let’s try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image


Down-sampled Image with Bilinear Interpolation

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:


Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal

def create_image_stack(img, max_level):
K = np.ones((3,3)) / 9.
image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
image_stack[:,:,0] = img
for l in range(1, max_level):
image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
boundary=’symm’, mode=’same’)
return image_stack

The next animation shows the image stack created from the bricks image.



Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row,col,k) of the image stack, where all three coordinates are fractional. We therefore something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

def trilerp (img_stack, x, y, k):

if k < 1: k = 1
if  k == int(k):
return bilerp(img_stack[:,:,k-1], x, y)
f_k, c_k = int(floor(k)), int(ceil(k))
v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolationout_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image

Down-sampled Image with Bilinear Interpolation

Down-sampled Image with Anti-aliasing

The following animation shows the branding artifacts created when using bilinear interpolation for  different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation


Down-sampled Images with Anti-aliasing