Solving Some Image Processing and Computer Vision Problems with Python libraries

In this article, a few image processing / computer vision problems and their solutions  with python libraries (scikit-image, cv2) will be discussed. Some of the problems are from the exercises from this book (available on amazon). This blog will be continued here.

Removing Gaussian Noise from images by computing mean and median images with scikit-image

  1. Start with an input image.
  2. Create n (e.g, n=100) noisy images by adding i.i.d. Gaussian noise (with zero mean) to the original image, with scikit-image.
  3. Compute the mean (median) of the noisy images.
  4. Compare PSNR with the original image.
  5. Vary n and compare the results.
from skimage import img_as_float
from skimage.util import random_noise
from skimage.measure import compare_psnr
from import imread
import matplotlib.pylab as plt
import numpy as np

im = img_as_float(imread('../new images/parrot.jpg')) # original image
# generate n noisy images from the original image by adding Gaussian noise
n = 25
images = np.zeros((n, im.shape[0], im.shape[1], im.shape[2]))
sigma = 0.2
for i in range(n):
    images[i,...] = random_noise(im, var=sigma**2)

im_mean = images.mean(axis=0)
im_median = np.median(images, axis=0)
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
plt.subplot(221), plt.imshow(im), plt.axis('off'), plt.title('Original image', size=20)
plt.subplot(222), plt.imshow(images[0]), plt.axis('off'), plt.title('Noisy PSNR: ' + str(compare_psnr(im, images[0])), size=20)
plt.subplot(223), plt.imshow(im_mean), plt.axis('off'), plt.title('Mean PSNR: ' + str(compare_psnr(im, im_mean)), size=20)
plt.subplot(224), plt.imshow(im_median), plt.axis('off'), plt.title('Median PSNR: ' + str(compare_psnr(im, im_median)), size=20)

The next figure shows the original image, a noisy image generated from it by adding Gaussian noise (with 0 mean) to it and the images obtained by taking mean / median over all the n noisy images generated. As can be seen, the Gaussian noise in the images gets cancelled out by taking mean / median.

with n = 25


with n=100

plt.hist(images[:,100,100,0], color='red', alpha=0.2, label='red')
plt.hist(images[:,100,100,1], color='green', alpha=0.2, label='green')
plt.hist(images[:,100,100,2], color='blue', alpha=0.2, label='blue')

The next figure shows how a pixel value (that can be considered a random variable) for a particular location in different noisy images follows approximately a Gaussian distribution.

Distribution of a pixel value at location (100,100) in the noisy images


ns = [25, 50, 100, 200]
# mean_psnrs contain the PSNR values for different n
plt.plot(ns, mean_psnrs, '.--', label='PSNR (mean)')
plt.plot(ns, median_psnrs, '.--', label='PSNR (median)')
plt.xlabel('n'),  plt.ylabel('PSNR')

The following figure shows that the PSNR improves with large n (since by SLLN / WLLN, the sample mean converges to population mean 0 of the Gaussian noise). Also, for median the improvement in the image quality is higher for larger values of n.


Tracking Pedestrians with HOG-SVM with OpenCV / scikit-image

  1. Start with a video with pedestrians.
  2. Capture the video / extract frames from the video.
  3. For each frame
    1. Create HOG scale pyramid of the frame image.
    2. At each scale, use a sliding window to extract the corresponding block from the frame, compute the HOG descriptor features.
    3. Use cv2‘s HOGDescriptor_getDefaultPeopleDetector() – a pre-trained SVM classifier on the HOG descriptor to classify whether the corresponding block contains a pedestrian or not.
    4. Run non-max-suppression to get rid of multiple detection of the same person.
    5. Use cv2‘s  detectMultiScale() function to implement steps 3-4.

The code is adapted from the code here and here.

# HOG descriptor using default people (pedestrian) detector
hog = cv2.HOGDescriptor()

# run detection, using a spatial stride of 4 pixels,
# a scale stride of 1.02, and zero grouping of rectangles
# (to demonstrate that HOG will detect at potentially
# multiple places in the scale pyramid)
(foundBoundingBoxes, weights) = hog.detectMultiScale(frame, winStride=(4, 4), padding=(8, 8), scale=1.02, finalThreshold=0, useMeanshiftGrouping=False)

# convert bounding boxes from format (x1, y1, w, h) to (x1, y1, x2, y2)
rects = np.array([[x, y, x + w, y + h] for (x, y, w, h) in foundBoundingBoxes])

# run non-max suppression on the boxes based on an overlay of 65%
nmsBoundingBoxes = non_max_suppression(rects, probs=None, overlapThresh=0.65)

cv2 functions are used to extract HOG descriptor features and pedestrian detection with SVM,  whereas scikit-image functions are used to visualize the HOG features. The animations below display the original video, what HOG sees and  the detected pedestrians after non-max suppression. Notice there are a few false positive detection.

Original Videoped1

HOG-descriptor features video (what HOG sees)pedh1Original Video with detected Pedestrians ped1o

Face Detection with HaarCascade pre-trained AdaBoost classifiers with OpenCV

  1. Capture video with webcam with cv2.VideoCapture().
  2. For each frame, use the pre-trained Adaboost Cascade classifiers (the haarcascade_frontalface_default classifier for face detection and haarcascade_eye_tree_eyeglasses classifier for better detection of the eyes with glasses, from the corresponding xml files that come with cv2’s installation) using Haar-like features with cv2.CascadeClassifier().
  3. First detect the face(s) with the detectMultiScale() function and draw a bounding box. Then detect the eyes inside a detected face with the same function.
  4. The following python code snippet shows how to detect faces and eyes with cv2. The code is adapted from here.


# read the cascade classifiers from the xml files from the correct path into face_cascade  # and eye_cascade
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# return bounding box of the face(s) if one is detected
faces = face_cascade.detectMultiScale(gray, 1.03, 5)
for (x,y,w,h) in faces:
   frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2)
   roi_gray = gray[y:y+h, x:x+w]
   roi_color = frame[y:y+h, x:x+w]
   eyes = eye_cascade.detectMultiScale(roi_gray)
   for (ex,ey,ew,eh) in eyes:

The next animation shows the results of face detection when scalefactor 1.03 was used to create the scale pyramid.  As can be seen, the eyes with the glasses on and some small faces from the photos are not detected at this scale.


The next animation shows the results of face detection when scalefactor 1.3 was used to create the scale pyramid.  As can be seen, the eyes with/without the glasses on as well as most of the small faces from the photos are detected at this scale most of the time.



Object Tracking with OpenCV trackers


Object Saliency Detection with OpenCV


Linear / QR Barcode Generation and Detection with OpenCV




Image Segmentation with Random Walk with scikit-image



Image Segmentation with Grab-Cut with OpenCV


Segmentation with SLIC + RAG in scikit-image


Homography with scikit-image / OpenCV











Face Morphing (Beier-Neely morphing) with Pystasm

me14     sheldon



Object Detection with YOLO DarkNet / Keras / OpenCV (Deep Learning model)


Semantic Segmentation with ENet / DeepLab (Deep Learning  model)

Input video and the segmented Output video

Input video and the segmented Output video

Neural Style Transfer with OpenCV / Torch (Deep Learning Model)

Input  & Output with style (Starry night)me_style

Text Detection with EAST (Deep Learning Model)




Image Colorization with Deep Learning (OpenCV / Caffe)



OCR + Text Recognition with EAST + Tesseract




Detection of a Human Object with HOG Descriptor Features using SVM (Primal QuadProg implementation using CVXOPT) in Python

In this article, first how to extract the HOG descriptor from an image will be discuss. Then how a support vector machine binary classifier can be trained on a dataset containing labeled images (using the extracted HOG descriptor features) and later how the SVM model can be used (along with a sliding window) to predict whether or not a human object exists in a test image will be described.  How SVM can be represented as a Primal Quadratic Programming problem and can be solved with CVXOPT that will also be discussed. This problem appeared as an assignment problem in this Computer Vision course from UCF.

Problem 1: Compute HOG features

Let’s first Implement Histogram of Orientated Gradients (HOG). The dataset to be used is the INRIA Person Dataset from here. The dataset consists of positive and negative examples for training as well as testing images. Let us do the following:

i. Take 2003 positive training images of size 96×160
ii. Take 997 negative training images of size 96×160
iii. Compute HOG for positive and negative examples.
iv. Show the visualization of HOG for some positive and negative examples.

The Histograms of Oriented Gradients for Human Detection (HOG) is a very heavily cited paper by N. Dalal and B. Triggs from CVPR 2005. The following figure shows the  algorithm proposed by them can be used to compute the HOG features for a 96×160 image:


The next python code snippet shows some helper functions to compute the hog features:

import numpy as np
from scipy import signal
import scipy.misc

def s_x(img):
    kernel = np.array([[-1, 0, 1]])
    imgx = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgx
def s_y(img):
    kernel = np.array([[-1, 0, 1]]).T
    imgy = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgy

def grad(img):
    imgx = s_x(img)
    imgy = s_y(img)
    s = np.sqrt(imgx**2 + imgy**2)
    theta = np.arctan2(imgx, imgy) #imgy, imgx)
    theta[theta<0] = np.pi + theta[theta<0]
    return (s, theta)


The following figures animations show some positive and negative training examples along with the HOG features computed using the algorithm.


Positive Example 1

The next animation shows how the HOG features are computed using the above algorithm.


Positive Example 2


The next animation shows how the HOG features are computed using the above algorithm.



Positive Example 3


The next animation shows how the HOG features are computed using the above algorithm.

Negative Example 1


The next animation shows how the HOG features are computed using the above algorithm.


Problem 2: Use sklearn’s SVC and 80-20 validation to compute accuracy on the held-out training images dataset using the extracted HOG features.

Before implementing SVC on our own with primal quadratic programming solver, let’s use the scikit-learn SVC implementation (with linear kernel) to train a support vector classifier on the training positive and negative examples using the HOG features extracted  from the training images with 80-20 validation and compute accuracy of classification on the held-out images.

The following python code does exactly that, with the X matrix containing the 1620 HOG features extracted from each image and the corresponding label (pos/neg, depending on whether human is present or not), with 98.5% accuracy on the held-out dataset.

import time
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=123)
timestamp1 = time.time()
clf = SVC(C=1, kernel='linear'), ytrain)
print("%d support vectors out of %d points" % (len(clf.support_vectors_), len(Xtrain)))
timestamp2 = time.time()
print "sklearn LinearSVC took %.2f seconds" % (timestamp2 - timestamp1)
ypred = clf.predict(Xtest)
print('accuracy', accuracy_score(ytest, ypred))

430 support vectors out of 2400 points
sklearn LinearSVC took 3.40 seconds
accuracy 0.985

The next figures show the confusion matrices for the prediction on the held-out dataset with the SVC model learnt.




Problem 3: Implement SVM by solving the Primal form of the problem using Quadratic Programming

Let’s implement Support Vector Machine (SVM) using Quadratic Programming. We shall use python’s CVXOPT package for this purpose. Let’s do the following:

i. Try to understand each input term in cvxopt.solvers.qp.
ii. Formulate soft- margin primal SVM in term of inputs of cvxopt.solvers.qp
iii. Show ‘P’, ‘Q’, ‘G”, ‘h’, ‘A’ and ‘b’ Matrices.
iv. Obtain parameter vector ‘w’ and bias term ‘b’ using cvxopt.solvers.qp


To be done


Problem 4: Detect Human in testing images using trained model (‘w’, ‘b’) from the last problem

Let’s use the coefficients learnt by the SVM model from the training dataset and do the following:

i. Take at least 5 testing images from Test/pos.
ii. Test the trained model over testing images. Testing can be performed using
w*feature vector + b.
iii. Use sliding window approach to obtain detection at each location in the image.
iv. Perform non-maximal suppression and choose the highest scored location.
v. Display the bounding box at the final detection.


To be done




Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

  • First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
  • Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
  • Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
  • The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:


or as the following if the explicit bias term is discarded:


where the 0-1 loss is approximated by the hinge-loss.


  • Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
  • The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
  • Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
  • The learning rate η is gradually decreased with iteration.


The following figure shows a simplified version of the algorithm:




The following python code implements the algorithm:

class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
       self.n_iter = n_iter
       self.lambda1 = lambda1

    def fit(self, X, Y):
       Y = list(Y)
       # convert all output values to +1 or -1
       Yn = [sign(y, self.positive_class) for y in Y]
       X = X.toarray()
       m, n_features = X.shape[0], X.shape[1]
       self.w = numpy.zeros( n_features )
       for i in range(self.n_iter):
           eta = 1. / (self.lambda1*(i+1))
           j = numpy.random.choice(m, 1)[0]
           x, y = X[j], Yn[j]
           score =
           if y*score < 1:
              self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
              self.w = (1 - eta*self.lambda1)*self.w

Some Notes

  • The optional projection step has been left out (the line in square brackets in the paper).
  •  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
  •  The number η is the step length in gradient descent.
  • The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
  • Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
  • Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
  • Classification accuracy obtained on the test split was 95%.


2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.


Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1 flame

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
  • Classification accuracy obtained on the test split was 95.8%.


Dataset 2

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
  • Classification accuracy obtained on the test split was 100%.


3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
  • Classification accuracy obtained on the test split was 97.5%.


4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.


The next figures show the distribution of the word lengths for the texts for different sentiments.



  • Create sklearn Pipeline to
    • Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
    • Use sklearn SelectKBest to choose top 5000 features
    • Train / test the model
  • Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
  • Also Record the time taken to train the classifier.
  • Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.


The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
    classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
    tokenizer = lambda x: x)),
    ('fs', SelectKBest(k=5000)),
    ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
    t0 = time.time(), Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))


Some Notes

  • The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
  • This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
  • The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
  • For Perceptron the entire training data was passed 10 times to the model while training.
  • For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.


As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.




Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.


Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:


  • The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
  • The 4-grams are chosen so that all the words involved come from a small
    vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
  • Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
ans =
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left

  • The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
  • Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

  • The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
  • Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
%ans =
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
%ans = 91
%ans = we

  • The validation and test data are also similar. They contain 46,568 4-grams each.
  • Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
  • The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
  • First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
  • At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
  • The training method will output a ‘model’ (weight matrices, biases for each layer in the network).


Description of the Network


  • As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
  • The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
  • This layer is connected to the hidden layer, which in turn is connected to the output layer.
  • The output layer is a softmax over the 250 words.
  • The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.


Forward Propagation

  • The forward propagation is pretty straight-forward and can be implemented as shown in the following code:
    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
     fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
     hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    % input_batch: The input data as a matrix of size numwords X batchsize where,
    % numwords is the number of words, batchsize is the number of data points.
    % So, if input_batch(i, j) = k then the ith word in data point j is word
    % index k of the vocabulary.
    % word_embedding_weights: Word embedding as a matrix of size
    % vocab_size X numhid1, where vocab_size is the size of the vocabulary
    % numhid1 is the dimensionality of the embedding space.
    % embed_to_hid_weights: Weights between the word embedding layer and hidden
    % layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
    % number of hidden units.
    % hid_to_output_weights: Weights between the hidden layer and output softmax
    % unit as a matrix of size numhid2 X vocab_size
    % hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    % output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    % Outputs:
    % embedding_layer_state: State of units in the embedding layer as a matrix of
    % size numhid1*numwords X batchsize
    % hidden_layer_state: State of units in the hidden layer as a matrix of size
    % numhid2 X batchsize
    % output_layer_state: State of units in the output layer as a matrix of size
    % vocab_size X batchsize
    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
     word_embedding_weights(reshape(input_batch, 1, []),:)',...
     numhid1 * numwords, []);
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
     repmat(hid_bias, 1, batchsize);
    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);
    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
     - repmat(max(inputs_to_softmax), vocab_size, 1);
    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);
    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
     sum(output_layer_state, 1), vocab_size, 1);




  •  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.



  • As the model trains it prints out some numbers that tell how well the training is going.
  • The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
  • After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
  • The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
  • At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.


Some Applications

1. Predict next word

  • Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
  • The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
  • It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.




2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

  1. Given 3 words to start from, initialize the text with those 3 words.
  2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
  3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
  4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:

function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');


Starting with the words  'i was going‘, here are some texts that were generated using the model:


Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:


3. Find nearest words

  •  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
  • As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
  • For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.




4. Visualization in 2-dimension with t-SNE

  •  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
  • As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
  • Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).



5. Solving Word-Analogy Problem

  •  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
  • The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.


  • The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.
    analogy('year', 'years', 'day', model); % singular-plural relation
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
    analogy('on', 'off', 'new', model) % antonyms relation
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
    analogy('use', 'used', 'do', model) % present-past relation
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
    analogy('he', 'his', 'they', model) % pronoun-relations
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
    analogy('today', 'yesterday', 'now', model)
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


Model Selection

  • Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
  • The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
  • The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.



Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.


Problem Statement

Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.


  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
    w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
    I1g = I1g / 255. # normalize pixels
    I2g = I2g / 255. # normalize pixels
    # Implement Lucas Kanade
    # for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            #b = ... # get b here
            #A = ... # get A here
            # if threshold τ is larger than the smallest eigenvalue of A'A:
            nu = ... # get velocity here

    return (u,v)


Some Results

  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
  • Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences



Output Optical Flow with different window sizes

window size = 15


window size = 21



Input Sequences

Output Optical Flow


Input Sequences (hamburg taxi)


Output Optical Flowtaxi_opt


Input Sequences


Output Optical Flow

Input Sequences


Output Optical Flowseq_opt

Input Sequences    fount3.gif


Output Optical Flowfount_opt

Input Sequences

Output Optical Flow

Input Sequencessynth

Output Optical Flowsynth_opt

Input Sequencescars1
Output Optical Flowcars1_optcars1_cmap

Input Sequencescars2

Output Optical Flowcars2_opt

Output Optical Flowcars2_opt2cars2_cmap

Input Sequences



Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25
Output Optical Flow with window size 45cars3_opt2_45


Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described, this algorithm was proposed by Felzenszwalb et. al. from MIT in this paper.  The slides on this paper can be found from this link from the Stanford Vision Lab too. The algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, as stated by the author and hence can  be implemented to run in O(m log m) time, where m is the number of edges in the graph.

Problem Definition and the basic idea (from the paper)

  • Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

  • In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).

  • Particularly for the implementation described here, an edge weight function based on the absolute intensity difference (in the yiq space) between the pixels connected by an edge, w((vi, vj )) = |I(pi) − I(pj )|.

  • In the graph-based approach, a segmentation S is a partition of V into components
    such that each component (or region) C ∈ S corresponds to a connected component
    in a graph G0 = (V, E0), where E0 ⊆ E.

  • In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.

  • This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.

  • The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.


  • The threshold function τ controls the degree to which the difference between two
    components must be greater than their internal differences in order for there to be
    evidence of a boundary between them.

  • For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ (C) = k/|C| is needed to be usedwhere |C| denotes the size of C, and k is some constant parameter.

  • That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.

  • In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.

  • The following python code shows how to create the graph.

import numpy as np
from scipy import signal
import matplotlib.image as mpimg

def gaussian_kernel(k, s = 0.5):
    # generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
    probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
    # create the pixel graph with edge weights as dissimilarities
     rgb = mpimg.imread(imfile)[:,:,:3]
     gauss_kernel = gaussian_kernel(sz, sigma)
     for i in range(3):
         rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
     yuv = rgb2yiq(rgb)
     (w, h) = yuv.shape[:2]
     edges = {}
     for i in range(yuv.shape[0]):
         for j in range(yuv.shape[1]):
             #compute edge weight for nbd pixel nodes for the node i,j
             for i1 in range(i-1, i+2):
                 for j1 in range(j-1, j+2):
                     if i1 == i and j1 == j: continue

                     if i1 >= 0 and i1 = 0 and j1 < h:
                        wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
                        n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
                        edges[n1, n2] = edges[n2, n1] = wt
     return edges


Some Results

  • The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

  • Although in the paper the author described the best value of the parameter k to be around 300, but  since in this implementation the pixel RGB values are normalized (to have values in between 0 – 1) and then converted to YIQ values and the YIQ intensities are used for computing the weights (which are typically very small), the value of k that works best in this scenario is 0.001-0.01.

  • As we can see from the below results, higher the value of the parameter k, larger the size of the final component and lesser the number of components in the result.

  • The minimum spanning tree creation is also shown, the red edges shown in the figures are the edges chosen by the algorithm to merge the components.

Input Image


Output Images for two different values of the parameter k



out_ 0.001player.png

out_ 0.010player.png

The forest created after a few iterations


Input Image


Output Images for two different values of the parameter k


out_ 0.010hill.png


out_ 0.001hill.png

The forest created after a few iterations


Input Image


Output Segmented Images

parrot_k_0.001out_ 0.001parrot_Dark2.pngout_ 0.001parrot_hot.png

Input Image


Segmented Output Images

out_ 0.001road2_Set1.png

Input Image


Output Images for two different values of the parameter k


out_ 0.001road.png
road_k_0.01out_ 0.010road.png

The forest created after a few iterations



  Input Image (UMBC)


Segmented Output Images

umbc_k_0.001out_ 0.001umbc.png

    Input Image


Segmented Output Image with k=0.001

out_ 0.001nj.png

    Input Image


Segmented Output Image with k=0.001


    Input Image


Segmented Output Image with k=0.001

out_ 0.001flowers2.png

    Input Image (liver)


Segmented Output Images 

out_ 0.001liver.pngout_ 0.010liver.png

    Input Image


Segmented Output Images with different values of k


out_ 0.001building.pngout_ 0.010building.png

    Input Image


Segmented Output Image

out_ 0.001frame056.png

    Input Image


Segmented Output Images for different k



Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed. and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boycov and Jolli in this paper. This problem appeared as a homework assignment here., and also in this lecture video from the Coursera image processing course by Duke university.

Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can be difficult sometimes to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines using which our algorithm is going to identify the foreground and the background, after that it will be the algorithms job to obtain a complete segmentation of the foreground from the background image.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image       bj


The Graph-Cut Algorithm


The following describes how the segmentation problem is transformed into a graph-cut problem:

  1. Let’s first define the Directed Graph G = (V, E) as follows:

    1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

    2. Next, let’s defines the edges of the graph. As obvious, there is going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).

    3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.

    4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.

  2. Now let’s describe how to compute the edge weights in this graph.

    1. In order to compute the terminal edge weights, we need to estimate the feature distributions first, i.e., starting with the assumption that each of the nodes corresponding to the scribbled pixels have the probability 1.0 (since we want the solution to respect the regional hard constraints marked by the user-seeds / scribbles) to be in foreground or background object in the image (distinguished by the scribble color, e.g.), we have to compute the probability that a node belongs to the foreground (or background) for all the other non-terminal nodes.

    2. The simplest way to compute P_F and P_B is to first fit a couple of  Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
      with MLE from the scribbled pixel intensities and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels as shown in the next figures. The following code fragment show how the pdfs are being computed.

      import numpy as np
      from collections import defaultdict
      def compute_pdfs(imfile, imfile_scrib):
          # Compute foreground and background pdfs
          # input image and the image with user scribbles
           rgb = mpimg.imread(imfile)[:,:,:3]
           yuv = rgb2yiq(rgb)
           rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
           yuv_s = rgb2yiq(rgb_s)
           # find the scribble pixels
           scribbles = find_marked_locations(rgb, rgb_s)
           imageo = np.zeros(yuv.shape)
           # separately store background and foreground scribble pixels in the dictionary comps
           comps = defaultdict(lambda:np.array([]).reshape(0,3))
           for (i, j) in scribbles:
                imageo[i,j,:] = rgbs[i,j,:]
                # scribble color as key of comps
                comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
                mu, Sigma = {}, {}
           # compute MLE parameters for Gaussians
           for c in comps:
                mu[c] = np.mean(comps[c], axis=0)
                Sigma[c] = np.cov(comps[c].T)
           return (mu, Sigma)

      3. In order to compute the non-terminal edge weights, we need to compute the similarities in between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (e.g., how similar neighborhood pixels are in RGB / YIQ space).

  3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.

  4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).

  5. There are several standard algorithms, e.g., Ford-Fulkerson (by finding an augmenting path with O(E max| f |) time complexity) or Edmonds-Karp (by using bfs to find the shortest augmenting path, with O(VE2) time complexity) to solve the max-flow problem, typical implementations of these algorithms run pretty fast, in polynomial time in V, E.  Here we are going to use a different implementation (with pymaxflow) based on Vladimir Kolmogorov, which is shown to run faster on many images empirically in this paper.







The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.

Input Imagedeer

Input Image with Scribblesdeer_scribed

Fitted Densities from Color Scribblesgaussian_deer.png


A Tiny Sub-graph with Min-Cut


Input Image

Input Image with Scribbleseagle_scribed

Fitted Densities from Color Scribblesgaussian_eagle



A Tiny Sub-graph with Min-Cut flow_eagle

Input Image (liver)                                             

Input Image with Scribblesliver_scribed

Fitted Densities from Color Scribbles


Input Imagefishes2

Input Image with Scribbles

Fitted Densities from Color Scribblesgaussian_fishes2


A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribblespanda_scribed


Input Imagebunny

Input Image with Scribblesbunny_scribed

Fitted Densities from Color Scribbles



A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image                                                         Input Image with Scribbles
me3    me3_scribed


Input Imageinput1

Input Image with Scribblesinput1_scribed


A Tiny Sub-graph of the flow-graph with Min-Cut


Input Imagerose

Input Image with Scribbles

Fitted Densities from Color Scribbles


Input Image

Input Image with Scribbleswhale_scribed

Fitted Densities from Color Scribblesgaussian_whale

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image (UMBC Campus Map)umbc_campus

Input Image with Scribbles


Input Imagehorses

Input Image with Scribbles


A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)


Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste

The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Imagehorses

New Backgroundsky.png




Image Colorization Using Optimization in Python


This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.


The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.


The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote \left(x,y,t \right) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different. 


When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles 

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving the system of linear equations of the form WU = b that has a closed form least-square solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. Same thing is to be repeated for the V channel too.
  5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.rgbyiq.png

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j in color scribble
            (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
            Z = 0.
            id1 = ij2id(i,j,w,h) # linearized id
            for (i1, j1) in N: #s = (i1, j1)
                id2 = ij2id(i1,j1,w,h)
                W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                Z += W[id1,id2]
            if Z > 0:
               for (i1, j1) in N: #s = (i1, j1)
                   id2 = ij2id(i1,j1,w,h)
                   W[id1,id2] /= -Z
     for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1.

    rows, cols = zip(*(W.keys())) #keys
    data = W.values() #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 


As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output


Original Gray-scale Image Input 


Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked
Color image Output

Original Gray-scale Image Input


Gray-scale image Input Marked


Color image Output



Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126 . The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive patterns of your own design.


Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.


Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

class Sierpinski:
#Height of an equilateral triangle whose sides are of the specified length.
def height (self, length):
return sqrt(3) * length / 2.
#Draws a filled equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def filledTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
#Draws an empty equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def emptyTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
# Draws a Sierpinski triangle of order n, such that the largest filled
# triangle has bottom vertex (x, y) and sides of the specified length.
def sierpinski(self, n, x, y, length):
self.filledTriangle(x, y, length)
if n > 1:
self.sierpinski(n-1, x-length/2., y, length/2.)
self.sierpinski(n-1, x+length/2., y, length/2.)
self.sierpinski(n-1, x, y+self.height(length), length/2.)

The following animation shows how such a triangle of order 5 is drawn recursively.

The following animation shows how such a triangle of order 6 is drawn recursively.


A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.


Part 2.

Drawing a tree recursively, as described here:


The following shows how a tree of order 10 is drawn:



The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolation, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.


In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:


Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:


The following figure shows a few such transformation matrices:


To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.


To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping


2. Scaling by a factor of 0.5

3. Rotation by 45 degrees around the center of the image


The next animations show rotation and sheer transformations with the Lena image:



Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.


non-linear transformations

The next figure shows the transform functions from here:


The next figures and animations show the application of the above non-linear transforms on the Lena image.









Some more non-linear transforms:




There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if let’s try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image


Down-sampled Image with Bilinear Interpolation

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:


Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal

def create_image_stack(img, max_level):
K = np.ones((3,3)) / 9.
image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
image_stack[:,:,0] = img
for l in range(1, max_level):
image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
boundary=’symm’, mode=’same’)
return image_stack

The next animation shows the image stack created from the bricks image.



Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row,col,k) of the image stack, where all three coordinates are fractional. We therefore something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

def trilerp (img_stack, x, y, k):

if k < 1: k = 1
if  k == int(k):
return bilerp(img_stack[:,:,k-1], x, y)
f_k, c_k = int(floor(k)), int(ceil(k))
v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolationout_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image

Down-sampled Image with Bilinear Interpolation

Down-sampled Image with Anti-aliasing

The following animation shows the branding artifacts created when using bilinear interpolation for  different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation


Down-sampled Images with Anti-aliasing


Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow (using Transfer Learning with a Pre-trained VGG-19 Network) in Python


This problem appeared as an assignment in the online coursera course Convolution Neural Networks by Prof Andrew Ng, (  The description of the problem is taken straightway from the assignment.

Neural Style Transfer algorithm was created by Gatys et al. (2015) , the paper can be found here .

In this assignment, we shall:

  • Implement the neural style transfer algorithm
  • Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we  shall optimize a cost function to get pixel values!

Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

  1. a “content” image (C) and
  2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).


Let’s see how we can do this.

Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low level features (at the earlier layers) and high level features (at the deeper layers). The following figure (taken from the google image search results) shows how a VGG-19 convolution neural net looks like, without the last fully-connected (FC) layers.


We run the following code to load parameters from the pre-trained VGG-19 model serialized in a matlab file. This takes a few seconds.

model = load_vgg_model(“imagenet-vgg-verydeep-19.mat”)
import pprint

{‘avgpool1’: <tf.Tensor ‘AvgPool_5:0’ shape=(1, 150, 200, 64) dtype=float32>,
‘avgpool2’: <tf.Tensor ‘AvgPool_6:0’ shape=(1, 75, 100, 128) dtype=float32>,
‘avgpool3’: <tf.Tensor ‘AvgPool_7:0’ shape=(1, 38, 50, 256) dtype=float32>,
‘avgpool4’: <tf.Tensor ‘AvgPool_8:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘avgpool5’: <tf.Tensor ‘AvgPool_9:0’ shape=(1, 10, 13, 512) dtype=float32>,
‘conv1_1’: <tf.Tensor ‘Relu_16:0’ shape=(1, 300, 400, 64) dtype=float32>,
‘conv1_2’: <tf.Tensor ‘Relu_17:0’ shape=(1, 300, 400, 64) dtype=float32>,
‘conv2_1’: <tf.Tensor ‘Relu_18:0’ shape=(1, 150, 200, 128) dtype=float32>,
‘conv2_2’: <tf.Tensor ‘Relu_19:0’ shape=(1, 150, 200, 128) dtype=float32>,
‘conv3_1’: <tf.Tensor ‘Relu_20:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_2’: <tf.Tensor ‘Relu_21:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_3’: <tf.Tensor ‘Relu_22:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_4’: <tf.Tensor ‘Relu_23:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv4_1’: <tf.Tensor ‘Relu_24:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_2’: <tf.Tensor ‘Relu_25:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_3’: <tf.Tensor ‘Relu_26:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_4’: <tf.Tensor ‘Relu_27:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv5_1’: <tf.Tensor ‘Relu_28:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_2’: <tf.Tensor ‘Relu_29:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_3’: <tf.Tensor ‘Relu_30:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_4’: <tf.Tensor ‘Relu_31:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘input’: <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.


For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.









How to ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.


First we need to compute the “content cost” using TensorFlow.

  • The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
  • When we minimize the content cost later, this will help make sure G
    has similar content as C.

def compute_content_cost(a_C, a_G):
Computes the content cost

a_C — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
a_G — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

J_content — scalar that we need to compute using equation 1 above.

# Retrieve dimensions from a_G
m, n_H, n_W, n_C = a_G.get_shape().as_list()

# Reshape a_C and a_G
a_C_unrolled = tf.reshape(tf.transpose(a_C), (m, n_H * n_W, n_C))
a_G_unrolled = tf.reshape(tf.transpose(a_G), (m, n_H * n_W, n_C))

# compute the cost with tensorflow
J_content = tf.reduce_sum((a_C_unrolled – a_G_unrolled)**2 / (4.* n_H * n_W *  \

return J_content


Computing the style cost

For our running example, we will use the following style image (S). This painting was painted in the style of impressionism, by  Claude Monet .



def gram_matrix(A):
A — matrix of shape (n_C, n_H*n_W)

GA — Gram matrix of A, of shape (n_C, n_C)

GA = tf.matmul(A, tf.transpose(A))
return GA


def compute_layer_style_cost(a_S, a_G):
a_S — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
a_G — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

J_style_layer — tensor representing a scalar value, style cost defined above by equation (2)

# Retrieve dimensions from a_G
m, n_H, n_W, n_C = a_G.get_shape().as_list()

# Reshape the images to have them of shape (n_C, n_H*n_W)
a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))

# Computing gram_matrices for both images S and G (≈2 lines)
GS = gram_matrix(a_S)
GG = gram_matrix(a_G)

# Computing the loss
J_style_layer = tf.reduce_sum((GS – GG)**2 / (4.* (n_H * n_W * n_C)**2))

return J_style_layer


  • The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
  • Minimizing the style cost will cause the image G to follow the style of the image S.


Defining the total cost to optimize

Finally, let’s create and implement a cost function that minimizes both the style and the content cost. The formula is:



def total_cost(J_content, J_style, alpha = 10, beta = 40):
Computes the total cost function

J_content — content cost coded above
J_style — style cost coded above
alpha — hyperparameter weighting the importance of the content cost
beta — hyperparameter weighting the importance of the style cost

J — total cost as defined by the formula above.

J = alpha * J_content + beta * J_style
return J

  • The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
  • α and β are hyperparameters that control the relative weighting between content and style.


Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

  • Create an Interactive Session
  • Load the content image
  • Load the style image
  • Randomly initialize the image to be generated
  • Load the VGG19 model
  • Build the TensorFlow graph:
    • Run the content image through the VGG19 model and compute the content cost.
    • Run the style image through the VGG19 model and compute the style cost
      Compute the total cost.
    • Define the optimizer and the learning rate.
  • Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).

Now, we initialize the “generated” image as a noisy image created from the content_image. By initializing the pixels of the generated image to be mostly noise but still slightly correlated with the content image, this will help the content of the “generated” image more rapidly match the content of the “content” image. The following figure shows the noisy image:


Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

  • Assign the content image to be the input to the VGG model.
  • Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
  • Set a_G to be the tensor giving the hidden layer activation for the same layer.
  • Compute the content cost using a_C and a_G.

Next, we need to compute the style cost and compute the total cost J by taking a linear combination of the two. Use alpha = 10 and beta = 40.

Then we are going to  set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the tensorflow graph, assign the input image (initial generated image) as the input of the VGG19 model and runs the model to minimize the total cost J for a large number of iterations.


The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.



Style (Claud Monet’s The Poppy Field near Argenteuil)


















Style (Van Gogh’s The Starry Night)










Content (Victoria Memorial Hall)


Style (Van Gogh’s The Starry Night)




Content (Taj Mahal)


Style (Van Gogh’s Starry Night Over the Rhone)






Style (Claud Monet’s Sunset in Venice)




Content (Visva Bharati)biswa.jpg

Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )




Content (Howrah Bridge)


Style (Van Gogh’s The Starry Night)





Content (Leonardo Da Vinci’s Mona Lisa)


Style (Van Gogh’s The Starry Night)


Content (My sketch: Rabindranath Tagore)


Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )


Content (me)


Style (Van Gogh’s Irises)













Style (Publo Picaso’s Factory at Horto de Ebro)




The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.



Style (Van Gogh’s The Starry Night)



convolution layer 3_2 used


convolution layer 4_2 used


convolution layer 5_2 used