Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

  • The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
  • The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
  • The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
    the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.


The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

import numpy as np
import sys

### Interface
class Environment(object):

def reset(self):
raise NotImplementedError('Inheriting classes must override reset.')

def actions(self):
raise NotImplementedError('Inheriting classes must override actions.')

def step(self):
raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):

def __init__(self, actions):
self.actions = actions
self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"):
super(BanditEnv, self).__init__()

self.action_space = ActionSpace(range(num_actions))
self.distribution = distribution


self.reward_parameters = None
if distribution == "bernoulli":
self.reward_parameters = np.random.rand(num_actions)
elif distribution == "normal":
self.reward_parameters = (np.random.randn(num_actions),
elif distribution == "heavy-tail":
self.reward_parameters = np.random.rand(num_actions)
print("Please use a supported reward distribution") #, flush = True)

if distribution != "normal":
self.optimal_arm = np.argmax(self.reward_parameters)
self.optimal_arm = np.argmax(self.reward_parameters[0])

def reset(self):
self.is_reset = True
return None

def compute_gap(self, action):
if self.distribution != "normal":
gap = np.absolute(self.reward_parameters[self.optimal_arm] -
gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
return gap

def step(self, action):
self.is_reset = False

valid_action = True
if (action is None or action = self.action_space.n):
print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True)
reward = float("-inf")
gap = float("inf")
valid_action = False

if self.distribution == "bernoulli":
if valid_action:
reward = np.random.binomial(1, self.reward_parameters[action])
gap = self.reward_parameters[self.optimal_arm] -
elif self.distribution == "normal":
if valid_action:
reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
elif self.distribution == "heavy-tail":
if valid_action:
reward = self.reward_parameters[action] + np.random.standard_cauchy()
gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap
print("Please use a supported reward distribution")#, flush = True)

return(None, reward, self.is_reset, '')

#Policy interface
class Policy:
#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
def __init__(self, num_actions):
self.num_actions = num_actions

def act(self):

def feedback(self, action, reward):

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.



  • The next figure shows the reward distributions for a 5-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
  2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
  3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
  4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.






  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
  2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets. 
  3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
  4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).


ε = 0



ε = 0.05



ε = 0.1


ε = 0.15


ε = 1.0



Optimistic Greedy


R = 0


R = 1




R = 3



R = 5



R = 10000



The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.


  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.


Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
  2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
  3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
  4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
  5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.




Thompson Beta


Thompson Normal



Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.


Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:


  • The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
  • The 4-grams are chosen so that all the words involved come from a small
    vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
  • Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
ans =
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left

  • The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
  • Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

  • The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
  • Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
%ans =
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
%ans = 91
%ans = we

  • The validation and test data are also similar. They contain 46,568 4-grams each.
  • Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
  • The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
  • First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
  • At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
  • The training method will output a ‘model’ (weight matrices, biases for each layer in the network).


Description of the Network


  • As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
  • The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
  • This layer is connected to the hidden layer, which in turn is connected to the output layer.
  • The output layer is a softmax over the 250 words.
  • The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.


Forward Propagation

  • The forward propagation is pretty straight-forward and can be implemented as shown in the following code:
    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
     fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
     hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    % input_batch: The input data as a matrix of size numwords X batchsize where,
    % numwords is the number of words, batchsize is the number of data points.
    % So, if input_batch(i, j) = k then the ith word in data point j is word
    % index k of the vocabulary.
    % word_embedding_weights: Word embedding as a matrix of size
    % vocab_size X numhid1, where vocab_size is the size of the vocabulary
    % numhid1 is the dimensionality of the embedding space.
    % embed_to_hid_weights: Weights between the word embedding layer and hidden
    % layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
    % number of hidden units.
    % hid_to_output_weights: Weights between the hidden layer and output softmax
    % unit as a matrix of size numhid2 X vocab_size
    % hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    % output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    % Outputs:
    % embedding_layer_state: State of units in the embedding layer as a matrix of
    % size numhid1*numwords X batchsize
    % hidden_layer_state: State of units in the hidden layer as a matrix of size
    % numhid2 X batchsize
    % output_layer_state: State of units in the output layer as a matrix of size
    % vocab_size X batchsize
    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
     word_embedding_weights(reshape(input_batch, 1, []),:)',...
     numhid1 * numwords, []);
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
     repmat(hid_bias, 1, batchsize);
    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);
    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs <= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
     - repmat(max(inputs_to_softmax), vocab_size, 1);
    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);
    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
     sum(output_layer_state, 1), vocab_size, 1);




  •  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.



  • As the model trains it prints out some numbers that tell how well the training is going.
  • The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
  • After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
  • The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
  • At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.


Some Applications

1. Predict next word

  • Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
  • The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
  • It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.




2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

  1. Given 3 words to start from, initialize the text with those 3 words.
  2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
  3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
  4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:

function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');


Starting with the words  'i was going‘, here are some texts that were generated using the model:


Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:


3. Find nearest words

  •  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
  • As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
  • For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.




4. Visualization in 2-dimension with t-SNE

  •  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
  • As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
  • Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).



5. Solving Word-Analogy Problem

  •  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
  • The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.


  • The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.
    analogy('year', 'years', 'day', model); % singular-plural relation
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
    analogy('on', 'off', 'new', model) % antonyms relation
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
    analogy('use', 'used', 'do', model) % present-past relation
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
    analogy('he', 'his', 'they', model) % pronoun-relations
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
    analogy('today', 'yesterday', 'now', model)
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


Model Selection

  • Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
  • The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
  • The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.



Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the coursera course Convolution Networks which is a part of the Deep Learning Specialization (taught by Prof. Andrew Ng.,  from Stanford and, the lecture videos corresponding to the YOLO algorithm can be found here).  The problem description is taken straightaway from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

Some Theory

Let’s first clear the concepts regarding classification, localization, detection and how the object detection problem can be transformed to supervised machine learning problem and subsequently can be solved using a deep convolution neural network. As can be seen from the next figure,

  • Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether there is a car in an image or not), but also finding a bounding box around the object, if one found.
  • Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
  • The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
  • A typical output data vector will contain 8 entries for a 4-class classification, as shown in the next figure, the first entry will correspond to whether or not an object of any from the 3 classes of objects. In case one is present in an image, the next 4 entries will define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. In case none of the objects are present, the first entry will be 0 and the others will be ignored.



  • Now moving from localization to detection, one can proceed in two steps as shown below in the next figure: first use small tightly cropped images to train a convolution neural net for image classification and then use sliding windows of different window sizes (smaller to larger) to classify a test image within that window using the convnet learnt and run the windows sequentially through the entire image, but it’s infeasibly slow computationally.
  • However, as shown in the next figure, the convolutional implementation of the sliding windows by replacing the fully-connected layers by 1×1 filters makes it possible to simultaneously classify the image-subset inside all possible sliding windows parallelly, making it much more efficient computationally.



  • The convolutional sliding windows, although computationally much more efficient, still has the problem of detecting the accurate bounding boxes, since the boxes don’t align with the sliding windows and the object shapes also tend to be different.
  • YOLO algorithm overcomes this limitation by dividing a training image into grids and assigning an object to a grid if and only if the center of the object falls inside the grid, that way each object in a training image can get assigned to exactly one grid and then the corresponding bounding box is represented by the coordinates relative to the grid. The next figure described the details of the algorithm.
  • In the test images, multiple adjacent grids may think that an object actually belongs to them, in order to resolve the iou (intersection of union) measure is used to find the maximum overlap and the non-maximum-suppression algorithm is used to discard all the other bounding boxes with low-confidence of containing an object, keeping the one with the highest confidence among the competing ones and discard the others.
  • Still there is a problem of multiple objects falling in the same grid. Multiple anchor boxes (of different shapes) are used to resolve the problem, each anchor box of a particular shape being likely to eventually detect  an object of a particular shape.



The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection in the CVPR 2016 summarizing the algorithm:


Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank for providing this dataset! is a company building the brains of self-driving vehicles.


We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

Definition of a box


If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.



YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

Model details

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3).
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO


If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two last dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two last dimensions



Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

Find the class detected by each box


Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:


Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:


Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
  • Select only one box when several boxes overlap with each other and detect the same object.


Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

  • box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
  • boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
  • box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

  • Compute box scores by doing the element-wise product as described in the above figure.
  • For each box, find:
    • the index of the class with the maximum box score.
    • the corresponding box score.
  • Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
  • Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
    We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
 """Filters YOLO boxes by thresholding on object and class confidence.

 box_confidence -- tensor of shape (19, 19, 5, 1)
 boxes -- tensor of shape (19, 19, 5, 4)
 box_class_probs -- tensor of shape (19, 19, 5, 80)
 threshold -- real value, if [ highest class probability score = threshold)

 # Step 4: Apply the mask to scores, boxes and classes

return scores, boxes, classes


Non-max suppression

Even after filtering by thresholding over the classes scores, we still end up a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).


n this example, the model has predicted 3 cars, but it’s actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”



Exercise: Implement iou(). Some hints:

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
  • We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    xi1 = maximum of the x1 coordinates of the two boxes
    yi1 = maximum of the y1 coordinates of the two boxes
    xi2 = minimum of the x2 coordinates of the two boxes
    yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.

def iou(box1, box2):
 """Implement the intersection over union (IoU) between box1 and box2

 box1 -- first box, list object with coordinates (x1, y1, x2, y2)
 box2 -- second box, list object with coordinates (x1, y1, x2, y2)

# Calculate the (y1, x1, y2, x2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou


We are now ready to implement non-max suppression. The key steps are:

  • Select the box that has the highest score.
  • Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  • Go back to step 1 and iterate until there’s no more boxes with a lower score than the current selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so we don’t actually need to use your iou() implementation):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
 Applies Non-max suppression (NMS) to set of boxes

 scores -- tensor of shape (None,), output of yolo_filter_boxes()
 boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
 classes -- tensor of shape (None,), output of yolo_filter_boxes()
 max_boxes -- integer, maximum number of predicted boxes you'd like
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (, None), predicted score for each box
 boxes -- tensor of shape (4, None), predicted box coordinates
 classes -- tensor of shape (, None), predicted class for each box

 Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
 function will transpose the shapes of scores, boxes, classes. This is made for convenience.

 max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
 K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

 # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

 # Use K.gather() to select only nms_indices from scores, boxes and classes

 return scores, boxes, classes


Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing this data on a different size image – for example, the car detection dataset had 720×1280 images – his step rescales the boxes so that they can be plotted on top of the original 720×1280 image.

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
 Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

 yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
 box_confidence: tensor of shape (None, 19, 19, 5, 1)
 box_xy: tensor of shape (None, 19, 19, 5, 2)
 box_wh: tensor of shape (None, 19, 19, 5, 2)
 box_class_probs: tensor of shape (None, 19, 19, 5, 80)
 image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
 max_boxes -- integer, maximum number of predicted boxes you'd like
 score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (None, ), predicted score for each box
 boxes -- tensor of shape (None, 4), predicted box coordinates
 classes -- tensor of shape (None,), predicted class for each box

 # Retrieve outputs of the YOLO model

 # Convert boxes to be ready for filtering functions 

 # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

 # Scale boxes back to original image shape.

 # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold 

 return scores, boxes, classes


Summary for YOLO:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    • Each cell in a 19×19 grid over the input image gives 425 numbers.
    • 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
    • 85 = 5 + 80 where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and and 80 is the number of classes we’d like to detect.
  • We then select only few boxes based on:
    • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
    • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
  • This gives us YOLO’s final output.


Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to start your graph. Run the following cell.

sess = K.get_session()

Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes(“coco_classes.txt”)
anchors = read_anchors(“yolo_anchors.txt”)
image_shape = (720., 1280.)


Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model(“yolo.h5”)

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.


Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 608, 608, 3) 0
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672


Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We added yolo_outputs to your graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Lets now call yolo_eval, which you had previously implemented, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Run the graph on an image

Let the fun begin. We have created a (sess) graph that can be summarized as follows:

  1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
  2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
  3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image(“images/” + image_file, model_image_size = (608, 608))

which outputs:

  • image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
  • image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.

def predict(sess, image_file):
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the preditions.

sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.

out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes

Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.

 # Preprocess your image

 # Run the session with the correct tensors and choose the correct placeholders in the
 # feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

 # Print predictions info
print('Found {} boxes for {}'.format(len(out_boxes), image_file))
# Generate colors for drawing bounding boxes.
colors = generate_colors(class_names)
# Draw bounding boxes on the image file
draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
# Save the predicted bounding box on the image"out", image_file), quality=90)
# Display the results
output_image = scipy.misc.imread(os.path.join("out", image_file))

 return out_scores, out_boxes, out_classes

Let’s Run the following cell on the following “test.jpg” image to verify that our function is correct.


out_scores, out_boxes, out_classes = predict(sess, “test.jpg”)

The following figure shows the output after car detection. Each of the bounding boxes have the name of the object detected on the top left along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)


The following animation shows the output Images with detected objects (cars) using YOLO for a set of input images.


What we should remember:

  • YOLO is a state-of-the-art object detection model that is fast and accurate.
  • It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
  • The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
  • You filter through all the boxes using non-max suppression. Specifically:
    Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
  • Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
  • Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise.


References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

  1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
  2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
  3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
  4. The official YOLO website .

Car detection dataset: Creative Commons License.

The Sample Dataset (provided by is licensed under a Creative Commons Attribution 4.0 International License.

Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.


Problem Statement

Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.


  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
    w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
    I1g = I1g / 255. # normalize pixels
    I2g = I2g / 255. # normalize pixels
    # Implement Lucas Kanade
    # for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            #b = ... # get b here
            #A = ... # get A here
            # if threshold τ is larger than the smallest eigenvalue of A'A:
            nu = ... # get velocity here

    return (u,v)


Some Results

  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
  • Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences



Output Optical Flow with different window sizes

window size = 15


window size = 21



Input Sequences

Output Optical Flow


Input Sequences (hamburg taxi)


Output Optical Flowtaxi_opt


Input Sequences


Output Optical Flow

Input Sequences


Output Optical Flowseq_opt

Input Sequences    fount3.gif


Output Optical Flowfount_opt

Input Sequences

Output Optical Flow

Input Sequencessynth

Output Optical Flowsynth_opt

Input Sequencescars1
Output Optical Flowcars1_optcars1_cmap

Input Sequencescars2

Output Optical Flowcars2_opt

Output Optical Flowcars2_opt2cars2_cmap

Input Sequences



Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25
Output Optical Flow with window size 45cars3_opt2_45


Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described, this algorithm was proposed by Felzenszwalb et. al. from MIT in this paper.  The slides on this paper can be found from this link from the Stanford Vision Lab too. The algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, as stated by the author and hence can  be
implemented to run in O(m log m) time, where m is the number of edges in the graph.

Problem Definition and the basic idea (from the paper)

  • Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

  • In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).

  • Particularly for the implementation described here, an edge weight function based on the absolute intensity difference (in the yiq space) between the pixels connected by an edge, w((vi, vj )) = |I(pi) − I(pj )|.

  • In the graph-based approach, a segmentation S is a partition of V into components
    such that each component (or region) C ∈ S corresponds to a connected component
    in a graph G0 = (V, E0), where E0 ⊆ E.

  • In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.

  • This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.

  • The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.


  • The threshold function τ controls the degree to which the difference between two
    components must be greater than their internal differences in order for there to be
    evidence of a boundary between them.

  • For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ (C) = k/|C| is needed to be usedwhere |C| denotes the size of C, and k is some constant parameter.

  • That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.

  • In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.

  • The following python code shows how to create the graph.

import numpy as np
from scipy import signal
import matplotlib.image as mpimg

def gaussian_kernel(k, s = 0.5):
    # generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
    probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
    # create the pixel graph with edge weights as dissimilarities
     rgb = mpimg.imread(imfile)[:,:,:3]
     gauss_kernel = gaussian_kernel(sz, sigma)
     for i in range(3):
         rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
     yuv = rgb2yiq(rgb)
     (w, h) = yuv.shape[:2]
     edges = {}
     for i in range(yuv.shape[0]):
         for j in range(yuv.shape[1]):
             #compute edge weight for nbd pixel nodes for the node i,j
             for i1 in range(i-1, i+2):
                 for j1 in range(j-1, j+2):
                     if i1 == i and j1 == j: continue

                     if i1 >= 0 and i1 = 0 and j1 < h:
                        wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
                        n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
                        edges[n1, n2] = edges[n2, n1] = wt
     return edges


Some Results

  • The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

  • Although in the paper the author described the best value of the parameter k to be around 300, but  since in this implementation the pixel RGB values are normalized (to have values in between 0 – 1) and then converted to YIQ values and the YIQ intensities are used for computing the weights (which are typically very small), the value of k that works best in this scenario is 0.001-0.01.

  • As we can see from the below results, higher the value of the parameter k, larger the size of the final component and lesser the number of components in the result.

  • The minimum spanning tree creation is also shown, the red edges shown in the figures are the edges chosen by the algorithm to merge the components.

Input Image


Output Images for two different values of the parameter k



out_ 0.001player.png

out_ 0.010player.png

The forest created after a few iterations


Input Image


Output Images for two different values of the parameter k


out_ 0.010hill.png


out_ 0.001hill.png

The forest created after a few iterations


Input Image


Output Segmented Images

parrot_k_0.001out_ 0.001parrot_Dark2.pngout_ 0.001parrot_hot.png

Input Image


Segmented Output Images

out_ 0.001road2_Set1.png

Input Image


Output Images for two different values of the parameter k


out_ 0.001road.png
road_k_0.01out_ 0.010road.png

The forest created after a few iterations



  Input Image (UMBC)


Segmented Output Images

umbc_k_0.001out_ 0.001umbc.png

    Input Image


Segmented Output Image with k=0.001

out_ 0.001nj.png

    Input Image


Segmented Output Image with k=0.001


    Input Image


Segmented Output Image with k=0.001

out_ 0.001flowers2.png

    Input Image (liver)


Segmented Output Images 

out_ 0.001liver.pngout_ 0.010liver.png

    Input Image


Segmented Output Images with different values of k


out_ 0.001building.pngout_ 0.010building.png

    Input Image


Segmented Output Image

out_ 0.001frame056.png

    Input Image


Segmented Output Images for different k



Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed. and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boycov and Jolli in this paper. This problem appeared as a homework assignment here., and also in this lecture video from the Coursera image processing course by Duke university.

Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can be difficult sometimes to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines using which our algorithm is going to identify the foreground and the background, after that it will be the algorithms job to obtain a complete segmentation of the foreground from the background image.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image       bj


The Graph-Cut Algorithm


The following describes how the segmentation problem is transformed into a graph-cut problem:

  1. Let’s first define the Directed Graph G = (V, E) as follows:

    1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

    2. Next, let’s defines the edges of the graph. As obvious, there is going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).

    3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.

    4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.

  2. Now let’s describe how to compute the edge weights in this graph.

    1. In order to compute the terminal edge weights, we need to estimate the feature distributions first, i.e., starting with the assumption that each of the nodes corresponding to the scribbled pixels have the probability 1.0 (since we want the solution to respect the regional hard constraints marked by the user-seeds / scribbles) to be in foreground or background object in the image (distinguished by the scribble color, e.g.), we have to compute the probability that a node belongs to the foreground (or background) for all the other non-terminal nodes.

    2. The simplest way to compute P_F and P_B is to first fit a couple of  Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
      with MLE from the scribbled pixel intensities and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels as shown in the next figures. The following code fragment show how the pdfs are being computed.

      import numpy as np
      from collections import defaultdict
      def compute_pdfs(imfile, imfile_scrib):
          # Compute foreground and background pdfs
          # input image and the image with user scribbles
           rgb = mpimg.imread(imfile)[:,:,:3]
           yuv = rgb2yiq(rgb)
           rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
           yuv_s = rgb2yiq(rgb_s)
           # find the scribble pixels
           scribbles = find_marked_locations(rgb, rgb_s)
           imageo = np.zeros(yuv.shape)
           # separately store background and foreground scribble pixels in the dictionary comps
           comps = defaultdict(lambda:np.array([]).reshape(0,3))
           for (i, j) in scribbles:
                imageo[i,j,:] = rgbs[i,j,:]
                # scribble color as key of comps
                comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
                mu, Sigma = {}, {}
           # compute MLE parameters for Gaussians
           for c in comps:
                mu[c] = np.mean(comps[c], axis=0)
                Sigma[c] = np.cov(comps[c].T)
           return (mu, Sigma)

      3. In order to compute the non-terminal edge weights, we need to compute the similarities in between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (e.g., how similar neighborhood pixels are in RGB / YIQ space).

  3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.

  4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).

  5. There are several standard algorithms, e.g., Ford-Fulkerson (by finding an augmenting path with O(E max| f |) time complexity) or Edmonds-Karp (by using bfs to find the shortest augmenting path, with O(VE2) time complexity) to solve the max-flow problem, typical implementations of these algorithms run pretty fast, in polynomial time in V, E.  Here we are going to use a different implementation (with pymaxflow) based on Vladimir Kolmogorov, which is shown to run faster on many images empirically in this paper.







The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.

Input Imagedeer

Input Image with Scribblesdeer_scribed

Fitted Densities from Color Scribblesgaussian_deer.png


A Tiny Sub-graph with Min-Cut


Input Image

Input Image with Scribbleseagle_scribed

Fitted Densities from Color Scribblesgaussian_eagle



A Tiny Sub-graph with Min-Cut flow_eagle

Input Image (liver)                                             

Input Image with Scribblesliver_scribed

Fitted Densities from Color Scribbles


Input Imagefishes2

Input Image with Scribbles

Fitted Densities from Color Scribblesgaussian_fishes2


A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribblespanda_scribed


Input Imagebunny

Input Image with Scribblesbunny_scribed

Fitted Densities from Color Scribbles



A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image                                                         Input Image with Scribbles
me3    me3_scribed


Input Imageinput1

Input Image with Scribblesinput1_scribed


A Tiny Sub-graph of the flow-graph with Min-Cut


Input Imagerose

Input Image with Scribbles

Fitted Densities from Color Scribbles


Input Image

Input Image with Scribbleswhale_scribed

Fitted Densities from Color Scribblesgaussian_whale

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image (UMBC Campus Map)umbc_campus

Input Image with Scribbles


Input Imagehorses

Input Image with Scribbles


A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)


Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste

The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Imagehorses

New Backgroundsky.png




Image Colorization Using Optimization in Python


This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.


The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.


The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote \left(x,y,t \right) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different. 


When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles 

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving the system of linear equations of the form WU = b that has a closed form least-square solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. Same thing is to be repeated for the V channel too.
  5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.rgbyiq.png

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j in color scribble
            (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
            Z = 0.
            id1 = ij2id(i,j,w,h) # linearized id
            for (i1, j1) in N: #s = (i1, j1)
                id2 = ij2id(i1,j1,w,h)
                W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                Z += W[id1,id2]
            if Z > 0:
               for (i1, j1) in N: #s = (i1, j1)
                   id2 = ij2id(i1,j1,w,h)
                   W[id1,id2] /= -Z
     for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1.

    rows, cols = zip(*(W.keys())) #keys
    data = W.values() #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 


As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output


Original Gray-scale Image Input 


Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked
Color image Output

Original Gray-scale Image Input


Gray-scale image Input Marked


Color image Output



Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126 . The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive patterns of your own design.


Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.


Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

class Sierpinski:
#Height of an equilateral triangle whose sides are of the specified length.
def height (self, length):
return sqrt(3) * length / 2.
#Draws a filled equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def filledTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
#Draws an empty equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def emptyTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
# Draws a Sierpinski triangle of order n, such that the largest filled
# triangle has bottom vertex (x, y) and sides of the specified length.
def sierpinski(self, n, x, y, length):
self.filledTriangle(x, y, length)
if n > 1:
self.sierpinski(n-1, x-length/2., y, length/2.)
self.sierpinski(n-1, x+length/2., y, length/2.)
self.sierpinski(n-1, x, y+self.height(length), length/2.)

The following animation shows how such a triangle of order 5 is drawn recursively.

The following animation shows how such a triangle of order 6 is drawn recursively.


A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.


Part 2.

Drawing a tree recursively, as described here:


The following shows how a tree of order 10 is drawn:



The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolation, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.


In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:


Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:


The following figure shows a few such transformation matrices:


To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.


To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping


2. Scaling by a factor of 0.5

3. Rotation by 45 degrees around the center of the image


The next animations show rotation and sheer transformations with the Lena image:



Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.


non-linear transformations

The next figure shows the transform functions from here:


The next figures and animations show the application of the above non-linear transforms on the Lena image.









Some more non-linear transforms:




There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if let’s try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image


Down-sampled Image with Bilinear Interpolation

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:


Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal

def create_image_stack(img, max_level):
K = np.ones((3,3)) / 9.
image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
image_stack[:,:,0] = img
for l in range(1, max_level):
image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
boundary=’symm’, mode=’same’)
return image_stack

The next animation shows the image stack created from the bricks image.



Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row,col,k) of the image stack, where all three coordinates are fractional. We therefore something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

def trilerp (img_stack, x, y, k):

if k < 1: k = 1
if  k == int(k):
return bilerp(img_stack[:,:,k-1], x, y)
f_k, c_k = int(floor(k)), int(ceil(k))
v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolationout_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image

Down-sampled Image with Bilinear Interpolation

Down-sampled Image with Anti-aliasing

The following animation shows the branding artifacts created when using bilinear interpolation for  different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation


Down-sampled Images with Anti-aliasing


Hand-Gesture Classification using Deep Convolution and Residual Neural Network (ResNet-50) with Tensorflow / Keras in Python

In this article, first an application of convolution net to classify a set of hand-sign images is going to be discussed.  Later the accuracy of this classifier will be improved using a deep res-net. These problems appeared as assignments in the Coursera course Convolution Neural Networks (a part of deep-learning specialization) by the Stanford Prof. Andrew Ng. ( The problem descriptions are taken straightaway from the course itself.

1. Hand-gesture Classification with Convolution Neural Network

In this assignment, the following tasks are going to be accomplished:

  • Implement a fully functioning ConvNet using TensorFlow.
  • Build and train a ConvNet in TensorFlow for a classification problem

This assignment is going to be done using tensorflow.

First the necessary packages are loaded:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import tensorflow as tf
from tensorflow.python.framework import ops
from cnn_utils import *

%matplotlib inline

Next the “SIGNS” dataset is loaded that we are going to use. The SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5, as shown in the next figure. The output classes are shown with one hot encoding.

# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
signs_data_kiank.pngThe next figures show a few randomly sampled images for each class label from the training dataset. There are 180 images for each class and a total of 108 images in the training dataset.

number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

The following steps are to be executed to train a conv-net model with tensorflow using the trainign dataset and then classify the images from the test dataset using the model.

Create placeholders

TensorFlow requires that we create placeholders for the input data that will be fed into the model when running the session.

Let’s implement the function below to create placeholders for the input image X and the output Y. We should not define the number of training examples for the moment. To do so, we could use “None” as the batch size, it will give us the flexibility to choose it later. Hence X should be of dimension [None, n_H0, n_W0, n_C0] and Y should be of dimension  [None, n_y].

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    Creates the placeholders for the tensorflow session.
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"

    X = tf.placeholder(tf.float32, shape=(None, n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None, n_y))
    return X, Y


Initialize parameters

Let’s initialize weights/filters W1 and Wusing xavier_initializer.
We don’t need to worry about bias variables as you will soon see that TensorFlow functions take care of the bias. Note also that you will only initialize the weights/filters for the conv2d functions. TensorFlow initializes the layers for the fully connected part automatically.

def initialize_parameters():
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    parameters -- a dictionary of tensors containing W1, W2
    tf.set_random_seed(1)                              # so that our "random" numbers match ours
    W1 = tf.get_variable("W1", (4, 4, 3, 8), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 
    W2 = tf.get_variable("W2", (2, 2, 8, 16), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 

    parameters = {"W1": W1,
                  "W2": W2}
    return parameters

Forward propagation

Next we need to implement the forward_propagation function below to build the following model:

We need to use the following built-in tensorflow functions:

  • tf.nn.conv2d(X,W1, strides = [1,s,s,1], padding = ‘SAME’): given an input XX and a group of filters W1W1, this function convolves W1W1‘s filters on X. The third input ([1,f,f,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev). You can read the full documentation here
  • tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = ‘SAME’): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. You can read the full documentation here
  • tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape). You can read the full documentation here.
  • tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector it while maintaining the batch-size. It returns a flattened tensor with shape [batch_size, k]. You can read the full documentation here.
  • tf.contrib.layers.fully_connected(F, num_outputs): given a the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation here.

In detail, we will use the following parameters for all the steps:

 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
 - Flatten the previous output.
 - FULLYCONNECTED (FC) layer: Apply a fully connected layer without an non-linear activation function. Do not call the softmax here. This will result in 6 neurons in the output layer, which then get passed later to a softmax. In TensorFlow, the softmax and cost function are lumped together into a single function, which you'll call in a different function when computing the cost. 
def forward_propagation(X, parameters):
    Implements the forward propagation for the model:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Z3 -- the output of the last LINEAR unit

Compute cost

Next step is to implement the compute cost function using the following tensorflow functions:

  • tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax entropy loss. This function both computes the softmax activation function as well as the resulting loss. You can check the full documentation here.
  • tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to sum the losses over all the examples to get the overall cost. You can check the full documentation here.
def compute_cost(Z3, Y):
    Computes the cost
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    cost - Tensor of the cost function


Finally we need to merge the helper functions we implemented above to build a model and train it on the SIGNS dataset.

The model should:

  • create placeholders
  • initialize parameters
  • forward propagate
  • compute the cost
  • create an optimizer

Finally we need to create a session and run a for loop for num_epochs, get the mini-batches, and then for each mini-batch you will optimize the function.

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    Implements a three-layer ConvNet in Tensorflow:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- test set, of shape (None, n_y = 6)
    X_test -- training set, of shape (None, 64, 64, 3)
    Y_test -- test set, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.

Then let’s train the model for 100 epochs.

_, _, parameters = model(X_train, Y_train, X_test, Y_test)
 with the following output:
Cost after epoch 0: 1.918487
Cost after epoch 5: 1.875008
Cost after epoch 10: 1.813409
Cost after epoch 15: 1.667654
Cost after epoch 20: 1.444399
Cost after epoch 25: 1.203926
Cost after epoch 30: 1.028009
Cost after epoch 35: 0.887578
Cost after epoch 40: 0.791803
Cost after epoch 45: 0.712319
Cost after epoch 50: 0.655244
Cost after epoch 55: 0.597494
Cost after epoch 60: 0.556236
Cost after epoch 65: 0.525260
Cost after epoch 70: 0.484548
Cost after epoch 75: 0.477365
Cost after epoch 80: 0.451908
Cost after epoch 85: 0.415393
Cost after epoch 90: 0.386501
Cost after epoch 95: 0.373167


Tensor(“Mean_1:0”, shape=(), dtype=float32)
Train Accuracy: 0.894444
Test Accuracy: 0.841667


2. Improving the Accuracy of the Hand-Gesture Classifier with Residual Networks

Now we shall learn how to build very deep convolutional networks, using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions; but in practice, they are hard to train. Residual Networks, introduced by He et al., allow to train much deeper networks than were previously practically feasible.

In this assignment, the following tasks we are going to accomplish:

  • Implement the basic building blocks of ResNets.
  • Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

This assignment will be done in Keras.

Let’s first load the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot_ng as pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline
import keras.backend as K

The problem of very deep neural networks

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow.

During training, we might therefore see the magnitude (or norm) of the gradient for the earlier layers descrease to zero very rapidly as training proceeds:


We are now going to solve this problem by building a Residual Network!

Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly back-propagated to earlier layers:


The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, we can form a very deep network.

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different. We are going to implement both of them.

1 – The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

idblock2_kiank.pngThe upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step. 

In this exercise, we’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:


Here’re the individual steps.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of shape (f,fand a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

  • The shortcut and the input are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Now let’s implement the ResNet identity block.

  • To implement the Conv2D step: See reference
  • To implement BatchNorm: See reference (axis: Integer, the axis that should be normalized (typically the channels axis))
  • For the activation, use: Activation('relu')(X)
  • To add the value passed forward by the shortcut: See reference
defidentity_block(X, f, filters, stage, block):
    Implementation of the identity block as defined in Figure 3
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    ### The first Component ###
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # Retrieve Filters
    F1, F2, F3 = filters
    # Save the input value. You'll need this later to add back to the main path. 
    X_shortcut = X
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### The second Component ###

    # ...

    ### The third Component ###

    # ...

    return X

2 – The convolutional block

Next, the ResNet “convolutional block” is the other type of block. We can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:


The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. For example, to reduce the activation dimensions’s height and width by a factor of 2, we can use a 1×1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

The details of the convolutional block are as follows.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of (f,f) and a stride of (1,1). Its padding is “same” and it’s name should be conv_name_base + '2b'.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of (1,1) and a stride of (1,1). Its padding is “valid” and it’s name should be conv_name_base + '2c'.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

  • The CONV2D has F3 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
  • The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

  • The shortcut and the main path values are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Let’s now implement the convolutional block.

defconvolutional_block(X, f, filters, stage, block, s = 2):
    Implementation of the convolutional block as defined in Figure 4
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # Retrieve Filters
    F1, F2, F3 = filters
    # Save the input value
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path 
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    # Second component of main path
    # ...
    # Third component of main path
    # ...
    ##### SHORTCUT PATH ####
    # ...
    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    # ...
    return X


3 – Building our first ResNet model (50 layers)

We now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means we should stack 3 identity blocks together.


The details of this ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
    • BatchNorm is applied to the channels axis of the input.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three set of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
    • The 2 identity blocks use three set of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
  • Stage 3:
    • The convolutional block uses three set of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
    • The 3 identity blocks use three set of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
  • Stage 4:
    • The convolutional block uses three set of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
    • The 5 identity blocks use three set of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
  • Stage 5:
    • The convolutional block uses three set of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
    • The 2 identity blocks use three set of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
  • The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
  • The flatten doesn’t have any hyperparameters or name.
  • The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Let’s implement the ResNet with 50 layers described in the figure above.

We’ll need to use this function:

Here’re some other functions we used in the code below:

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    Implementation of the popular ResNet50 the following architecture:

    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    model -- a Model() instance in Keras
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # ...

    # ...

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

Next, let’s build the model’s graph. We have 6 output classes for the hand-signs dataset.

model = ResNet50(input_shape = (64, 64, 3), classes = 6)

We need to configure the learning process by compiling the model.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The model is now ready to be trained. The only thing we need is to pass the same hand-signs dataset that we used earlier. We need to load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

Now let’s train our  resnet model on 20 epochs with a batch size of 32., Y_train, epochs = 20, batch_size = 32)
Epoch 1/20
1080/1080 [==============================] - 173s - loss: 2.0610 - acc: 0.3435   
Epoch 2/20
1080/1080 [==============================] - 149s - loss: 1.8561 - acc: 0.4259   
Epoch 3/20
1080/1080 [==============================] - 147s - loss: 2.0284 - acc: 0.4343   
Epoch 4/20
1080/1080 [==============================] - 151s - loss: 1.7140 - acc: 0.4500   
Epoch 5/20
1080/1080 [==============================] - 134s - loss: 1.4401 - acc: 0.5676   
Epoch 6/20
1080/1080 [==============================] - 128s - loss: 1.1950 - acc: 0.6481   
Epoch 7/20
1080/1080 [==============================] - 129s - loss: 0.9886 - acc: 0.7426   
Epoch 8/20
1080/1080 [==============================] - 133s - loss: 1.2155 - acc: 0.6843   
Epoch 9/20
1080/1080 [==============================] - 131s - loss: 0.8536 - acc: 0.8185   
Epoch 10/20
1080/1080 [==============================] - 132s - loss: 0.9502 - acc: 0.7565   
Epoch 11/20
1080/1080 [==============================] - 129s - loss: 0.8180 - acc: 0.8111   
Epoch 12/20
1080/1080 [==============================] - 130s - loss: 0.7060 - acc: 0.8343   
Epoch 13/20
1080/1080 [==============================] - 130s - loss: 0.8687 - acc: 0.8148   
Epoch 14/20
1080/1080 [==============================] - 130s - loss: 0.8282 - acc: 0.8509   
Epoch 15/20
1080/1080 [==============================] - 130s - loss: 0.9303 - acc: 0.7972   
Epoch 16/20
1080/1080 [==============================] - 146s - loss: 1.1211 - acc: 0.7870   
Epoch 17/20
1080/1080 [==============================] - 143s - loss: 0.9337 - acc: 0.7824   
Epoch 18/20
1080/1080 [==============================] - 150s - loss: 0.3976 - acc: 0.8870   
Epoch 19/20
1080/1080 [==============================] - 143s - loss: 0.2532 - acc: 0.9407   
Epoch 20/20
1080/1080 [==============================] - 133s - loss: 0.2528 - acc: 0.9556

Let’s see how this model performs on the test set.

preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
Loss = 0.36906948487
Test Accuracy = 0.891666662693

We can also print a summary of your model by running the following code.

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 64, 64, 3)     0                                            
zero_padding2d_1 (ZeroPadding2D) (None, 70, 70, 3)     0                                            
conv1 (Conv2D)                   (None, 32, 32, 64)    9472                                         
bn_conv1 (BatchNormalization)    (None, 32, 32, 64)    256                                          
activation_4 (Activation)        (None, 32, 32, 64)    0                                            
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 64)    0                                            
res2a_branch2a (Conv2D)          (None, 15, 15, 64)    4160                                         
bn2a_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_5 (Activation)        (None, 15, 15, 64)    0                                            
res2a_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2a_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_6 (Activation)        (None, 15, 15, 64)    0                                            
res2a_branch1 (Conv2D)           (None, 15, 15, 256)   16640                                        
res2a_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2a_branch1 (BatchNormalization (None, 15, 15, 256)   1024                                         
bn2a_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_2 (Add)                      (None, 15, 15, 256)   0                                            
activation_7 (Activation)        (None, 15, 15, 256)   0                                            
res2b_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
bn2b_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_8 (Activation)        (None, 15, 15, 64)    0                                            
res2b_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2b_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_9 (Activation)        (None, 15, 15, 64)    0                                            
res2b_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2b_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_3 (Add)                      (None, 15, 15, 256)   0                                            
activation_10 (Activation)       (None, 15, 15, 256)   0                                            
res2c_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
bn2c_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_11 (Activation)       (None, 15, 15, 64)    0                                            
res2c_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2c_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_12 (Activation)       (None, 15, 15, 64)    0                                            
res2c_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2c_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_4 (Add)                      (None, 15, 15, 256)   0                                            
activation_13 (Activation)       (None, 15, 15, 256)   0                                            
res3a_branch2a (Conv2D)          (None, 8, 8, 128)     32896                                        
bn3a_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_14 (Activation)       (None, 8, 8, 128)     0                                            
res3a_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3a_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_15 (Activation)       (None, 8, 8, 128)     0                                            
res3a_branch1 (Conv2D)           (None, 8, 8, 512)     131584                                       
res3a_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3a_branch1 (BatchNormalization (None, 8, 8, 512)     2048                                         
bn3a_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_5 (Add)                      (None, 8, 8, 512)     0                                            
activation_16 (Activation)       (None, 8, 8, 512)     0                                            
res3b_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3b_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_17 (Activation)       (None, 8, 8, 128)     0                                            
res3b_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3b_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_18 (Activation)       (None, 8, 8, 128)     0                                            
res3b_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3b_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_6 (Add)                      (None, 8, 8, 512)     0                                            
activation_19 (Activation)       (None, 8, 8, 512)     0                                            
res3c_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3c_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_20 (Activation)       (None, 8, 8, 128)     0                                            
res3c_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3c_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_21 (Activation)       (None, 8, 8, 128)     0                                            
res3c_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3c_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_7 (Add)                      (None, 8, 8, 512)     0                                            
activation_22 (Activation)       (None, 8, 8, 512)     0                                            
res3d_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3d_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_23 (Activation)       (None, 8, 8, 128)     0                                            
res3d_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3d_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_24 (Activation)       (None, 8, 8, 128)     0                                            
res3d_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3d_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_8 (Add)                      (None, 8, 8, 512)     0                                            
activation_25 (Activation)       (None, 8, 8, 512)     0                                            
res4a_branch2a (Conv2D)          (None, 4, 4, 256)     131328                                       
bn4a_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_26 (Activation)       (None, 4, 4, 256)     0                                            
res4a_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4a_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_27 (Activation)       (None, 4, 4, 256)     0                                            
res4a_branch1 (Conv2D)           (None, 4, 4, 1024)    525312                                       
res4a_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4a_branch1 (BatchNormalization (None, 4, 4, 1024)    4096                                         
bn4a_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_9 (Add)                      (None, 4, 4, 1024)    0                                            
activation_28 (Activation)       (None, 4, 4, 1024)    0                                            
res4b_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4b_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_29 (Activation)       (None, 4, 4, 256)     0                                            
res4b_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4b_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_30 (Activation)       (None, 4, 4, 256)     0                                            
res4b_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4b_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_10 (Add)                     (None, 4, 4, 1024)    0                                            
activation_31 (Activation)       (None, 4, 4, 1024)    0                                            
res4c_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4c_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_32 (Activation)       (None, 4, 4, 256)     0                                            
res4c_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4c_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_33 (Activation)       (None, 4, 4, 256)     0                                            
res4c_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4c_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_11 (Add)                     (None, 4, 4, 1024)    0                                            
activation_34 (Activation)       (None, 4, 4, 1024)    0                                            
res4d_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4d_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_35 (Activation)       (None, 4, 4, 256)     0                                            
res4d_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4d_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_36 (Activation)       (None, 4, 4, 256)     0                                            
res4d_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4d_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_12 (Add)                     (None, 4, 4, 1024)    0                                            
activation_37 (Activation)       (None, 4, 4, 1024)    0                                            
res4e_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4e_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_38 (Activation)       (None, 4, 4, 256)     0                                            
res4e_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4e_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_39 (Activation)       (None, 4, 4, 256)     0                                            
res4e_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4e_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_13 (Add)                     (None, 4, 4, 1024)    0                                            
activation_40 (Activation)       (None, 4, 4, 1024)    0                                            
res4f_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4f_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_41 (Activation)       (None, 4, 4, 256)     0                                            
res4f_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4f_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_42 (Activation)       (None, 4, 4, 256)     0                                            
res4f_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4f_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_14 (Add)                     (None, 4, 4, 1024)    0                                            
activation_43 (Activation)       (None, 4, 4, 1024)    0                                            
res5a_branch2a (Conv2D)          (None, 2, 2, 512)     524800                                       
bn5a_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_44 (Activation)       (None, 2, 2, 512)     0                                            
res5a_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5a_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_45 (Activation)       (None, 2, 2, 512)     0                                            
res5a_branch1 (Conv2D)           (None, 2, 2, 2048)    2099200                                      
res5a_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5a_branch1 (BatchNormalization (None, 2, 2, 2048)    8192                                         
bn5a_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_15 (Add)                     (None, 2, 2, 2048)    0                                            
activation_46 (Activation)       (None, 2, 2, 2048)    0                                            
res5b_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
bn5b_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_47 (Activation)       (None, 2, 2, 512)     0                                            
res5b_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5b_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_48 (Activation)       (None, 2, 2, 512)     0                                            
res5b_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5b_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_16 (Add)                     (None, 2, 2, 2048)    0                                            
activation_49 (Activation)       (None, 2, 2, 2048)    0                                            
res5c_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
bn5c_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_50 (Activation)       (None, 2, 2, 512)     0                                            
res5c_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5c_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_51 (Activation)       (None, 2, 2, 512)     0                                            
res5c_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5c_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_17 (Add)                     (None, 2, 2, 2048)    0                                            
activation_52 (Activation)       (None, 2, 2, 2048)    0                                            
avg_pool (AveragePooling2D)      (None, 1, 1, 2048)    0                                            
flatten_1 (Flatten)              (None, 2048)          0                                            
fc6 (Dense)                      (None, 6)             12294                                        
Total params: 23,600,006.0
Trainable params: 23,546,886.0
Non-trainable params: 53,120.0

Finally, the next figure shows the visualization of our ResNet50.


Key points

  • Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
  • The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
  • There are two main type of blocks: The identity block and the convolutional block.
  • Very deep Residual Networks are built by stacking these blocks together.


This article presents the ResNet algorithm due to He et al. (2015). The implementation here also took significant inspiration and follows the structure given in the github repository of Francois Chollet:

Some Applications of Markov Chain in Python

In this article a few simple applications of Markov chain are going to be discussed as a solution to a few text processing problems. These problems appeared as assignments in a few courses, the descriptions are taken straightaway from the courses themselves.

1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course cos126 . This assignment was developed by Prof. Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with statistical properties of English text because there a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed with 'e' than with 'u', while 'q' is more likely to be followed with 'u' than with 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”

Our task is to write a python program to automate this laborious task in a more efficient way — Shannon’s brute-force approach is prohibitively slow when the size of the input text is large.

Markov model data type.
 Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:


  • Constructor. To implement the data type, create a symbol table, whose keys will be Stringk-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).
  • Order. Return the order k of the Markov Model.
  • Frequency. There are two frequency methods.
    • freq(kgram) returns the number of times the k-gram was found in the original text.
    • freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
  • Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
  • Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N2 time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
                next char       next char is 
kgram   freq    a   c   g        a    c    g
 aa      2      1   0   1       1/2   0   1/2  
 ag      5      3   0   2       3/5   0   2/5  
 cg      1      1   0   0        1    0    0
 ga      5      1   0   4       1/5   0   4/5  
 gc      1      0   0   1        0    0    1  
 gg      3      1   1   1       1/3  1/3  1/3  
        17      7   1   9

Taken from an example from the same assignment.

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.

Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/51/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2

Taken from an example from the same assignment.

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

A Python Implementation

import numpy as np
from collections import defaultdict

class MarkovModel:

    def __init__(self, text, k): 
        create a Markov model of order k from given text
        Assume that text has length at least k.
        self.k = k               
        self.tran = defaultdict(float)
        self.alph = list(set(list(text)))
        self.kgrams = defaultdict(int)
        n = len(text)
        text += text[:k]
        for i in range(n):
            self.tran[text[i:i+k],text[i+k]] += 1.
            self.kgrams[text[i:i+k]] += 1

    def order(self):                   # order k of Markov model
        return self.k

    def freq(self, kgram):             # number of occurrences of kgram in text
        assert len(kgram) == self.k    # (check if kgram is of length k)
        return self.kgrams[kgram]

    def freq2(self, kgram, c):   	# number of times that character c follows kgram
        assert len(kgram) == self.k     # (check if kgram is of length k)  
        return self.tran[kgram,c]  

    def rand(self, kgram):             # random character following given kgram
        assert len(kgram) == self.k    # (check if kgram is of length k.
        Z = sum([self.tran[kgram, alph] for alph in self.alph])
        return np.random.choice(self.alph, 1, p=np.array([self.tran[kgram, alph] for alph in self.alph])/Z)

    def gen(self, kgram, T):           # generate a String of length T characters
        assert len(kgram) == self.k    # by simulating a trajectory through the corresponding
        str = ''                       # Markov chain. The first k characters of the newly
        for _ in range(T):             # generated String should be the argument kgram.
             #print kgram, c           # check if kgram is of length k.
             c =  self.rand(kgram)[0]  # Assume that T is at least k.
             kgram = kgram[1:] + c     
             str += c			 
        return str

Some Results

m = MarkovModel('gagggagaggcgagaaa', 2)

generates the following MarkovChain where each state represents a 2-gram.



Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt


Generated output: random news item, using input as an order 7 model


Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(corrupted)  # replace unknown characters with most probable characters

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.  This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1 windows to maximize the log-likelihood:


Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.
Noisy :      it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.


2. Detecting authorship

This problem appeared as an assignment in the Cornell course cs1114 .

The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when
given a large sample size of text to train on, can then guess the author of an unknown

  • The algorithm to be implemented works based on the following idea: An author’s
    writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow – that is, which words he tends to use after other words.
  • To make things significantly simpler, we’re going to assume that the author always
    follows a given word with the same distribution of words. Of course, this isn’t true,
    since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.
  • In order to implement this model of an author’s writing style, we will use a Markov
    chain. A Markov chain is a set of states with the Markov property – that is, the
    probabilities of each state are independent from the probabilities of every other state. This behavior correctly models our assumption of word independence.
  • A Markov chain can be represented as a directed graph. Each node is a state (words,
    in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.
  • The edges, in this case, represent the probability that word j follows word i from the
    given author. This means, of course, that the sum of the weights of all edges leaving
    from each word must add up to 1.
  • We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.
  • We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some challenges for you to
    figure out.


Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We shall want to create a sparse array using the scipy csr sparse matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains 2 word frequencies (normalized by the total number of words in the document (including repeated words)).

Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous possibly small probabilities together, our
calculated likelihood will likely be extremely small. Thus, you should compare log(likelihood) instead. Keep in mind the possibility that the author may have used a word he has never used before. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.

Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. You will write this in a function classify text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best index.  The following figure shows how to detect an author k (A_k) of the test text t_1..t_n  using the transition matrix P_k with MLE :


Python Implementation

from np import log
def log0(x):
return 0 if x <= 0 else log(x)

def compute_text_likelihood(filename, T, dict_rev, histogram, index):

Compute the (log) likelihood L of a given string (in ‘filename’) given
a word transition probability T, dictionary ‘dict’, and histogram

text = word_tokenize(open(filename).read().replace(‘\n’, ‘ ‘).lower())
num_words = len(text)
text = [word for word in text if word in histogram] # keep only the words that are found in the training dataset
ll = log0(histogram[text[0]]) – log0(sum(histogram.values()))
for i in range(1, len(text)):
ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
return ll + (num_words – num_matches)*penalty

def classify_text(tmatrices, dict_revs, histograms, filename):

Return the index of the most likely author given the transition matrices, dictionaries, and a test file

for i in range(len(tmatrices)):
ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)

print i, ll

Training Dataset

The list of authors whose writings are there in the training dataset:

0. Austen
1. Carroll
2. Hamilton
3. Jay
4. Madison
5. Shakespeare
6. Thoreau
7. Twain


A few lines of excerpts from the training files (the
literature by several authors, taken from Project Gutenberg), the word clouds and a few states from the corresponding Markov Models Constructed for a few authors


Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with




Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep




Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)


So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.

My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.




Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became
head superintendent; had a couple of thousand men under me.

Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.

When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.






Classifying unknown texts from the Test Dataset

Each of the Markov models learnt from the training texts for each author are used to compute the log-likelihood of the unknown test text, the author with the maximum log-likelihood is chosen to be the likely author of the text.
Unknown Text1 

Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.

log-likelihood values computed for the probable authors

Author  LL
0           -3126.5812874
1           -4127.9155186
2           -7364.15782346
3           -9381.06336055
4           -7493.78440066
5           -4837.98005673
6           -3515.44028659
7           -3455.85716104

As can be seen from above the maximum likelihood value corresponds to the author 0, i.e., Austen.  Hence, the most probable author of the unknown text is Austen.



Unknown Text2 

Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
had been separated. But now, this thing had actually and unquestionably
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:

“Brother, go find your brother!”

He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each

log-likelihood values computed for the probable authors

Author  LL
0             -2779.02810424
1             -2738.09304225
2             -5978.83684489
3             -6551.16571407
4             -5780.39620942
5             -4166.34886511
6             -2309.25043697
7             -2033.00112729

As can be seen from above the maximum likelihood value corresponds to the author 7, i.e., Twain.  Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states corresponding to the Markov model for Twain trained from the training dataset.