# Solving Some Image Processing and Computer Vision Problems with Python libraries

In this article, a few image processing / computer vision problems and their solutions with Python libraries (scikit-image, cv2) will be discussed. Some of the problems are taken from the exercises in this book (available on Amazon). This blog will be continued here.

## Removing Gaussian Noise from images by computing mean and median images with scikit-image

2. Create n (e.g., n=100) noisy images by adding i.i.d. Gaussian noise (with zero mean) to the original image, with scikit-image.
3. Compute the mean (median) of the noisy images.
4. Compare PSNR with the original image.
5. Vary n and compare the results.
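from skimage import img_as_float
from skimage.io import imread
from skimage.util import random_noise
# compare_psnr was removed from skimage.measure in newer scikit-image releases;
# skimage.metrics.peak_signal_noise_ratio is its replacement
from skimage.metrics import peak_signal_noise_ratio as compare_psnr
import matplotlib.pylab as plt
import numpy as np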

im = img_as_float(imread('../new images/parrot.jpg')) # original image
np.random.seed(0)
# generate n noisy images from the original image by adding Gaussian noise
n = 25
images = np.zeros((n, im.shape[0], im.shape[1], im.shape[2]))
sigma = 0.2
for i in range(n):
    images[i,...] = random_noise(im, var=sigma**2)

im_mean = images.mean(axis=0)
im_median = np.median(images, axis=0)
plt.figure(figsize=(20,16))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
plt.subplot(221), plt.imshow(im), plt.axis('off'), plt.title('Original image', size=20)
plt.subplot(222), plt.imshow(images[0]), plt.axis('off'), plt.title('Noisy PSNR: ' + str(compare_psnr(im, images[0])), size=20)
plt.subplot(223), plt.imshow(im_mean), plt.axis('off'), plt.title('Mean PSNR: ' + str(compare_psnr(im, im_mean)), size=20)
plt.subplot(224), plt.imshow(im_median), plt.axis('off'), plt.title('Median PSNR: ' + str(compare_psnr(im, im_median)), size=20)
plt.show()



The next figure shows the original image, a noisy image generated from it by adding Gaussian noise (with 0 mean), and the images obtained by taking the mean / median over all the n noisy images generated. As can be seen, the Gaussian noise in the images is largely averaged out by taking the mean / median.

with n = 25

with n=100


plt.hist(images[:,100,100,0], color='red', alpha=0.2, label='red')
plt.hist(images[:,100,100,1], color='green', alpha=0.2, label='green')
plt.hist(images[:,100,100,2], color='blue', alpha=0.2, label='blue')
plt.legend()
plt.grid()
plt.show()



The next figure shows how a pixel value (that can be considered a random variable) for a particular location in different noisy images follows approximately a Gaussian distribution.

Distribution of a pixel value at location (100,100) in the noisy images


ns = [25, 50, 100, 200]
# mean_psnrs contain the PSNR values for different n
plt.plot(ns, mean_psnrs, '.--', label='PSNR (mean)')
plt.plot(ns, median_psnrs, '.--', label='PSNR (median)')
plt.legend()
plt.xlabel('n'),  plt.ylabel('PSNR')
plt.show()



The following figure shows that the PSNR improves as n grows (by the SLLN / WLLN, the sample mean of the zero-mean Gaussian noise converges to its population mean 0). For the median image as well, the image quality improves for larger values of n.
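The plotting snippet above assumes that `mean_psnrs` and `median_psnrs` have already been computed. A minimal sketch of how they could be obtained is shown below. It uses plain NumPy, a hand-written `psnr` helper and a synthetic constant image in place of the parrot photo, all of which are assumptions made for illustration rather than the code used to produce the figure.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, data_range]."""
    mse = np.mean((reference - estimate) ** 2)
    return 10 * np.log10(data_range ** 2 / mse)

def psnr_vs_n(im, ns, sigma=0.2, seed=0):
    """Return the PSNR of the mean and of the median image for each n in ns."""
    rng = np.random.default_rng(seed)
    mean_psnrs, median_psnrs = [], []
    for n in ns:
        noise = rng.normal(0.0, sigma, size=(n,) + im.shape)
        # random_noise clips its output to [0, 1] by default, so do the same here
        images = np.clip(im + noise, 0, 1)
        mean_psnrs.append(psnr(im, images.mean(axis=0)))
        median_psnrs.append(psnr(im, np.median(images, axis=0)))
    return mean_psnrs, median_psnrs

# synthetic stand-in for the original image (hypothetical, not the article's data)
im = np.full((32, 32, 3), 0.5)
ns = [25, 50, 100, 200]
mean_psnrs, median_psnrs = psnr_vs_n(im, ns)
```

Since the variance of the averaged noise shrinks roughly as sigma²/n, each doubling of n should add about 3 dB to the PSNR of the mean image.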

## Tracking Pedestrians with HOG-SVM with OpenCV / scikit-image

2. Capture the video / extract frames from the video.
3. For each frame:
    1. Create HOG scale pyramid of the frame image.
    2. At each scale, use a sliding window to extract the corresponding block from the frame, compute the HOG descriptor features.
    3. Use cv2's HOGDescriptor_getDefaultPeopleDetector() – a pre-trained SVM classifier on the HOG descriptor to classify whether the corresponding block contains a pedestrian or not.
    4. Run non-max-suppression to get rid of multiple detections of the same person.
    5. Use cv2's detectMultiScale() function to implement steps 3-4.
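To make the scale pyramid and sliding window steps concrete, here is a small sketch that only enumerates the pyramid levels and window positions. The function names, the pyramid scale of 1.5 and the stride of 8 pixels are illustrative assumptions; the 128×64 (height × width) window matches the size used by the default people detector. The HOG computation and SVM classification of each window are deliberately left out, since `detectMultiScale()` performs them internally.

```python
import numpy as np

def pyramid_scales(image_shape, window=(128, 64), scale=1.5):
    """Yield (factor, height, width) for each pyramid level that still fits the window."""
    h, w = image_shape[:2]
    factor = 1.0
    while h / factor >= window[0] and w / factor >= window[1]:
        yield factor, int(h / factor), int(w / factor)
        factor *= scale

def sliding_windows(height, width, window=(128, 64), stride=8):
    """Yield the top-left (y, x) corner of every window position at one level."""
    for y in range(0, height - window[0] + 1, stride):
        for x in range(0, width - window[1] + 1, stride):
            yield y, x

# enumerate all candidate blocks for a 480x640 frame
levels = list(pyramid_scales((480, 640)))
n_windows = sum(len(list(sliding_windows(h, w))) for _, h, w in levels)
```

A smaller pyramid scale (such as the 1.02 used with `detectMultiScale()` below) produces many more levels and therefore many more candidate windows.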

The code is adapted from the code here and here.


import cv2
import numpy as np
from imutils.object_detection import non_max_suppression

# HOG descriptor using default people (pedestrian) detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# run detection, using a spatial stride of 4 pixels,
# a scale stride of 1.02, and zero grouping of rectangles
# (to demonstrate that HOG will detect at potentially
# multiple places in the scale pyramid)
(foundBoundingBoxes, weights) = hog.detectMultiScale(frame, winStride=(4, 4), padding=(8, 8), scale=1.02, finalThreshold=0, useMeanshiftGrouping=False)

# convert bounding boxes from format (x1, y1, w, h) to (x1, y1, x2, y2)
rects = np.array([[x, y, x + w, y + h] for (x, y, w, h) in foundBoundingBoxes])

# run non-max suppression on the boxes with an overlap threshold of 65%
nmsBoundingBoxes = non_max_suppression(rects, probs=None, overlapThresh=0.65)



cv2 functions are used to extract HOG descriptor features and to detect pedestrians with SVM, whereas scikit-image functions are used to visualize the HOG features. The animations below display the original video, what HOG sees and the detected pedestrians after non-max suppression. Notice there are a few false positive detections.

Original Video

HOG-descriptor features video (what HOG sees)

Original Video with detected Pedestrians

1. Capture video with webcam with cv2.VideoCapture().
2. For each frame, use the pre-trained Adaboost Cascade classifiers (the haarcascade_frontalface_default classifier for face detection and haarcascade_eye_tree_eyeglasses classifier for better detection of the eyes with glasses, from the corresponding xml files that come with cv2’s installation) using Haar-like features with cv2.CascadeClassifier().
3. First detect the face(s) with the detectMultiScale() function and draw a bounding box. Then detect the eyes inside a detected face with the same function.
4. The following python code snippet shows how to detect faces and eyes with cv2. The code is adapted from here.


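import cv2
# load the pre-trained Haar cascade classifiers shipped with cv2
face_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier(cv2.data.haarcascades + 'haarcascade_eye_tree_eyeglasses.xml')
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# return bounding box of the face(s) if one is detected
# (the scaleFactor and minNeighbors values here are illustrative)
faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
for (x,y,w,h) in faces:
    frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2)
    roi_gray = gray[y:y+h, x:x+w]
    roi_color = frame[y:y+h, x:x+w]
    # detect the eyes inside the detected face region
    eyes = eye_cascade.detectMultiScale(roi_gray)
    for (ex,ey,ew,eh) in eyes:
        cv2.rectangle(roi_color,(ex,ey),(ex+ew,ey+eh),(0,255,0),2)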



The next animation shows the results of face detection when scalefactor 1.03 was used to create the scale pyramid.  As can be seen, the eyes with the glasses on and some small faces from the photos are not detected at this scale.

The next animation shows the results of face detection when scalefactor 1.3 was used to create the scale pyramid.  As can be seen, the eyes with/without the glasses on as well as most of the small faces from the photos are detected at this scale most of the time.

## Semantic Segmentation with ENet / DeepLab (Deep Learning  model)

Input video and the segmented Output video

Input video and the segmented Output video

# Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.

## Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:

• The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
• The 4-grams are chosen so that all the words involved come from a small
vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
• A few of the 250 words in the vocabulary are shown as the output from the MATLAB / Octave code below.

data.vocab
ans =
{
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
.
.
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left
}

• The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
• Let’s first look at the raw sentences file; the first few lines of the file are shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

• The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
• Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following Octave / MATLAB code shows what the training dataset looks like.


[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
train_x(:,13,14)
%ans =
%46
%58
%32
data.vocab{train_x(:,13,14)}
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
train_t(:,13,14)
%ans = 91
data.vocab{train_t(:,13,14)}
%ans = we


• The validation and test data are also similar. They contain 46,568 4-grams each.
• Before training starts, all three datasets need to be loaded and separated into inputs and targets, and the training set needs to be split into mini-batches of size 100.
• First we need to train the model for one epoch (one pass through the training set). Once forward and back-propagation are implemented correctly, the cross-entropy loss will start decreasing.
• At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
• The training method will output a ‘model’ (weight matrices, biases for each layer in the network).
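The separation into inputs, targets and mini-batches described above can be sketched in NumPy as follows. The function name `make_minibatches` and the choice to drop the leftover rows are assumptions for illustration; the output layout (words × batch size × number of batches) mirrors the `train_x(:,13,14)` indexing shown in the Octave snippet.

```python
import numpy as np

def make_minibatches(four_grams, batch_size=100):
    """Split an (N, 4) array of word indices into input and target mini-batches.

    Returns inputs of shape (3, batch_size, num_batches) and targets of shape
    (1, batch_size, num_batches); rows that do not fill a whole batch are dropped.
    """
    num_batches = four_grams.shape[0] // batch_size
    usable = four_grams[: num_batches * batch_size]
    # column-major reshape so that batch k holds rows k*batch_size .. (k+1)*batch_size - 1
    inputs = usable[:, :3].T.reshape(3, batch_size, num_batches, order="F")
    targets = usable[:, 3:].T.reshape(1, batch_size, num_batches, order="F")
    return inputs, targets

# toy data standing in for the 372550 x 4 training matrix
four_grams = np.arange(1000 * 4).reshape(1000, 4)
inputs, targets = make_minibatches(four_grams, batch_size=100)
```

With the real training set of 372,550 rows and a batch size of 100, this would give 3,725 mini-batches, with the last 50 rows left unused.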

## Description of the Network

• As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
• The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
• This layer is connected to the hidden layer, which in turn is connected to the output layer.
• The output layer is a softmax over the 250 words.
• The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.

## Forward Propagation

• The forward propagation is pretty straight-forward and can be implemented as shown in the following code:

function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
% input_batch: The input data as a matrix of size numwords X batchsize where,
% numwords is the number of words, batchsize is the number of data points.
% So, if input_batch(i, j) = k then the ith word in data point j is word
% index k of the vocabulary.
%
% word_embedding_weights: Word embedding as a matrix of size
% vocab_size X numhid1, where vocab_size is the size of the vocabulary
% numhid1 is the dimensionality of the embedding space.
%
% embed_to_hid_weights: Weights between the word embedding layer and hidden
% layer as a matrix of size numhid1*numwords X numhid2, numhid2 is the
% number of hidden units.
%
% hid_to_output_weights: Weights between the hidden layer and output softmax
% unit as a matrix of size numhid2 X vocab_size
%
% hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
% output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
% embedding_layer_state: State of units in the embedding layer as a matrix of
% size numhid1*numwords X batchsize
%
% hidden_layer_state: State of units in the hidden layer as a matrix of size
% numhid2 X batchsize
%
% output_layer_state: State of units in the output layer as a matrix of size
% vocab_size X batchsize
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the inputs word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
word_embedding_weights(reshape(input_batch, 1, []),:)',...
numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting maximum to
% make all inputs <= 0. This prevents overflows when computing their
% exponents.
inputs_to_softmax = inputs_to_softmax...
- repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
sum(output_layer_state, 1), vocab_size, 1);
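For readers more comfortable with Python, the same forward pass can be sketched in NumPy. This is a port written for illustration, not part of the assignment code; it assumes 0-based word indices and the same matrix shapes as documented in the Octave function above.

```python
import numpy as np

def fprop(input_batch, word_embedding_weights, embed_to_hid_weights,
          hid_to_output_weights, hid_bias, output_bias):
    """NumPy version of the forward pass; input_batch holds 0-based word indices."""
    numwords, batchsize = input_batch.shape
    numhid1 = word_embedding_weights.shape[1]

    # Embedding lookup: stack the numwords embeddings of each case into one column.
    embedded = word_embedding_weights[input_batch]      # (numwords, batchsize, numhid1)
    embedding_layer_state = embedded.transpose(0, 2, 1).reshape(
        numwords * numhid1, batchsize)

    # Logistic hidden layer.
    hidden_in = embed_to_hid_weights.T @ embedding_layer_state + hid_bias
    hidden_layer_state = 1.0 / (1.0 + np.exp(-hidden_in))

    # Softmax output, shifted by the column maximum for numerical stability.
    softmax_in = hid_to_output_weights.T @ hidden_layer_state + output_bias
    softmax_in -= softmax_in.max(axis=0, keepdims=True)
    output_layer_state = np.exp(softmax_in)
    output_layer_state /= output_layer_state.sum(axis=0, keepdims=True)
    return embedding_layer_state, hidden_layer_state, output_layer_state

# random weights with the sizes used in the article (250 words, 50-d embedding)
rng = np.random.default_rng(0)
V, d, h, B = 250, 50, 200, 4
x = rng.integers(0, V, size=(3, B))
W_e = rng.normal(0, 0.01, (V, d))
W_h = rng.normal(0, 0.01, (3 * d, h))
W_o = rng.normal(0, 0.01, (h, V))
emb, hid, out = fprop(x, W_e, W_h, W_o, np.zeros((h, 1)), np.zeros((V, 1)))
```

Each column of `out` is a probability distribution over the vocabulary for the corresponding input 3-gram.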



## Back-Propagation

•  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.
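For reference, the gradients for the network computed by `fprop` in the Forward Propagation section can be written out as below. This is a sketch under assumed notation and is not copied from the lecture note: $e$ is one column of the embedding layer state, $W_h$ and $W_o$ stand for `embed_to_hid_weights` and `hid_to_output_weights`, $b_h$ and $b_o$ are the biases, $t$ is the one-hot target, $\sigma$ is the logistic function and $C$ is the cross-entropy for a single training case.

$$
\begin{aligned}
z_h &= W_h^\top e + b_h, \qquad h = \sigma(z_h), \qquad z_o = W_o^\top h + b_o, \qquad y = \mathrm{softmax}(z_o), \qquad C = -\sum_k t_k \log y_k \\
\delta_o &= \frac{\partial C}{\partial z_o} = y - t \\
\frac{\partial C}{\partial W_o} &= h\,\delta_o^\top, \qquad \frac{\partial C}{\partial b_o} = \delta_o \\
\delta_h &= \frac{\partial C}{\partial z_h} = (W_o\,\delta_o) \odot h \odot (1 - h) \\
\frac{\partial C}{\partial W_h} &= e\,\delta_h^\top, \qquad \frac{\partial C}{\partial b_h} = \delta_h \\
\frac{\partial C}{\partial e} &= W_h\,\delta_h
\end{aligned}
$$

The gradient for a row of `word_embedding_weights` is then the sum of the slices of $\partial C / \partial e$ that correspond to the input positions where that word occurs.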

• As the model trains it prints out some numbers that tell how well the training is going.
• The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
• After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
• The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
• At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.

# Some Applications

## 1. Predict next word

• Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
• The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
• It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.

## 2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

1. Given 3 words to start from, initialize the text with those 3 words.
2. Next, the model is asked to predict the k most probable words as candidates for the word following the last 3 words.
3. Choose one of the predicted candidate words at random and append it to the end of the text.
4. Repeat steps 2-3 to generate more words; otherwise stop.

Here is the code that by default generates the top 3 predictions for each 3-gram sliding window and chooses one of the predicted words randomly:


function gen_rand_text(words, model, k)
  % k: number of top predictions to choose from (default 3)
  if nargin < 3
    k = 3;
  end
  probs = [];
  i = 4;
  word = '';
  % generate at least 20 words, then stop at the end of a sentence
  while (i < 20 || ~strcmp(word, '.'))
    [word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);
    words = {words{:}, word};
    probs = [probs; prob];
    i = i + 1;
  end
  fprintf(1, "%s ", words{:});
  fprintf(1, '\n');
  fprintf(1, "%.2f ", round(probs.*100)./100);
  fprintf(1, '\n');
end
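The same generation loop can be sketched in Python. Here `predict_top_k` is a hypothetical callback standing in for the trained model (the toy version below ignores its context entirely), and `max_len` is an added safety limit that the Octave code does not have.

```python
import random

def generate_text(start_words, predict_top_k, k=3, min_len=20, max_len=50, seed=0):
    """Extend start_words by repeatedly sampling one of the k most probable next words."""
    rng = random.Random(seed)
    words = list(start_words)
    while len(words) < max_len:
        candidates = predict_top_k(words[-3], words[-2], words[-1], k)
        words.append(rng.choice(candidates))
        # stop at the end of a sentence once enough words have been produced
        if len(words) >= min_len and words[-1] == ".":
            break
    return " ".join(words)

def toy_predict_top_k(w1, w2, w3, k):
    """Hypothetical stand-in for the trained model: returns a fixed candidate list."""
    return ["the", "time", "."][:k]

text = generate_text(["i", "was", "going"], toy_predict_top_k)
```

Sampling among the top k candidates, instead of always taking the most probable word, is what keeps the generated text from repeating the same continuation every time.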



Starting with the words ‘i was going’, here are some texts that were generated using the model:

Starting with the words ‘life in new’, here is a piece of text that was generated using the model:

## 3. Find nearest words

•  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
• As can be seen from the following animation examples, semantically closer words are mostly chosen as the nearest words for a given word. Also, the higher the number of epochs, the better the ordering of the words in terms of semantic similarity.
• For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.
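The nearest-word lookup described above amounts to a distance computation over the rows of the embedding matrix. A minimal sketch, assuming Euclidean distance and using made-up 2-D embeddings instead of the trained 50-dimensional weights:

```python
import numpy as np

def nearest_words(word, vocab, embeddings, k=3):
    """Return the k words closest to `word` by Euclidean distance in embedding space."""
    idx = vocab.index(word)
    dists = np.linalg.norm(embeddings - embeddings[idx], axis=1)
    dists[idx] = np.inf                      # exclude the query word itself
    order = np.argsort(dists)[:k]
    return [(vocab[i], float(dists[i])) for i in order]

# toy embeddings (made up for illustration, not the trained weights)
vocab = ["day", "week", "year", "between", "among"]
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.3], [5.0, 5.0], [5.0, 5.2]])
neighbours = nearest_words("day", vocab, embeddings, k=2)
```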

## 4. Visualization in 2-dimension with t-SNE

• In all the above examples, the dimension of the word embedding space was 50. Using a t-SNE plot (t-distributed stochastic neighbor embedding, by Laurens van der Maaten), the words can be projected into a 2-dimensional space and visualized, keeping words that are (semantically) near each other in the distributed representation space near each other in the projected space.
• As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
• Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).

## 5. Solving Word-Analogy Problem with the distributed representation

• In this type of problem, 2 words (w1, w2) from the vocabulary are given, where the first is related to the second by some semantic relation. Now, a third word (w3, from the vocabulary) is given, and a fourth word that has a similar semantic relation with the third word is to be found in the vocabulary.
• The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.

• The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.

analogy('year', 'years', 'day', model); % singular-plural relation
%year:years::day:days
%dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186

analogy('on', 'off', 'new', model) % antonyms relation
%on:off::new:old
%dist_E('on','off')=2.013958, dist_E('new','old')=2.265665

analogy('use', 'used', 'do', model) % present-past relation
%use:used::do:did
%dist_E('use','used')=2.556175, dist_E('do','did')=2.456098

analogy('he', 'his', 'they', model) % pronoun-relations
%he:his::they:their
%dist_E('he','his')=3.824808, dist_E('they','their')=3.825453

analogy('today', 'yesterday', 'now', model)
%today:yesterday::now:then
%dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935
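The exhaustive search described above can be sketched in Python as follows. This is a hedged re-implementation of the idea with toy embeddings, not the author's `analogy` function; it picks the word whose Euclidean distance to w3 is closest to the distance between w1 and w2.

```python
import numpy as np

def analogy(w1, w2, w3, vocab, embeddings):
    """Return the word w4 whose distance to w3 best matches the distance from w1 to w2."""
    i1, i2, i3 = (vocab.index(w) for w in (w1, w2, w3))
    target = np.linalg.norm(embeddings[i1] - embeddings[i2])
    dists = np.linalg.norm(embeddings - embeddings[i3], axis=1)
    gap = np.abs(dists - target)
    gap[[i1, i2, i3]] = np.inf               # do not return one of the given words
    best = int(np.argmin(gap))
    return vocab[best], float(target), float(dists[best])

# toy 2-D embeddings (made up for illustration)
vocab = ["year", "years", "day", "days", "the"]
embeddings = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [1.0, 2.1], [5.0, 5.0]])
w4, d12, d34 = analogy("year", "years", "day", vocab, embeddings)
```

Note that this criterion matches only the length of the offset between the two words, not its direction, so it can return unrelated words that happen to lie at the right distance.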


## Model Selection

• Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
• The following figures show the cross-entropy errors on the training and validation sets for the models. As can be seen from the figures, the models with hidden layer size 64 are trained for 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since they have a higher number of parameters to train).
• The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.

# Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the Coursera course Convolutional Neural Networks, which is a part of the Deep Learning Specialization (taught by Prof. Andrew Ng from Stanford and deeplearning.ai; the lecture videos corresponding to the YOLO algorithm can be found here). The problem description is taken directly from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

## Some Theory

Let’s first clarify the concepts of classification, localization and detection, and how the object detection problem can be transformed into a supervised machine learning problem and subsequently solved using a deep convolutional neural network. As can be seen from the next figure,

• Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether there is a car in an image or not), but also finding a bounding box around the object, if one is found.
• Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
• The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
• A typical output data vector will contain 8 entries for a 4-class classification (3 object classes plus “no object”), as shown in the next figure. The first entry indicates whether or not an object from any of the 3 classes is present. If one is present in the image, the next 4 entries define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. If none of the objects is present, the first entry will be 0 and the others will be ignored.
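As a concrete illustration of the 8-entry target vector described in the last bullet, the sketch below builds it for a single image. The function name and the (bx, by, bh, bw) ordering of the box entries are assumptions chosen to match the notation used later in this article.

```python
import numpy as np

def localization_target(class_index=None, box=None, num_classes=3):
    """Build y = [pc, bx, by, bh, bw, c1..c_num_classes]; class_index=None means no object."""
    y = np.zeros(5 + num_classes)
    if class_index is not None:
        y[0] = 1.0                      # pc: an object is present
        y[1:5] = box                    # (bx, by, bh, bw), relative to the image size
        y[5 + class_index] = 1.0        # one-hot class label
    return y

y_car = localization_target(class_index=1, box=(0.5, 0.7, 0.3, 0.4))
y_background = localization_target()
```

For `y_background` only the first entry matters; the zeros in the remaining positions are placeholders that a loss function would ignore.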

• Now moving from localization to detection, one can proceed in two steps, as shown in the next figure: first use small, tightly cropped images to train a convolutional neural net for image classification; then slide windows of different sizes (smaller to larger) sequentially over the entire test image and classify the content of each window with the learnt convnet. This, however, is infeasibly slow computationally.
• However, as shown in the next figure, the convolutional implementation of the sliding windows, which replaces the fully-connected layers by 1×1 filters, makes it possible to classify the image subsets inside all possible sliding windows simultaneously, in parallel, making it much more efficient computationally.

• The convolutional sliding windows, although computationally much more efficient, still has the problem of detecting the accurate bounding boxes, since the boxes don’t align with the sliding windows and the object shapes also tend to be different.
• The YOLO algorithm overcomes this limitation by dividing a training image into a grid and assigning an object to a grid cell if and only if the center of the object falls inside that cell. That way each object in a training image gets assigned to exactly one grid cell, and the corresponding bounding box is represented by coordinates relative to the cell. The next figure describes the details of the algorithm.
• In the test images, multiple adjacent grid cells may each think that an object belongs to them. To resolve this, the IoU (intersection over union) measure is used to quantify the overlap between boxes, and the non-maximum-suppression algorithm keeps the bounding box with the highest confidence among the competing ones and discards the others.
• Still there is a problem of multiple objects falling in the same grid. Multiple anchor boxes (of different shapes) are used to resolve the problem, each anchor box of a particular shape being likely to eventually detect  an object of a particular shape.
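The assignment of an object to a grid cell by its center can be sketched in a few lines. The function below is a simplified illustration (the name and the normalized-coordinate convention are assumptions): it takes an object center as fractions of the image width and height and returns the responsible cell together with the center's position relative to that cell.

```python
def assign_to_cell(cx, cy, grid=19):
    """Map an object centre (cx, cy), given as fractions of image width / height,
    to its grid cell and to coordinates relative to that cell."""
    col = min(int(cx * grid), grid - 1)
    row = min(int(cy * grid), grid - 1)
    bx = cx * grid - col                # horizontal offset inside the cell
    by = cy * grid - row                # vertical offset inside the cell
    return row, col, bx, by

row, col, bx, by = assign_to_cell(0.5, 0.25)
```

Because each center falls into exactly one cell, every labelled object ends up in exactly one position of the training target.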

The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection in the CVPR 2016 summarizing the algorithm:

## Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank drive.ai for providing this dataset! Drive.ai is a company building the brains of self-driving vehicles.

We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

### Definition of a box

If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.

## YOLO

YOLO (“you only look once”) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

### Model details

First things to know:

• The input is a batch of images of shape (m, 608, 608, 3).
• The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

### Encoding architecture for YOLO

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

### Flattening the last two dimensions

Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.
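The element-wise product can be sketched in NumPy as below. The random tensor stands in for the network output, and the layout within the 85 numbers (pc first, then the 4 box coordinates, then the 80 class probabilities) is an assumption based on the box definition above.

```python
import numpy as np

def box_scores_and_classes(box_confidence, box_class_probs):
    """Combine objectness and class probabilities for every anchor box.

    box_confidence: (19, 19, 5, 1), box_class_probs: (19, 19, 5, 80).
    Returns the best class index and its score for each box, both (19, 19, 5).
    """
    box_scores = box_confidence * box_class_probs      # broadcasts over the 80 classes
    box_classes = np.argmax(box_scores, axis=-1)
    box_class_scores = np.max(box_scores, axis=-1)
    return box_classes, box_class_scores

rng = np.random.default_rng(0)
encoding = rng.random((19, 19, 5, 85))                 # random stand-in for the CNN output
flat = encoding.reshape(19, 19, 425)                   # the flattened (19, 19, 425) form
classes, scores = box_scores_and_classes(encoding[..., 0:1], encoding[..., 5:])
```

Taking a further maximum of `scores` over the anchor axis gives one score per grid cell, which is what the visualization in the next section colors.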

### Find the class detected by each box

Here’s one way to visualize what YOLO is predicting on an image:

• For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
• Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

• Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
• Select only one box when several boxes overlap with each other and detect the same object.

## Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

• box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
• boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
• box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.
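As a rough sketch of this rearrangement, the snippet below splits a (19, 19, 425) array into the three variables with NumPy. The ordering of the 85 numbers (pc first, then the 4 box coordinates, then the 80 class probabilities) is an assumption made for illustration; the actual model output may be laid out differently and also needs activations applied before it can be read as probabilities.

```python
import numpy as np

# Hypothetical raw output for a single image: 425 numbers per grid cell.
raw = np.random.default_rng(0).random((19, 19, 425))

# Un-flatten the last axis into 5 anchor boxes x 85 numbers per box.
per_box = raw.reshape(19, 19, 5, 85)

# Assumed layout of the 85 numbers: (pc, bx, by, bh, bw, c1..c80).
box_confidence = per_box[..., 0:1]    # (19, 19, 5, 1)
boxes = per_box[..., 1:5]             # (19, 19, 5, 4)
box_class_probs = per_box[..., 5:]    # (19, 19, 5, 80)
```

Slicing with 0:1 rather than 0 keeps the trailing axis of size 1, so that box_confidence can later be broadcast against box_class_probs.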

Exercise: Implement yolo_filter_boxes().

• Compute box scores by doing the element-wise product as described in the above figure.
• For each box, find:
• the index of the class with the maximum box score.
• the corresponding box score.
• Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
• Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D




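def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
    """Filters YOLO boxes by thresholding on object and class confidence.

    Arguments:
    box_confidence -- tensor of shape (19, 19, 5, 1)
    boxes -- tensor of shape (19, 19, 5, 4)
    box_class_probs -- tensor of shape (19, 19, 5, 80)
    threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    """

    # Step 1: Compute box scores (element-wise product of box_confidence and box_class_probs)

    # Step 2: For each box, find the class with the maximum box score and the corresponding score

    # Step 3: Create a mask from the threshold; it should be True for the boxes to keep (score >= threshold)

    # Step 4: Apply the mask to scores, boxes and classes

    return scores, boxes, classes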
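The four steps of the exercise can be illustrated outside of TensorFlow. The following is a NumPy sketch, not the TensorFlow solution: filter_boxes_np is a hypothetical helper name, the inputs are random stand-ins, and NumPy boolean indexing plays the role that a function such as tf.boolean_mask would play in a TensorFlow graph.

```python
import numpy as np

def filter_boxes_np(box_confidence, boxes, box_class_probs, threshold=0.6):
    """NumPy sketch of score thresholding (hypothetical helper)."""
    # Step 1: box scores for every class.
    box_scores = box_confidence * box_class_probs      # (19, 19, 5, 80)
    # Step 2: best class and its score for each box.
    box_classes = box_scores.argmax(axis=-1)           # (19, 19, 5)
    box_class_scores = box_scores.max(axis=-1)         # (19, 19, 5)
    # Step 3: True for the boxes we keep.
    mask = box_class_scores >= threshold
    # Step 4: boolean indexing flattens the three grid/anchor axes.
    return box_class_scores[mask], boxes[mask], box_classes[mask]

rng = np.random.default_rng(1)
box_confidence = rng.random((19, 19, 5, 1))
boxes = rng.random((19, 19, 5, 4))
box_class_probs = rng.random((19, 19, 5, 80))
scores, kept_boxes, classes = filter_boxes_np(box_confidence, boxes, box_class_probs)
```

The number of surviving boxes depends on the threshold, which is why the corresponding tensor dimension is reported as "None" in the docstrings.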



### Non-max suppression

Even after filtering by thresholding on the class scores, we still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

In this example, the model has predicted 3 cars, but they are actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

### Definition of “Intersection over Union”

Exercise: Implement iou(). Some hints:

• In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
• To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
• We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
xi1 = maximum of the x1 coordinates of the two boxes
yi1 = maximum of the y1 coordinates of the two boxes
xi2 = minimum of the x2 coordinates of the two boxes
yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.


def iou(box1, box2):
    """Implement the intersection over union (IoU) between box1 and box2

    Arguments:
    box1 -- first box, list object with coordinates (x1, y1, x2, y2)
    box2 -- second box, list object with coordinates (x1, y1, x2, y2)
    """

    # Calculate the (xi1, yi1, xi2, yi2) coordinates of the intersection of box1 and box2. Calculate its area.

    # Calculate the union area by using the formula: Union(A,B) = A + B - Inter(A,B)

    # Compute the IoU

    return iou
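One possible way to fill in these hints is sketched below in plain Python. The name iou_sketch is hypothetical, and the clamping of the intersection width and height at zero (so that disjoint boxes get an IoU of 0) is an addition that the hints do not mention.

```python
def iou_sketch(box1, box2):
    """IoU of two boxes given as (x1, y1, x2, y2) corners."""
    # Corners of the intersection rectangle.
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    # Clamp at zero so disjoint boxes get an empty intersection.
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)
    # Union(A, B) = A + B - Inter(A, B)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area
    return inter_area / union_area

# Two 2x2 boxes sharing a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7
overlap = iou_sketch((2, 1, 4, 3), (1, 2, 3, 4))
```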



We are now ready to implement non-max suppression. The key steps are:

• Select the box that has the highest score.
• Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
• Go back to the first step and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.
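The key steps above can be sketched as a greedy loop in NumPy. This is only an illustration of the idea, with a hypothetical function name and toy boxes; it is not the TensorFlow routine used later in the exercise, whose internals may differ.

```python
import numpy as np

def nms_sketch(boxes, scores, iou_threshold=0.5, max_boxes=10):
    """Greedy NMS; boxes is (N, 4) as (x1, y1, x2, y2), scores is (N,)."""
    boxes = np.asarray(boxes, dtype=float)
    order = np.argsort(scores)[::-1]          # indices, best score first
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    keep = []
    while order.size > 0 and len(keep) < max_boxes:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU of the selected box with every remaining box.
        xi1 = np.maximum(boxes[best, 0], boxes[rest, 0])
        yi1 = np.maximum(boxes[best, 1], boxes[rest, 1])
        xi2 = np.minimum(boxes[best, 2], boxes[rest, 2])
        yi2 = np.minimum(boxes[best, 3], boxes[rest, 3])
        inter = np.clip(xi2 - xi1, 0, None) * np.clip(yi2 - yi1, 0, None)
        ious = inter / (areas[best] + areas[rest] - inter)
        # Drop boxes that overlap the selected one too much.
        order = rest[ious <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = np.array([0.9, 0.8, 0.7])
kept = nms_sketch(boxes, scores)   # indices of the surviving boxes
```

In this toy example the second box overlaps the first with an IoU of 81/119 (about 0.68), so it is suppressed, while the third box does not overlap anything and is kept.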

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow and Keras provide two built-in functions that are used to implement non-max suppression (so we don’t actually need to use our iou() implementation): tf.image.non_max_suppression() and K.gather().

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
    """
    Applies non-max suppression (NMS) to a set of boxes

    Arguments:
    scores -- tensor of shape (None,), output of yolo_filter_boxes()
    boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
    classes -- tensor of shape (None,), output of yolo_filter_boxes()
    max_boxes -- integer, maximum number of predicted boxes you'd like
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box

    Note: The "None" dimension of the output tensors is at most max_boxes.
    """

    max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

    # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

    # Use K.gather() to select only nms_indices from scores, boxes and classes

    return scores, boxes, classes



### Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing on images of a different size (for example, the car detection dataset has 720×1280 images), this step rescales the boxes so that they can be plotted on top of the original 720×1280 image.
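To make the two conversions concrete, here is a NumPy sketch of what such helpers could look like. The function names with the _sketch suffix are hypothetical, and the sketch assumes that box coordinates are normalized to [0, 1] and ordered as (x, y); the provided yolo_boxes_to_corners and scale_boxes functions may use a different coordinate order internally.

```python
import numpy as np

def boxes_to_corners_sketch(box_xy, box_wh):
    """Convert midpoints and sizes to (x1, y1, x2, y2) corners."""
    box_mins = box_xy - box_wh / 2.0
    box_maxes = box_xy + box_wh / 2.0
    return np.concatenate([box_mins, box_maxes], axis=-1)

def scale_boxes_sketch(boxes, image_shape):
    """Scale normalized (x1, y1, x2, y2) corners to pixel coordinates."""
    height, width = image_shape
    return boxes * np.array([width, height, width, height])

# One box centered in the image, 20% of its width and 40% of its height.
box_xy = np.array([[0.5, 0.5]])
box_wh = np.array([[0.2, 0.4]])
corners = boxes_to_corners_sketch(box_xy, box_wh)       # about [[0.4, 0.3, 0.6, 0.7]]
pixels = scale_boxes_sketch(corners, (720.0, 1280.0))   # about [[512, 216, 768, 504]]
```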


def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    """
    Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

    Arguments:
    yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
                    box_confidence: tensor of shape (None, 19, 19, 5, 1)
                    box_xy: tensor of shape (None, 19, 19, 5, 2)
                    box_wh: tensor of shape (None, 19, 19, 5, 2)
                    box_class_probs: tensor of shape (None, 19, 19, 5, 80)
    image_shape -- tensor of shape (2,) containing the shape of the original image, in this notebook we use (720., 1280.) (has to be float32 dtype)
    max_boxes -- integer, maximum number of predicted boxes you'd like
    score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
    iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

    Returns:
    scores -- tensor of shape (None,), predicted score for each box
    boxes -- tensor of shape (None, 4), predicted box coordinates
    classes -- tensor of shape (None,), predicted class for each box
    """

    # Retrieve outputs of the YOLO model

    # Convert boxes to be ready for filtering functions

    # Use one of the functions you've implemented to perform score-filtering with a threshold of score_threshold

    # Scale boxes back to original image shape.

    # Use one of the functions you've implemented to perform non-max suppression with a threshold of iou_threshold

    return scores, boxes, classes



### Summary for YOLO:

• Input image (608, 608, 3)
• The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
• After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
• Each cell in a 19×19 grid over the input image gives 425 numbers.
• 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
• 85 = 5 + 80 where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and 80 is the number of classes we’d like to detect.
• We then select only a few boxes based on:
• Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
• Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
• This gives us YOLO’s final output.

### Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to start your graph. Run the following cell.

sess = K.get_session()

### Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

image_shape = (720., 1280.)

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.

yolo_model.summary()

____________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
===========================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
____________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
____________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
____________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
____________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
____________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
____________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
____________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
____________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
____________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
____________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
____________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
____________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
____________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
____________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
____________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
____________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
____________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
____________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
____________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
____________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
____________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
____________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
____________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
____________________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
____________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
____________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
____________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
____________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
____________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
____________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
____________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
____________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
____________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
____________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
____________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
____________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
____________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
____________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
____________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
____________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
____________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
____________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
____________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
____________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
____________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
____________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
____________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
____________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
____________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
____________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
____________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
____________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
____________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
____________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
____________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
____________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
____________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
____________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
____________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
____________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
____________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
____________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
____________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
____________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
____________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
____________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
____________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
____________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
____________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
____________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
____________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
===========================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
____________________________________________________________________________________________

Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.
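A few of the numbers in the summary above can be checked by hand. The sketch below assumes 3×3 and 1×1 kernels and assumes that only the last convolution has a bias term; these are inferences from the parameter counts, not facts stated in the summary itself, and the helper names are hypothetical.

```python
def conv_params(kernel, in_channels, out_channels, bias=False):
    """Number of weights in a kernel x kernel convolution layer."""
    n = kernel * kernel * in_channels * out_channels
    return n + out_channels if bias else n

def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance per channel."""
    return 4 * channels

# Five 2x2 max-pooling layers halve the resolution five times: 608 -> 19.
grid_size = 608 // 2 ** 5

checks = {
    "conv2d_1": conv_params(3, 3, 32),                     # 864
    "batch_normalization_1": batchnorm_params(32),         # 128
    "conv2d_4": conv_params(1, 128, 64),                   # 8192
    "conv2d_22": conv_params(3, 1280, 1024),               # 11796480
    "conv2d_23": conv_params(1, 1024, 425, bias=True),     # 435625
}
```

The last layer has 425 = 5 × 85 output channels, which is what gets reshaped into the (19, 19, 5, 85) encoding.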

### Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

We added yolo_outputs to your graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

### Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Let’s now call yolo_eval, which we implemented previously, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

### Run the graph on an image

Let the fun begin. We have created a (sess) graph that can be summarized as follows:

1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

which outputs:

• image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
• image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.


def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.

    Arguments:
    sess -- your tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.

    Returns:
    out_scores -- tensor of shape (None,), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None,), class index of the predicted boxes

    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
    """

    # Run the session with the correct tensors and choose the correct placeholders in the
    # feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results (the boxes have been drawn onto "image")
    imshow(image)

    return out_scores, out_boxes, out_classes



Let’s run the following cell on the “test.jpg” image to verify that our function is correct.

Input

out_scores, out_boxes, out_classes = predict(sess, "test.jpg")

The following figure shows the output after car detection. Each of the bounding boxes has the name of the detected object at the top left, along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)

The following animation shows the output images with detected objects (cars) using YOLO for a set of input images.

What we should remember:

• YOLO is a state-of-the-art object detection model that is fast and accurate.
• It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
• The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
• You filter through all the boxes using non-max suppression. Specifically:
• Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
• Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
• Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise.

References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
4. The official YOLO website.

Car detection dataset: Creative Commons License.

# Hand-Gesture Classification using Deep Convolution and Residual Neural Network (ResNet-50) with Tensorflow / Keras in Python

In this article, an application of a convolutional net to classify a set of hand-sign images is discussed first. Later, the accuracy of this classifier is improved using a deep res-net. These problems appeared as assignments in the Coursera course Convolution Neural Networks (a part of deep-learning specialization) by the Stanford Prof. Andrew Ng. (deeplearning.ai). The problem descriptions are taken directly from the course itself.

## 1. Hand-gesture Classification with Convolution Neural Network

In this assignment, the following tasks are going to be accomplished:

• Implement a fully functioning ConvNet using TensorFlow.
• Build and train a ConvNet in TensorFlow for a classification problem

This assignment is going to be done using tensorflow.

First the necessary packages are loaded:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import tensorflow as tf
from tensorflow.python.framework import ops
from cnn_utils import *

%matplotlib inline
np.random.seed(1)

Next, the “SIGNS” dataset that we are going to use is loaded. The SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5, as shown in the next figure. The output classes are represented with one-hot encoding.

# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

The next figures show a few randomly sampled images for each class label from the training dataset. There are 180 images for each class and a total of 1080 images in the training dataset.

number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)
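The label arrays have 6 columns because each class label is one-hot encoded. A minimal NumPy sketch of that encoding is shown below; the helper name one_hot and the toy labels are assumptions for illustration, and the course's own utility function may be implemented differently.

```python
import numpy as np

def one_hot(labels, num_classes):
    """Turn integer labels of shape (m,) into one-hot rows of shape (m, num_classes)."""
    # Row i of the identity matrix is the one-hot vector for class i.
    return np.eye(num_classes)[labels]

labels = np.array([0, 5, 2])   # three hypothetical sign labels
Y = one_hot(labels, 6)         # shape (3, 6), a single 1 per row
```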


The following steps are to be executed to train a conv-net model with tensorflow using the training dataset and then classify the images from the test dataset using the model.

### Create placeholders

TensorFlow requires that we create placeholders for the input data that will be fed into the model when running the session.

Let’s implement the function below to create placeholders for the input image X and the output Y. We should not define the number of training examples for the moment. To do so, we can use “None” as the batch size, which gives us the flexibility to choose it later. Hence X should be of dimension [None, n_H0, n_W0, n_C0] and Y should be of dimension [None, n_y].

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    """
    Creates the placeholders for the tensorflow session.

    Arguments:
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes

    Returns:
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"
    """

    X = tf.placeholder(tf.float32, shape=(None, n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None, n_y))

    return X, Y

### Initialize parameters

Let’s initialize weights/filters W1 and W2 using xavier_initializer.
We don’t need to worry about bias variables, as we will soon see that the TensorFlow functions take care of the bias. Note also that we only initialize the weights/filters for the conv2d functions. TensorFlow initializes the layers for the fully connected part automatically.

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
    W1 : [4, 4, 3, 8]
    W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """

    tf.set_random_seed(1)                              # fix the graph-level seed so that the "random" numbers are reproducible

    W1 = tf.get_variable("W1", (4, 4, 3, 8), initializer = tf.contrib.layers.xavier_initializer(seed = 0))
    W2 = tf.get_variable("W2", (2, 2, 8, 16), initializer = tf.contrib.layers.xavier_initializer(seed = 0))

    parameters = {"W1": W1,
                  "W2": W2}

    return parameters
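To get a feel for the scale of the initial weights, here is a minimal sketch of the bound used by the uniform Xavier (Glorot) scheme, assuming weights are drawn from U(-r, r) with r = sqrt(6 / (fan_in + fan_out)). The helper name `glorot_uniform_limit` is made up for this illustration and is not part of TensorFlow.

```python
import math

def glorot_uniform_limit(shape):
    """Return the bound r such that weights are drawn from U(-r, r).

    shape is a conv kernel shape: (height, width, in_channels, out_channels).
    """
    kernel_h, kernel_w, in_channels, out_channels = shape
    receptive_field = kernel_h * kernel_w
    fan_in = receptive_field * in_channels    # inputs feeding one output unit
    fan_out = receptive_field * out_channels  # outputs fed by one input unit
    return math.sqrt(6.0 / (fan_in + fan_out))

# The two filter shapes used in initialize_parameters
w1_limit = glorot_uniform_limit((4, 4, 3, 8))
w2_limit = glorot_uniform_limit((2, 2, 8, 16))
print(round(w1_limit, 4), round(w2_limit, 4))
```

Under this assumption the entries of W1 start within roughly ±0.18 and those of W2 within ±0.25.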


### Forward propagation

Next we need to implement the forward_propagation function below to build the following model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED.

We need to use the following built-in tensorflow functions:

• tf.nn.conv2d(X, W1, strides = [1,s,s,1], padding = 'SAME'): given an input X and a group of filters W1, this function convolves W1’s filters on X. The third input ([1,s,s,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev). You can read the full documentation here.
• tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = 'SAME'): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. You can read the full documentation here.
• tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape). You can read the full documentation here.
• tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector while maintaining the batch size. It returns a flattened tensor with shape [batch_size, k]. You can read the full documentation here.
• tf.contrib.layers.fully_connected(F, num_outputs): given the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation here.

In detail, we will use the following parameters for all the steps:

- Conv2D: stride 1, padding is "SAME"
- ReLU
- Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
- Conv2D: stride 1, padding is "SAME"
- ReLU
- Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
- Flatten the previous output.
- FULLYCONNECTED (FC) layer: Apply a fully connected layer without a non-linear activation function. Do not call the softmax here. This results in 6 neurons in the output layer, which then get passed later to a softmax. In TensorFlow, the softmax and cost function are lumped together into a single function, which we'll call in a different function when computing the cost.
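The layer settings above determine the tensor shapes at each step. The sketch below traces them for a single 64×64×3 image, assuming the usual rule that "SAME" padding yields an output of size ceil(n / stride). The helper names are made up for this illustration, and the filter counts 8 and 16 come from the shapes of W1 and W2.

```python
import math

def same_output_size(size, stride):
    """Spatial size after a 'SAME'-padded convolution or pooling step."""
    return math.ceil(size / stride)

def trace_shapes(height=64, width=64):
    """Return (layer, shape) pairs for one example, ignoring the batch axis."""
    shapes = []
    # CONV2D with W1 (8 filters), stride 1
    h, w = same_output_size(height, 1), same_output_size(width, 1)
    c = 8
    shapes.append(("conv1", (h, w, c)))
    # MAXPOOL with an 8x8 window and stride 8
    h, w = same_output_size(h, 8), same_output_size(w, 8)
    shapes.append(("pool1", (h, w, c)))
    # CONV2D with W2 (16 filters), stride 1
    h, w = same_output_size(h, 1), same_output_size(w, 1)
    c = 16
    shapes.append(("conv2", (h, w, c)))
    # MAXPOOL with a 4x4 window and stride 4
    h, w = same_output_size(h, 4), same_output_size(w, 4)
    shapes.append(("pool2", (h, w, c)))
    # FLATTEN, then a fully connected layer with 6 outputs
    shapes.append(("flatten", (h * w * c,)))
    shapes.append(("fc", (6,)))
    return shapes

for name, shape in trace_shapes():
    print(name, shape)
```

With these settings the fully connected layer sees only 64 features per image, which keeps the model very small.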
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

    Arguments:
    X -- input dataset placeholder, of shape (None, n_H0, n_W0, n_C0)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """


### Compute cost

Next step is to implement the compute cost function using the following tensorflow functions:

• tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax cross-entropy loss. This function computes both the softmax activation function and the resulting loss. You can check the full documentation here.
• tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to average the losses over all the examples to get the overall cost. You can check the full documentation here.
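To make the two steps concrete, here is a dependency-free sketch of what they compute together: a softmax cross-entropy per example, followed by a mean over the examples. It is an illustration in plain Python, not the TensorFlow implementation, and the function names are made up.

```python
import math

def softmax_cross_entropy(logits, one_hot_label):
    """Cross-entropy between softmax(logits) and a one-hot label, for one example."""
    shift = max(logits)  # subtract the max for numerical stability
    log_sum_exp = shift + math.log(sum(math.exp(z - shift) for z in logits))
    log_probs = [z - log_sum_exp for z in logits]
    return -sum(y * lp for y, lp in zip(one_hot_label, log_probs))

def mean_cost(batch_logits, batch_labels):
    """Average the per-example losses, as tf.reduce_mean does for the cost."""
    losses = [softmax_cross_entropy(z, y)
              for z, y in zip(batch_logits, batch_labels)]
    return sum(losses) / len(losses)

# Two examples with all-zero logits, i.e. a uniform prediction over 6 classes
uniform_logits = [[0.0] * 6, [0.0] * 6]
labels = [[1, 0, 0, 0, 0, 0], [0, 0, 0, 1, 0, 0]]
cost = mean_cost(uniform_logits, labels)
print(cost)
```

For a uniform prediction over 6 classes the cost equals ln 6 ≈ 1.79, which is the order of magnitude to expect from an untrained six-class model.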
def compute_cost(Z3, Y):
    """
    Computes the cost

    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (number of examples, 6)
    Y -- "true" labels vector placeholder, same shape as Z3

    Returns:
    cost - Tensor of the cost function
    """

### Model

Finally we need to merge the helper functions we implemented above to build a model and train it on the SIGNS dataset.

The model should:

• create placeholders
• initialize parameters
• forward propagate
• compute the cost
• create an optimizer

Finally we need to create a session and run a for loop over num_epochs: in each epoch we get the mini-batches and then, for each mini-batch, run one optimization step on the cost.
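The mini-batch step can be sketched as follows. This is a simplified stand-in for whatever helper the training code uses (that helper is not shown here); the name `mini_batch_indices` and the seed handling are assumptions made for the illustration.

```python
import random

def mini_batch_indices(num_examples, batch_size, seed=0):
    """Shuffle the example indices and cut them into consecutive mini-batches."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    return [indices[start:start + batch_size]
            for start in range(0, num_examples, batch_size)]

# 1080 training examples with the default minibatch_size of 64
batches = mini_batch_indices(1080, 64)
print(len(batches), len(batches[0]), len(batches[-1]))
```

With 1080 examples and a batch size of 64, each epoch consists of 16 full mini-batches plus a final one with the remaining 56 examples.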

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    """
    Implements a three-layer ConvNet in Tensorflow:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED

    Arguments:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- training labels, of shape (None, n_y = 6)
    X_test -- test set, of shape (None, 64, 64, 3)
    Y_test -- test labels, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 5 epochs

    Returns:
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

Then let’s train the model for 100 epochs.

_, _, parameters = model(X_train, Y_train, X_test, Y_test)

with the following output:
Cost after epoch 0: 1.918487
Cost after epoch 5: 1.875008
Cost after epoch 10: 1.813409
Cost after epoch 15: 1.667654
Cost after epoch 20: 1.444399
Cost after epoch 25: 1.203926
Cost after epoch 30: 1.028009
Cost after epoch 35: 0.887578
Cost after epoch 40: 0.791803
Cost after epoch 45: 0.712319
Cost after epoch 50: 0.655244
Cost after epoch 55: 0.597494
Cost after epoch 60: 0.556236
Cost after epoch 65: 0.525260
Cost after epoch 70: 0.484548
Cost after epoch 75: 0.477365
Cost after epoch 80: 0.451908
Cost after epoch 85: 0.415393
Cost after epoch 90: 0.386501
Cost after epoch 95: 0.373167

Tensor("Mean_1:0", shape=(), dtype=float32)
Train Accuracy: 0.894444
Test Accuracy: 0.841667

## 2. Improving the Accuracy of the Hand-Gesture Classifier with Residual Networks

Now we shall learn how to build very deep convolutional networks, using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions; but in practice, they are hard to train. Residual Networks, introduced by He et al., make it possible to train much deeper networks than was previously feasible.

In this assignment, we are going to accomplish the following tasks:

• Implement the basic building blocks of ResNets.
• Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

This assignment will be done in Keras.

Let’s first load the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model  # needed for Model(...) in ResNet50 below
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot_ng as pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline
import keras.backend as K
K.set_image_data_format('channels_last')
K.set_learning_phase(1)

## The problem of very deep neural networks

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow.

During training, we might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds.
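A rough back-of-the-envelope illustration of why this happens: if back-propagation scaled the gradient by about the same factor at every layer, the signal reaching the early layers would shrink geometrically with depth. The factor 0.5 below is an arbitrary assumption chosen for illustration; in a real network the factor differs from layer to layer.

```python
def gradient_scale(per_layer_factor, depth):
    """Gradient magnitude after being scaled by the same factor at each of `depth` layers."""
    return per_layer_factor ** depth

for depth in (5, 20, 50):
    print(depth, gradient_scale(0.5, depth))
```

Under this toy assumption the gradient at depth 50 is below 1e-15 of its original size, so the earliest layers would barely be updated.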

We are now going to solve this problem by building a Residual Network!

## Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly back-propagated to earlier layers:

The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, we can form a very deep network.

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different. We are going to implement both of them.
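The effect of the shortcut can be sketched in a few lines of plain Python. Here `zero_main_path` is a made-up stand-in for a main path whose weights are all zero. In that case the block simply passes its (non-negative) input through, which hints at why adding residual blocks does not make it harder to represent the identity function.

```python
def relu(values):
    """Elementwise ReLU on a list of numbers."""
    return [max(0.0, v) for v in values]

def residual_block(x, main_path):
    """Compute relu(main_path(x) + x): the shortcut adds the input back in."""
    fx = main_path(x)
    return relu([f + s for f, s in zip(fx, x)])

def zero_main_path(x):
    # Stand-in for a main path whose weights (and biases) are all zero
    return [0.0 for _ in x]

activations = [0.0, 1.5, 3.0]  # non-negative, as after a previous ReLU
output = residual_block(activations, zero_main_path)
print(output)
```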

### 1 – The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step.

In this exercise, we’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:

Here are the individual steps.

First component of main path:

• The first CONV2D has F1 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
• The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

• The second CONV2D has F2 filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
• The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

• The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
• The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

• The shortcut and the main path values are added together.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Now let’s implement the ResNet identity block.

• To implement the Conv2D step: See reference
• To implement BatchNorm: See reference (axis: Integer, the axis that should be normalized (typically the channels axis))
• For the activation, use: Activation('relu')(X)
• To add the value passed forward by the shortcut: See reference
def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 3

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network

    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    ### The first Component ###
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value. You'll need this later to add back to the main path.
    X_shortcut = X

    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### The second Component ###

    # ...

    ### The third Component ###

    # ...

    return X
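As a sanity check on the layer sizes described above, the following sketch counts the trainable and non-trainable parameters of one identity block with f = 3 and filters [64, 64, 256] applied to a 256-channel input (the configuration of blocks “b” and “c” in stage 2). The helper names are made up. The BatchNorm count assumes four values per channel (scale, offset, moving mean and moving variance).

```python
def conv2d_params(kernel_size, in_channels, out_channels):
    """Weights of a square conv kernel plus one bias per filter."""
    return kernel_size * kernel_size * in_channels * out_channels + out_channels

def batchnorm_params(channels):
    """Scale, offset, moving mean and moving variance for each channel."""
    return 4 * channels

f, (F1, F2, F3), in_channels = 3, (64, 64, 256), 256
counts = {
    "2a": conv2d_params(1, in_channels, F1),  # 1x1 conv, 256 -> 64 channels
    "2b": conv2d_params(f, F1, F2),           # 3x3 conv, 64 -> 64 channels
    "2c": conv2d_params(1, F2, F3),           # 1x1 conv, 64 -> 256 channels
}
print(counts, batchnorm_params(F1), batchnorm_params(F3))
```

These numbers (16448, 36928 and 16640 for the convolutions, 256 and 1024 for the BatchNorm layers) agree with the res2b_* and bn2b_* rows of the model summary printed at the end of this section.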

### 2 – The convolutional block

Next, the ResNet “convolutional block” is the other type of block. We can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:

The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. For example, to reduce the activation’s height and width by a factor of 2, we can use a 1×1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.
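The size reduction follows from the standard output-size formula for a “valid” convolution, floor((n - k) / s) + 1. A minimal sketch (the helper name is made up for illustration):

```python
def valid_output_size(size, kernel, stride):
    """Spatial size after a 'valid' (unpadded) convolution or pooling step."""
    return (size - kernel) // stride + 1

# A 1x1 convolution with stride 2 roughly halves the height and width
print(valid_output_size(15, 1, 2))
print(valid_output_size(8, 1, 2))
```

For odd sizes the result is rounded up rather than being exactly half: a 15×15 activation becomes 8×8, which is the transition seen between stage 2 and stage 3 in the model summary at the end of this section.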

The details of the convolutional block are as follows.

First component of main path:

• The first CONV2D has F1 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
• The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

• The second CONV2D has F2 filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'.
• The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

• The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'.
• The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

• The CONV2D has F3 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
• The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

• The shortcut and the main path values are added together.
• Then apply the ReLU activation function. This has no name and no hyperparameters.

Let’s now implement the convolutional block.

• Conv Hint
• BatchNorm Hint (axis: Integer, the axis that should be normalized (typically the features axis))
• For the activation, use: Activation('relu')(X)
def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block as defined in Figure 4

    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used

    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """

    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'

    # Retrieve Filters
    F1, F2, F3 = filters

    # Save the input value
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    # Second component of main path
    # ...
    # Third component of main path
    # ...
    ##### SHORTCUT PATH ####
    # ...
    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    # ...
    return X

### 3 – Building our first ResNet model (50 layers)

We now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means we should stack 3 identity blocks together.

The details of this ResNet-50 model are:

• Stage 1:
  • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
  • BatchNorm is applied to the channels axis of the input.
  • MaxPooling uses a (3,3) window and a (2,2) stride.
• Stage 2:
  • The convolutional block uses three sets of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
  • The 2 identity blocks use three sets of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
• Stage 3:
  • The convolutional block uses three sets of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
  • The 3 identity blocks use three sets of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
• Stage 4:
  • The convolutional block uses three sets of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
  • The 5 identity blocks use three sets of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
• Stage 5:
  • The convolutional block uses three sets of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
  • The 2 identity blocks use three sets of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
• The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
• The flatten doesn’t have any hyperparameters or name.
• The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).
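The “50” in the name can be recovered from the stage list above. The sketch below uses the common counting convention, which is an assumption here: only the convolutions on the main path and the final dense layer are counted, while the shortcut convolutions, BatchNorm and pooling layers are not.

```python
# Blocks per stage: one convolutional block plus the listed identity blocks
blocks_per_stage = {2: 1 + 2, 3: 1 + 3, 4: 1 + 5, 5: 1 + 2}
convs_per_block = 3  # the three CONV2D layers on the main path of each block

stage1_convs = 1     # "conv1"
dense_layers = 1     # the final fully connected layer

total_layers = (stage1_convs
                + convs_per_block * sum(blocks_per_stage.values())
                + dense_layers)
print(total_layers)
```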

Let’s implement the ResNet with 50 layers described in the figure above.

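Besides the two blocks defined above, the code below uses the Keras layers Conv2D, BatchNormalization, Activation, MaxPooling2D, Flatten and Dense; the average-pooling step additionally needs AveragePooling2D.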

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """

    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Zero-padding of (3, 3): this gives the 32x32 output of conv1 shown in the model summary
    X = ZeroPadding2D((3, 3))(X_input)

    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # ...

    # ...

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)

    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

Next, let’s build the model’s graph. We have 6 output classes for the hand-signs dataset.

model = ResNet50(input_shape = (64, 64, 3), classes = 6)


We need to configure the learning process by compiling the model.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])


The model is now ready to be trained. The only thing left is to load the same hand-signs dataset that we used earlier and pass it to the model.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)
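The label conversion is what turns each integer class label into a row of length 6, which is why Y_train has shape (1080, 6). The helper convert_to_one_hot comes from resnets_utils and is not shown here. The function below is a hypothetical, dependency-free stand-in that produces the same (examples, classes) layout.

```python
def to_one_hot(labels, num_classes):
    """Map integer labels to one-hot rows, giving shape (len(labels), num_classes)."""
    return [[1.0 if k == label else 0.0 for k in range(num_classes)]
            for label in labels]

encoded = to_one_hot([0, 5, 2], 6)
for row in encoded:
    print(row)
```

Each row contains a single 1.0 at the index of its class, which is the format expected by the categorical_crossentropy loss.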


Now let’s train our ResNet model for 20 epochs with a batch size of 32.

model.fit(X_train, Y_train, epochs = 20, batch_size = 32)
Epoch 1/20
1080/1080 [==============================] - 173s - loss: 2.0610 - acc: 0.3435
Epoch 2/20
1080/1080 [==============================] - 149s - loss: 1.8561 - acc: 0.4259
Epoch 3/20
1080/1080 [==============================] - 147s - loss: 2.0284 - acc: 0.4343
Epoch 4/20
1080/1080 [==============================] - 151s - loss: 1.7140 - acc: 0.4500
Epoch 5/20
1080/1080 [==============================] - 134s - loss: 1.4401 - acc: 0.5676
Epoch 6/20
1080/1080 [==============================] - 128s - loss: 1.1950 - acc: 0.6481
Epoch 7/20
1080/1080 [==============================] - 129s - loss: 0.9886 - acc: 0.7426
Epoch 8/20
1080/1080 [==============================] - 133s - loss: 1.2155 - acc: 0.6843
Epoch 9/20
1080/1080 [==============================] - 131s - loss: 0.8536 - acc: 0.8185
Epoch 10/20
1080/1080 [==============================] - 132s - loss: 0.9502 - acc: 0.7565
Epoch 11/20
1080/1080 [==============================] - 129s - loss: 0.8180 - acc: 0.8111
Epoch 12/20
1080/1080 [==============================] - 130s - loss: 0.7060 - acc: 0.8343
Epoch 13/20
1080/1080 [==============================] - 130s - loss: 0.8687 - acc: 0.8148
Epoch 14/20
1080/1080 [==============================] - 130s - loss: 0.8282 - acc: 0.8509
Epoch 15/20
1080/1080 [==============================] - 130s - loss: 0.9303 - acc: 0.7972
Epoch 16/20
1080/1080 [==============================] - 146s - loss: 1.1211 - acc: 0.7870
Epoch 17/20
1080/1080 [==============================] - 143s - loss: 0.9337 - acc: 0.7824
Epoch 18/20
1080/1080 [==============================] - 150s - loss: 0.3976 - acc: 0.8870
Epoch 19/20
1080/1080 [==============================] - 143s - loss: 0.2532 - acc: 0.9407
Epoch 20/20
1080/1080 [==============================] - 133s - loss: 0.2528 - acc: 0.9556

Let’s see how this model performs on the test set.

preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
Loss = 0.36906948487
Test Accuracy = 0.891666662693
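To compare this result with the simpler conv-net from the first part, it helps to convert the reported accuracies back into counts of correctly classified test images. This is a small arithmetic sketch with a made-up helper name; the accuracy values are the ones printed above.

```python
def correct_count(accuracy, num_examples):
    """Number of correctly classified examples implied by an accuracy value."""
    return round(accuracy * num_examples)

convnet_correct = correct_count(0.841667, 120)       # first model, test set
resnet_correct = correct_count(0.891666662693, 120)  # ResNet50, test set
print(convnet_correct, resnet_correct)
```

On the 120 test images this is 101 versus 107 correct predictions, so the gain from the ResNet after 20 epochs amounts to 6 additional images.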

We can also print a summary of our model by running the following code.

model.summary()

____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to
====================================================================================================
input_1 (InputLayer)             (None, 64, 64, 3)     0
____________________________________________________________________________________________________
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, 32, 32, 64)    9472
____________________________________________________________________________________________________
bn_conv1 (BatchNormalization)    (None, 32, 32, 64)    256
____________________________________________________________________________________________________
activation_4 (Activation)        (None, 32, 32, 64)    0
____________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2a_branch2a (Conv2D)          (None, 15, 15, 64)    4160
____________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_5 (Activation)        (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2a_branch2b (Conv2D)          (None, 15, 15, 64)    36928
____________________________________________________________________________________________________
bn2a_branch2b (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_6 (Activation)        (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2a_branch1 (Conv2D)           (None, 15, 15, 256)   16640
____________________________________________________________________________________________________
res2a_branch2c (Conv2D)          (None, 15, 15, 256)   16640
____________________________________________________________________________________________________
bn2a_branch1 (BatchNormalization (None, 15, 15, 256)   1024
____________________________________________________________________________________________________
bn2a_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_7 (Activation)        (None, 15, 15, 256)   0
____________________________________________________________________________________________________
res2b_branch2a (Conv2D)          (None, 15, 15, 64)    16448
____________________________________________________________________________________________________
bn2b_branch2a (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_8 (Activation)        (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2b_branch2b (Conv2D)          (None, 15, 15, 64)    36928
____________________________________________________________________________________________________
bn2b_branch2b (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_9 (Activation)        (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2b_branch2c (Conv2D)          (None, 15, 15, 256)   16640
____________________________________________________________________________________________________
bn2b_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_10 (Activation)       (None, 15, 15, 256)   0
____________________________________________________________________________________________________
res2c_branch2a (Conv2D)          (None, 15, 15, 64)    16448
____________________________________________________________________________________________________
bn2c_branch2a (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_11 (Activation)       (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2c_branch2b (Conv2D)          (None, 15, 15, 64)    36928
____________________________________________________________________________________________________
bn2c_branch2b (BatchNormalizatio (None, 15, 15, 64)    256
____________________________________________________________________________________________________
activation_12 (Activation)       (None, 15, 15, 64)    0
____________________________________________________________________________________________________
res2c_branch2c (Conv2D)          (None, 15, 15, 256)   16640
____________________________________________________________________________________________________
bn2c_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_13 (Activation)       (None, 15, 15, 256)   0
____________________________________________________________________________________________________
res3a_branch2a (Conv2D)          (None, 8, 8, 128)     32896
____________________________________________________________________________________________________
bn3a_branch2a (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_14 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3a_branch2b (Conv2D)          (None, 8, 8, 128)     147584
____________________________________________________________________________________________________
bn3a_branch2b (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_15 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3a_branch1 (Conv2D)           (None, 8, 8, 512)     131584
____________________________________________________________________________________________________
res3a_branch2c (Conv2D)          (None, 8, 8, 512)     66048
____________________________________________________________________________________________________
bn3a_branch1 (BatchNormalization (None, 8, 8, 512)     2048
____________________________________________________________________________________________________
bn3a_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_16 (Activation)       (None, 8, 8, 512)     0
____________________________________________________________________________________________________
res3b_branch2a (Conv2D)          (None, 8, 8, 128)     65664
____________________________________________________________________________________________________
bn3b_branch2a (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_17 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3b_branch2b (Conv2D)          (None, 8, 8, 128)     147584
____________________________________________________________________________________________________
bn3b_branch2b (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_18 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3b_branch2c (Conv2D)          (None, 8, 8, 512)     66048
____________________________________________________________________________________________________
bn3b_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_19 (Activation)       (None, 8, 8, 512)     0
____________________________________________________________________________________________________
res3c_branch2a (Conv2D)          (None, 8, 8, 128)     65664
____________________________________________________________________________________________________
bn3c_branch2a (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_20 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3c_branch2b (Conv2D)          (None, 8, 8, 128)     147584
____________________________________________________________________________________________________
bn3c_branch2b (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_21 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3c_branch2c (Conv2D)          (None, 8, 8, 512)     66048
____________________________________________________________________________________________________
bn3c_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_22 (Activation)       (None, 8, 8, 512)     0
____________________________________________________________________________________________________
res3d_branch2a (Conv2D)          (None, 8, 8, 128)     65664
____________________________________________________________________________________________________
bn3d_branch2a (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_23 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3d_branch2b (Conv2D)          (None, 8, 8, 128)     147584
____________________________________________________________________________________________________
bn3d_branch2b (BatchNormalizatio (None, 8, 8, 128)     512
____________________________________________________________________________________________________
activation_24 (Activation)       (None, 8, 8, 128)     0
____________________________________________________________________________________________________
res3d_branch2c (Conv2D)          (None, 8, 8, 512)     66048
____________________________________________________________________________________________________
bn3d_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_25 (Activation)       (None, 8, 8, 512)     0
____________________________________________________________________________________________________
res4a_branch2a (Conv2D)          (None, 4, 4, 256)     131328
____________________________________________________________________________________________________
bn4a_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_26 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4a_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4a_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_27 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4a_branch1 (Conv2D)           (None, 4, 4, 1024)    525312
____________________________________________________________________________________________________
res4a_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4a_branch1 (BatchNormalization (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
bn4a_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_28 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res4b_branch2a (Conv2D)          (None, 4, 4, 256)     262400
____________________________________________________________________________________________________
bn4b_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_29 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4b_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4b_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_30 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4b_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4b_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_31 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res4c_branch2a (Conv2D)          (None, 4, 4, 256)     262400
____________________________________________________________________________________________________
bn4c_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_32 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4c_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4c_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_33 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4c_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4c_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_34 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res4d_branch2a (Conv2D)          (None, 4, 4, 256)     262400
____________________________________________________________________________________________________
bn4d_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_35 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4d_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4d_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_36 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4d_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4d_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_37 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res4e_branch2a (Conv2D)          (None, 4, 4, 256)     262400
____________________________________________________________________________________________________
bn4e_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_38 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4e_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4e_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_39 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4e_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4e_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_40 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res4f_branch2a (Conv2D)          (None, 4, 4, 256)     262400
____________________________________________________________________________________________________
bn4f_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_41 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4f_branch2b (Conv2D)          (None, 4, 4, 256)     590080
____________________________________________________________________________________________________
bn4f_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024
____________________________________________________________________________________________________
activation_42 (Activation)       (None, 4, 4, 256)     0
____________________________________________________________________________________________________
res4f_branch2c (Conv2D)          (None, 4, 4, 1024)    263168
____________________________________________________________________________________________________
bn4f_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_43 (Activation)       (None, 4, 4, 1024)    0
____________________________________________________________________________________________________
res5a_branch2a (Conv2D)          (None, 2, 2, 512)     524800
____________________________________________________________________________________________________
bn5a_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_44 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5a_branch2b (Conv2D)          (None, 2, 2, 512)     2359808
____________________________________________________________________________________________________
bn5a_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_45 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5a_branch1 (Conv2D)           (None, 2, 2, 2048)    2099200
____________________________________________________________________________________________________
res5a_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624
____________________________________________________________________________________________________
bn5a_branch1 (BatchNormalization (None, 2, 2, 2048)    8192
____________________________________________________________________________________________________
bn5a_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_46 (Activation)       (None, 2, 2, 2048)    0
____________________________________________________________________________________________________
res5b_branch2a (Conv2D)          (None, 2, 2, 512)     1049088
____________________________________________________________________________________________________
bn5b_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_47 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5b_branch2b (Conv2D)          (None, 2, 2, 512)     2359808
____________________________________________________________________________________________________
bn5b_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_48 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5b_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624
____________________________________________________________________________________________________
bn5b_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_49 (Activation)       (None, 2, 2, 2048)    0
____________________________________________________________________________________________________
res5c_branch2a (Conv2D)          (None, 2, 2, 512)     1049088
____________________________________________________________________________________________________
bn5c_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_50 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5c_branch2b (Conv2D)          (None, 2, 2, 512)     2359808
____________________________________________________________________________________________________
bn5c_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048
____________________________________________________________________________________________________
activation_51 (Activation)       (None, 2, 2, 512)     0
____________________________________________________________________________________________________
res5c_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624
____________________________________________________________________________________________________
bn5c_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192
____________________________________________________________________________________________________
____________________________________________________________________________________________________
activation_52 (Activation)       (None, 2, 2, 2048)    0
____________________________________________________________________________________________________
avg_pool (AveragePooling2D)      (None, 1, 1, 2048)    0
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 2048)          0
____________________________________________________________________________________________________
fc6 (Dense)                      (None, 6)             12294
====================================================================================================
Total params: 23,600,006.0
Trainable params: 23,546,886.0
Non-trainable params: 53,120.0
_____________________________
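
The parameter counts in the summary above can be checked by hand. The sketch below assumes the usual formulas: a Conv2D layer with bias has k_h × k_w × c_in × c_out + c_out parameters, a BatchNormalization layer stores 4 values per channel (gamma, beta, moving mean, moving variance), and a Dense layer has in × out + out parameters. The helper names are illustrative and not part of Keras, and the kernel sizes (1×1 and 3×3) are assumed from the standard ResNet50 bottleneck design.

```python
def conv2d_params(kernel_h, kernel_w, in_channels, out_channels, use_bias=True):
    """Number of parameters of a 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if use_bias else 0)

def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance: 4 values per channel."""
    return 4 * channels

def dense_params(in_features, out_features):
    """Weight matrix plus one bias per output unit."""
    return in_features * out_features + out_features

print(conv2d_params(1, 1, 256, 512))  # res3a_branch1:  131584
print(conv2d_params(3, 3, 128, 128))  # res3b_branch2b: 147584
print(batchnorm_params(128))          # bn3b_branch2b:  512
print(dense_params(2048, 6))          # fc6:            12294
```

Half of each BatchNormalization count (the moving mean and variance) is not updated by gradient descent, which is where the non-trainable parameters in the totals come from.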

Finally, the next figure shows the visualization of our ResNet50.

### Key points

• Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
• The skip-connections help to address the vanishing gradient problem. They also make it easy for a ResNet block to learn an identity function.
• There are two main types of blocks: the identity block and the convolutional block.
• Very deep Residual Networks are built by stacking these blocks together.
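
To illustrate why a skip-connection makes the identity function easy to learn, here is a minimal NumPy sketch. It is not the Keras implementation used above: the convolutions are replaced by plain matrix products, and the function name `residual_block` is hypothetical. When the weights of the main path are zero, the block simply passes its (non-negative) input through.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """y = relu(F(x) + x), with F(x) = relu(x @ W1) @ W2 as a dense stand-in for the conv layers."""
    shortcut = x              # skip-connection keeps a copy of the input
    out = relu(x @ W1)        # first layer of the main path
    out = out @ W2            # second layer of the main path (no activation yet)
    return relu(out + shortcut)

rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))  # non-negative activations, as after a previous ReLU
W_zero = np.zeros((8, 8))

y = residual_block(x, W_zero, W_zero)
print(np.allclose(y, x))  # True: with zero weights the block is the identity
```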

### References

This article presents the ResNet algorithm of He et al. (2015). The implementation here also draws significant inspiration from, and follows the structure of, the GitHub repository of François Chollet.

# Classifying a Face as Happy/Unhappy and Face Recognition using a Pre-trained Deep Inception Network with Keras in Python

In this article, a couple of problems are going to be discussed. Both problems appeared as assignments in the Coursera course Convolutional Neural Networks (part of the Deep Learning Specialization) taught by Stanford Prof. Andrew Ng (deeplearning.ai). The problem descriptions are taken from the course itself.

## 1. Classifying a Face Image as Happy/Unhappy

• Given:
• 600 RGB (labeled) training images each of size 64×64, with labels 0 (not happy) and 1 (happy).
• 150 (unlabeled) test images (the ground-truth labels are provided separately).
• Train a deep convolutional neural net model for binary classification.
• Use the model to predict the labels of the test images and evaluate the model using the ground truth.

Details of the “Happy” dataset:

Images are of shape (64,64,3)
Training: 600 pictures

Test: 150 pictures

It is now time to solve the “Happy” Challenge.

import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
import keras.backend as K
K.set_image_data_format('channels_last')
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

Then let’s load and normalize the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print("number of training examples = " + str(X_train.shape[0]))
print("number of test examples = " + str(X_test.shape[0]))
print("X_train shape: " + str(X_train.shape))
print("Y_train shape: " + str(Y_train.shape))
print("X_test shape: " + str(X_test.shape))
print("Y_test shape: " + str(Y_test.shape))

number of training examples = 600
number of test examples = 150
X_train shape: (600, 64, 64, 3)
Y_train shape: (600, 1)
X_test shape: (150, 64, 64, 3)
Y_test shape: (150, 1)

Now let’s find the number of labeled happy and unhappy faces in the training dataset.

print(X_train[Y_train.ravel()==1].shape, X_train[Y_train.ravel()==0].shape)

(300, 64, 64, 3) (300, 64, 64, 3)

As can be seen, there are equal numbers of positive and negative examples in the training dataset. The following figures show a few samples drawn from each class.

### Building a model in Keras

Keras is very good for rapid prototyping. In just a short time we shall be able to build a model that achieves outstanding results.

Let’s implement a HappyModel() with the following architecture:

def HappyModel(input_shape):
    """
    Implementation of the HappyModel.

    Arguments:
    input_shape -- shape of the images of the dataset

    Returns:
    model -- a Model() instance in Keras
    """

    # Define the input placeholder as a tensor with shape input_shape. Think of
    # this as our input image!
    X_input = Input(input_shape)

    # Zero-padding: pad the border of X_input with 3 pixels of zeros
    X = ZeroPadding2D((3, 3))(X_input)

    # CONV -> BN -> RELU Block applied to X
    X = Conv2D(32, (7, 7), strides=(1, 1), name='conv0')(X)
    X = BatchNormalization(axis=3, name='bn0')(X)
    X = Activation('relu')(X)

    # MAXPOOL
    X = MaxPooling2D((2, 2), name='max_pool')(X)

    # FLATTEN X (means convert it to a vector) + FULLYCONNECTED
    X = Flatten()(X)
    X = Dense(1, activation='sigmoid', name='fc')(X)

    # Create model. This creates our Keras model instance, you'll use this instance
    # to train/test the model.
    model = Model(inputs=X_input, outputs=X, name='HappyModel')

    return model
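
To see how the tensor shapes and parameter counts evolve through this small network, the following sketch traces them by hand. It assumes a 3-pixel zero padding before the 7×7 convolution (as the ZeroPadding2D import suggests), the Keras defaults of 'valid' padding for Conv2D, and a max-pooling stride equal to the pool size. The helper `conv_output_size` is a hypothetical name used only for this illustration.

```python
def conv_output_size(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * padding - kernel) // stride + 1

side, channels = 64, 3

# 3-pixel zero padding followed by a 7x7 'valid' convolution with stride 1
side = conv_output_size(side, kernel=7, stride=1, padding=3)
conv_params = 7 * 7 * channels * 32 + 32   # weights + biases
channels = 32

# Batch normalization keeps 4 values per channel
bn_params = 4 * channels

# 2x2 max pooling with stride 2 halves the spatial size
side = conv_output_size(side, kernel=2, stride=2)

# Flatten, then a single sigmoid unit
flat_size = side * side * channels
dense_params = flat_size * 1 + 1

total_params = conv_params + bn_params + dense_params
print(side, flat_size, total_params)  # 32 32768 37633
```

Almost all of the parameters sit in the final dense layer, which is typical when a large feature map is flattened directly into a classifier.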

Step 1: Let’s first create the model.

happyModel = HappyModel((64,64,3))

Step 2:  Compile the model to configure the learning process, keeping in view that the Happy Challenge is a binary classification problem.

happyModel.compile(optimizer="Adam", loss="binary_crossentropy", metrics=["accuracy"])

Step 3: Train the model. Choose the number of epochs and the batch size.

happyModel.fit(x = X_train, y = Y_train, epochs = 20, batch_size = 32)

Epoch 1/20
600/600 [==============================] – 6s – loss: 1.0961 – acc: 0.6750
Epoch 2/20
600/600 [==============================] – 7s – loss: 0.4198 – acc: 0.8250
Epoch 3/20
600/600 [==============================] – 8s – loss: 0.1933 – acc: 0.9250
Epoch 4/20
600/600 [==============================] – 7s – loss: 0.1165 – acc: 0.9567
Epoch 5/20
600/600 [==============================] – 6s – loss: 0.1224 – acc: 0.9500
Epoch 6/20
600/600 [==============================] – 6s – loss: 0.0970 – acc: 0.9667
Epoch 7/20
600/600 [==============================] – 7s – loss: 0.0639 – acc: 0.9850
Epoch 8/20
600/600 [==============================] – 7s – loss: 0.0841 – acc: 0.9700
Epoch 9/20
600/600 [==============================] – 8s – loss: 0.0934 – acc: 0.9733
Epoch 10/20
600/600 [==============================] – 7s – loss: 0.0677 – acc: 0.9767
Epoch 11/20
600/600 [==============================] – 6s – loss: 0.0705 – acc: 0.9650
Epoch 12/20
600/600 [==============================] – 7s – loss: 0.0548 – acc: 0.9783
Epoch 13/20
600/600 [==============================] – 7s – loss: 0.0533 – acc: 0.9800
Epoch 14/20
600/600 [==============================] – 7s – loss: 0.0517 – acc: 0.9850
Epoch 15/20
600/600 [==============================] – 7s – loss: 0.0665 – acc: 0.9750
Epoch 16/20
600/600 [==============================] – 7s – loss: 0.0273 – acc: 0.9917
Epoch 17/20
600/600 [==============================] – 7s – loss: 0.0291 – acc: 0.9933
Epoch 18/20
600/600 [==============================] – 6s – loss: 0.0245 – acc: 0.9917
Epoch 19/20
600/600 [==============================] – 7s – loss: 0.0376 – acc: 0.9883
Epoch 20/20
600/600 [==============================] – 7s – loss: 0.0440 – acc: 0.9917

Note that if we run fit() again, the model will continue to train with the parameters it has already learnt instead of re-initializing them.

Step 4: Test/evaluate the model.

preds = happyModel.evaluate(x = X_test, y = Y_test)
print()
print("Loss = " + str(preds[0]))
print("Test Accuracy = " + str(preds[1]))

150/150 [==============================] – 0s

Loss = 0.167731122573
Test Accuracy = 0.94666667064

As can be seen, our model gets around 95% test accuracy in 20 epochs (and 99% train accuracy).
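
The two numbers reported by evaluate() can be reproduced from predicted probabilities. The sketch below is a plain NumPy illustration of the binary cross-entropy loss and of accuracy at a 0.5 threshold, not the Keras internals; the function names, the clipping constant and the toy labels and probabilities are assumptions made for the example.

```python
import numpy as np

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y*log(p) + (1-y)*log(1-p)], with p clipped away from 0 and 1."""
    y_prob = np.clip(y_prob, eps, 1 - eps)
    losses = -(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))
    return float(np.mean(losses))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    y_pred = (y_prob > threshold).astype(int)
    return float(np.mean(y_pred == y_true))

# Toy labels and sigmoid outputs (made up for illustration)
y_true = np.array([1, 0, 1, 1, 0])
y_prob = np.array([0.9, 0.2, 0.6, 0.4, 0.1])

print(binary_cross_entropy(y_true, y_prob))
print(binary_accuracy(y_true, y_prob))  # 0.8 (the fourth example is misclassified)
```

With 150 test images, the reported test accuracy of 0.9467 is consistent with 142 correct predictions.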

### Test with my own image

Let’s test on my own image to see how well the model generalizes on unseen face images.

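img_path = 'me_happy.png'
# load the image and resize it to the 64x64 input size expected by the model
img = image.load_img(img_path, target_size=(64, 64))
imshow(img)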

x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
print(happyModel.predict(x))

[[ 1.]]             # Happy !

### Model Summary

happyModel.summary()

## 2. Face Recognition with a Deep Neural Net

Face recognition problems commonly fall into two categories:

1. Face Verification – “is this the claimed person?”. For example, at some airports, travelers can pass through customs by letting a system scan their passport and then verify that the person carrying it is its rightful owner. A mobile phone that unlocks using the owner’s face is also using face verification. This is a 1:1 matching problem.
2. Face Recognition – “who is this person?”. For example, a system that lets Baidu employees enter the office without otherwise identifying themselves is performing face recognition. This is a 1:K matching problem.

FaceNet learns a neural network that encodes a face image into a vector of 128 numbers. By comparing two such vectors, we can then determine if two pictures are of the same person.

In this assignment, we shall:

• Implement the triplet loss function
• Use a pretrained model to map face images into 128-dimensional encodings
• Use these encodings to perform face verification and face recognition

In this exercise, we will be using a pre-trained model which represents ConvNet activations using a “channels first” convention, as opposed to the “channels last” convention.

In other words, a batch of images will be of shape (m,n_C,n_H,n_W) instead of (m,n_H,n_W,n_C). Both of these conventions have a reasonable amount of traction among open-source implementations; there isn’t a uniform standard yet within the deep learning community.
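
Converting between the two conventions is a matter of permuting axes. As a small NumPy sketch (the batch size of 5 and the array names are arbitrary choices for illustration):

```python
import numpy as np

# A batch of 5 RGB images of size 96x96 in "channels last" layout: (m, n_H, n_W, n_C)
batch_last = np.zeros((5, 96, 96, 3))

# Move the channel axis to position 1 to get "channels first": (m, n_C, n_H, n_W)
batch_first = np.transpose(batch_last, (0, 3, 1, 2))
print(batch_first.shape)  # (5, 3, 96, 96)

# And back again
restored = np.transpose(batch_first, (0, 2, 3, 1))
print(restored.shape)     # (5, 96, 96, 3)
```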

### Naive Face Verification

In Face Verification, we’re given two images and we have to tell if they are of the same person. The simplest way to do this is to compare the two images pixel-by-pixel. If the distance between the raw images is less than a chosen threshold, it may be the same person!

Of course, this algorithm performs really poorly, since the pixel values change dramatically due to variations in lighting, orientation of the person’s face, even minor changes in head position, and so on.
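
A minimal sketch of this naive approach shows how fragile it is. The function name `naive_same_person`, the synthetic "image" and the threshold of 10 are all hypothetical choices for illustration: the same picture with a uniform brightness change already lands far beyond the threshold.

```python
import numpy as np

def naive_same_person(img1, img2, threshold):
    """Compare two images by the L2 distance between their raw pixel values."""
    dist = float(np.linalg.norm(img1.astype(float) - img2.astype(float)))
    return dist, dist < threshold

# Synthetic 64x64 RGB "image" with values in [0, 0.5]
face = np.linspace(0.0, 0.5, 64 * 64 * 3).reshape(64, 64, 3)

# The same image under brighter lighting (every pixel increased by 0.4)
brighter = face + 0.4

dist, same = naive_same_person(face, brighter, threshold=10.0)
print(dist, same)  # distance is about 44.3, so the pair is rejected
```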

We’ll see that rather than using the raw image, we can learn an encoding f(img) so that element-wise comparisons of this encoding give more accurate judgments as to whether two pictures are of the same person.

### Encoding face images into a 128-dimensional vector

#### Using a ConvNet to compute encodings

The FaceNet model takes a lot of data and a long time to train. So, following common practice in applied deep learning, let’s just load weights that someone else has already trained. The network architecture follows the Inception model from Szegedy et al. We are going to use an Inception network implementation.

This network uses 96×96 RGB images as its input. Specifically, it takes a face image (or a batch of m face images) as a tensor of shape (m,n_C,n_H,n_W) = (m,3,96,96).
It outputs a matrix of shape (m,128) that encodes each input face image into a 128-dimensional vector.

Let’s create the model for face images.

FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print("Total Params:", FRmodel.count_params())

Total Params: 3743280

By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. We then use the encodings to compare two face images as follows:

By computing a distance between two encodings and thresholding, we can determine if the two pictures represent the same person.

So, an encoding is a good one if:

• The encodings of two images of the same person are quite similar to each other
• The encodings of two images of different persons are very different

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.

In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N).
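
As a rough illustration of the idea, here is a NumPy sketch of a triplet loss of the form sum over triplets of max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0). It is not the Keras/TensorFlow function passed to compile() in this article; the name `triplet_loss_np`, the margin α = 0.2 and the toy 2-dimensional encodings are assumptions for the example.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, alpha=0.2):
    """Triplet loss over a batch of encodings, each of shape (batch, dim)."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)  # squared distance A-P
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)  # squared distance A-N
    # Penalize triplets where the positive is not closer than the negative by the margin
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

anchor = np.array([[0.0, 0.0]])
near = np.array([[0.1, 0.0]])
far = np.array([[1.0, 0.0]])

print(triplet_loss_np(anchor, near, far))  # 0.0: positive is much closer than negative
print(triplet_loss_np(anchor, far, near))  # about 1.19: roles swapped, so the loss is large
```

The margin α requires the negative to be farther away than the positive by at least α before a triplet stops contributing to the loss.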

FaceNet is trained by minimizing the triplet loss. But since training requires a lot of data and a lot of computation, we won’t train it from scratch here. Instead, we load a previously trained model. Let’s load a model using the following code; this might take a couple of minutes to run.

FRmodel.compile(optimizer='adam', loss=triplet_loss, metrics=['accuracy'])

Here is the summary of the very deep inception network:

The next figure shows what the pre-trained deep inception network looks like:

Here are some examples of distances between the encodings of three individuals:

Let’s now use this model to perform face verification and face recognition!

### Face Verification

Let’s build a database that maps each person’s name to a 128-dimensional encoding of their face, with one encoding vector per person. To generate an encoding we use img_to_encoding(image_path, model), which runs the forward propagation of the model on the specified image.

This can now be used for automated employee verification at an office gate as follows: when someone shows up at the front door and swipes their ID card (thus giving us their name), we can look up their encoding in the database, and use it to check if the person standing at the front door matches the name on the ID.

Let’s implement the verify() function, which checks whether the front-door camera picture (image_path) actually shows the person called “identity”. We have to go through the following steps:

• Compute the encoding of the image from image_path.
• Compute the distance between this encoding and the encoding of the identity image stored in the database.
• Open the door if the distance is less than the threshold 0.7; otherwise do not open it.

As presented above, we are going to use the L2 distance (np.linalg.norm).

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".

    Arguments:
    image_path -- path to an image
    identity -- string, name of the person whose identity you'd like to verify. Has to be a resident of the Happy house.
    database -- python dictionary mapping names of allowed people (strings) to their encodings (vectors).
    model -- your Inception model instance in Keras

    Returns:
    dist -- distance between the image_path and the image of "identity" in the database.
    door_open -- True, if the door should open. False otherwise.
    """

    ### CODE HERE ###

    return dist, door_open
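The three steps listed above can be sketched as follows. To keep the example self-contained it starts from an already computed encoding instead of calling img_to_encoding; the name verify_encoding and the toy 3-dimensional vectors are assumptions for illustration, not the assignment's solution.

```python
import numpy as np

def verify_encoding(encoding, identity, database, threshold=0.7):
    """Compare a precomputed encoding with the stored encoding of `identity`."""
    # L2 distance between the camera encoding and the database encoding
    dist = float(np.linalg.norm(encoding - database[identity]))
    # The door opens only if the distance is below the threshold
    door_open = dist < threshold
    return dist, door_open

# Toy encodings (real FaceNet encodings are 128-dimensional)
database = {"younes": np.array([0.6, 0.8, 0.0])}
print(verify_encoding(np.array([0.6, 0.7, 0.1]), "younes", database))
```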

Younes is trying to enter, and the camera takes a picture of him (“camera_0.jpg”). Let’s run the above verification algorithm on this picture and compare it with the encoding stored in the system:

verify("camera_0.jpg", "younes", database, FRmodel)

# output
It's younes, welcome home!
(0.67291224, True)

Benoit has been banned from the office and removed from the database. He stole Kian’s ID card and came back to try to present himself as Kian. The front-door camera took a picture of Benoit (“camera_2.jpg”). Let’s run the verification algorithm to check whether Benoit can enter.

verify("camera_2.jpg", "kian", database, FRmodel)

# output
It's not kian, please go away
(0.86543155, False)

### Face Recognition

In this case, we need to implement a face recognition system that takes as input an image, and figures out if it is one of the authorized persons (and if so, who). Unlike the previous face verification system, we will no longer get a person’s name as another input.

Implement who_is_it(). We have to go through the following steps:

• Compute the target encoding of the image from image_path.
• Find the encoding in the database that has the smallest distance to the target encoding:
- Initialize the min_dist variable to a large enough number (100). It keeps track of the closest encoding to the input’s encoding.
- Loop over the database dictionary’s names and encodings, using for (name, db_enc) in database.items().
- Compute the L2 distance between the target encoding and the current encoding from the database.
- If this distance is less than min_dist, set min_dist to dist and identity to name.

def who_is_it(image_path, database, model):
    """
    Implements face recognition for the happy house by finding who is the person on the image_path image.

    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of the person on the image
    model -- your Inception model instance in Keras

    Returns:
    min_dist -- the minimum distance between image_path encoding and the encodings from the database
    identity -- string, the name prediction for the person on image_path
    """

    ### CODE HERE ###

    return min_dist, identity
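The nearest-encoding search described in the steps above can be sketched like this. As before, the sketch takes a precomputed encoding; closest_identity and the toy vectors are hypothetical names and data, not the assignment's code.

```python
import numpy as np

def closest_identity(encoding, database):
    """Return (min_dist, identity) for the database entry nearest to `encoding`."""
    min_dist = 100.0   # large initial value, as suggested in the steps above
    identity = None
    for name, db_enc in database.items():
        # L2 distance between the target encoding and this database encoding
        dist = float(np.linalg.norm(encoding - db_enc))
        if dist < min_dist:
            min_dist, identity = dist, name
    return min_dist, identity

database = {
    "younes": np.array([1.0, 0.0, 0.0]),
    "bertrand": np.array([0.0, 1.0, 0.0]),
}
print(closest_identity(np.array([0.9, 0.1, 0.0]), database))
```

A practical system would probably also compare min_dist with a threshold (such as the 0.7 used for verification) and reject the image when no stored encoding is close enough.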

Younes is at the front door and the camera takes a picture of him (“camera_0.jpg”). Let’s see if our who_is_it() algorithm identifies Younes.

who_is_it("camera_0.jpg", database, FRmodel)

# output
it's younes, the distance is 0.672912
(0.67291224, 'younes')

We can change “camera_0.jpg” (picture of younes) to “camera_1.jpg” (picture of bertrand) and see the result.

who_is_it("camera_1.jpg", database, FRmodel)

# output
it's bertrand, the distance is 0.474829
(0.47482917, 'bertrand')

Here is the takeaway:

• Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
• The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
• The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.

## References:

• Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering
• Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf (2014). DeepFace: Closing the gap to human-level performance in face verification
• The pretrained model we use is inspired by Victor Sy Wang’s implementation and was loaded using his code: https://github.com/iwantooxxoox/Keras-OpenFace.
• The implementation by Ng et al. also took a lot of inspiration from the official FaceNet GitHub repository: https://github.com/davidsandberg/facenet

# Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow (using Transfer Learning with a Pre-trained VGG-19 Network) in Python

This problem appeared as an assignment in the online Coursera course Convolutional Neural Networks by Prof. Andrew Ng (deeplearning.ai). The description of the problem is taken directly from the assignment.

The Neural Style Transfer algorithm was created by Gatys et al. (2015); the paper can be found here.

In this assignment, we shall:

• Implement the neural style transfer algorithm
• Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we shall optimize a cost function to get pixel values!

## Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

1. a “content” image (C) and
2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).

Let’s see how we can do this.

## Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low-level features (at the earlier layers) and high-level features (at the deeper layers). The following figure (taken from the Google image search results) shows what a VGG-19 convolutional neural net looks like, without the last fully-connected (FC) layers.

We run the following code to load parameters from the pre-trained VGG-19 model serialized in a MATLAB file. This takes a few seconds.

import pprint
pprint.pprint(model)

{'avgpool1': <tf.Tensor 'AvgPool_5:0' shape=(1, 150, 200, 64) dtype=float32>,
 'avgpool2': <tf.Tensor 'AvgPool_6:0' shape=(1, 75, 100, 128) dtype=float32>,
 'avgpool3': <tf.Tensor 'AvgPool_7:0' shape=(1, 38, 50, 256) dtype=float32>,
 'avgpool4': <tf.Tensor 'AvgPool_8:0' shape=(1, 19, 25, 512) dtype=float32>,
 'avgpool5': <tf.Tensor 'AvgPool_9:0' shape=(1, 10, 13, 512) dtype=float32>,
 'conv1_1': <tf.Tensor 'Relu_16:0' shape=(1, 300, 400, 64) dtype=float32>,
 'conv1_2': <tf.Tensor 'Relu_17:0' shape=(1, 300, 400, 64) dtype=float32>,
 'conv2_1': <tf.Tensor 'Relu_18:0' shape=(1, 150, 200, 128) dtype=float32>,
 'conv2_2': <tf.Tensor 'Relu_19:0' shape=(1, 150, 200, 128) dtype=float32>,
 'conv3_1': <tf.Tensor 'Relu_20:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_2': <tf.Tensor 'Relu_21:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_3': <tf.Tensor 'Relu_22:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_4': <tf.Tensor 'Relu_23:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv4_1': <tf.Tensor 'Relu_24:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_2': <tf.Tensor 'Relu_25:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_3': <tf.Tensor 'Relu_26:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_4': <tf.Tensor 'Relu_27:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv5_1': <tf.Tensor 'Relu_28:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_2': <tf.Tensor 'Relu_29:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_3': <tf.Tensor 'Relu_30:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_4': <tf.Tensor 'Relu_31:0' shape=(1, 19, 25, 512) dtype=float32>,
 'input': <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.

For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.

## How to ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.

First we need to compute the “content cost” using TensorFlow.

• The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
• When we minimize the content cost later, this will help make sure G has similar content as C.

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar that we need to compute using equation 1 above.
    """

    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Unroll a_C and a_G to shape (m, n_H * n_W, n_C); the cost is an
    # elementwise sum, so no transpose is needed before reshaping
    a_C_unrolled = tf.reshape(a_C, (m, n_H * n_W, n_C))
    a_G_unrolled = tf.reshape(a_G, (m, n_H * n_W, n_C))

    # Compute the cost with TensorFlow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2) / (4. * n_H * n_W * n_C)

    return J_content
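The same quantity can be checked on a tiny example with NumPy. This sketch assumes the content cost is the sum of squared activation differences scaled by 1 / (4 · n_H · n_W · n_C), which is what the TensorFlow code above computes; content_cost_np is a hypothetical helper name.

```python
import numpy as np

def content_cost_np(a_C, a_G):
    """Content cost for activations of shape (1, n_H, n_W, n_C)."""
    _, n_H, n_W, n_C = a_G.shape
    # Sum of squared differences, scaled by the normalization constant
    return float(np.sum((a_C - a_G) ** 2) / (4.0 * n_H * n_W * n_C))

a_C = np.zeros((1, 2, 2, 1))
a_G = np.ones((1, 2, 2, 1))
print(content_cost_np(a_C, a_G))  # 4 squared differences of 1, divided by 16
```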

## Computing the style cost

For our running example, we will use the following style image (S). This painting was painted in the style of impressionism by Claude Monet.

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """

    GA = tf.matmul(A, tf.transpose(A))
    return GA

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """

    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_C, n_H*n_W)
    a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
    a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))

    # Computing gram_matrices for both images S and G
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2 / (4. * (n_H * n_W * n_C)**2))

    return J_style_layer

• The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
• Minimizing the style cost will cause the image G to follow the style of the image S.
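To make the Gram matrix concrete, here is a small NumPy sketch of the per-layer style cost. It mirrors the TensorFlow functions above but is not the assignment's code; gram_matrix_np and layer_style_cost_np are hypothetical names, and the batch dimension is assumed to be 1.

```python
import numpy as np

def gram_matrix_np(A):
    """Gram matrix of A with shape (n_C, n_H*n_W): channel-by-channel correlations."""
    return A @ A.T

def layer_style_cost_np(a_S, a_G):
    """Style cost for one layer, for activations of shape (1, n_H, n_W, n_C)."""
    _, n_H, n_W, n_C = a_G.shape
    # Flatten the spatial positions and put channels first: (n_C, n_H*n_W)
    S = a_S.reshape(n_H * n_W, n_C).T
    G = a_G.reshape(n_H * n_W, n_C).T
    diff = gram_matrix_np(S) - gram_matrix_np(G)
    return float(np.sum(diff ** 2) / (4.0 * (n_H * n_W * n_C) ** 2))

A = np.array([[1.0, 2.0], [3.0, 4.0]])
print(gram_matrix_np(A))
print(layer_style_cost_np(np.zeros((1, 1, 2, 1)), np.ones((1, 1, 2, 1))))
```

Entry (i, j) of the Gram matrix is the dot product of channels i and j, so it measures how strongly two filters fire together regardless of where in the image they fire.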

## Defining the total cost to optimize

Finally, let’s implement a cost function that combines the style cost and the content cost, so that minimizing it reduces both. The formula is $J(G) = \alpha J_{content}(C, G) + \beta J_{style}(S, G)$:

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """

    J = alpha * J_content + beta * J_style
    return J

• The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
• α and β are hyperparameters that control the relative weighting between content and style.

## Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

• Create an Interactive Session
• Randomly initialize the image to be generated
• Build the TensorFlow graph:
- Run the content image through the VGG19 model and compute the content cost.
- Run the style image through the VGG19 model and compute the style cost.
- Compute the total cost.
• Define the optimizer and the learning rate.
• Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).

Now, we initialize the “generated” image as a noisy image created from the content_image. Initializing the pixels of the generated image to be mostly noise but still slightly correlated with the content image helps the content of the “generated” image match the content of the “content” image more rapidly. The following figure shows the noisy image:
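One way to build such a starting image is to blend uniform noise with the content image. This is a sketch under assumptions: make_noisy_image, the noise ratio of 0.6 and the noise range of ±20 are illustrative choices, not values stated in the text.

```python
import numpy as np

def make_noisy_image(content_image, noise_ratio=0.6, noise_scale=20.0, seed=0):
    """Blend uniform noise with the content image.

    noise_ratio and noise_scale are hypothetical defaults for this sketch.
    """
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-noise_scale, noise_scale, size=content_image.shape)
    # Mostly noise, but still correlated with the content image
    return noise_ratio * noise + (1.0 - noise_ratio) * content_image

content = np.full((1, 4, 4, 3), 100.0)   # stand-in for a preprocessed content image
generated = make_noisy_image(content)
print(generated.shape)
```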

Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

• Assign the content image to be the input to the VGG model.
• Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
• Set a_G to be the tensor giving the hidden layer activation for the same layer.
• Compute the content cost using a_C and a_G.

Next, we need to compute the style cost and compute the total cost J by taking a linear combination of the two. Use alpha = 10 and beta = 40.

Then we are going to set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the TensorFlow graph, assign the input image (the initial generated image) as the input of the VGG19 model, and run the model to minimize the total cost J for a large number of iterations.

## Results

The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.

Content

Style (Claude Monet’s The Poppy Field near Argenteuil)

Generated

Content

Style

Generated

Content

Style

Generated

Content

Style (Van Gogh’s The Starry Night)

Generated

Content

Style

Generated

Content (Victoria Memorial Hall)

Style (Van Gogh’s The Starry Night)

Generated

Content (Taj Mahal)

Style (Van Gogh’s Starry Night Over the Rhone)

Generated

Content

Style (Claude Monet’s Sunset in Venice)

Generated

Content (Visva Bharati)

Style (Abanindranath Tagore’s Rabindranath in the role of blind singer)

Generated

Content (Howrah Bridge)

Style (Van Gogh’s The Starry Night)

Generated

Content (Leonardo Da Vinci’s Mona Lisa)

Style (Van Gogh’s The Starry Night)

Generated

Content (My sketch: Rabindranath Tagore)

Style (Abanindranath Tagore’s Rabindranath in the role of blind singer)

Generated

Content (me)

Style (Van Gogh’s Irises)

Generated

Content

Style

Generated

Content

Style (Pablo Picasso’s Factory at Horta de Ebro)

Generated

The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.

Content

Style (Van Gogh’s The Starry Night)

Generated

convolution layer 3_2 used

convolution layer 4_2 used

convolution layer 5_2 used

# Some Deep Learning with Python, TensorFlow and Keras

The following problems are taken from a few assignments from the Coursera courses Introduction to Deep Learning (by Higher School of Economics) and Neural Networks and Deep Learning (by Prof. Andrew Ng, deeplearning.ai). The problem descriptions are taken directly from the assignments.

# 1. Linear models, Optimization

In this assignment a linear classifier will be implemented and it will be trained using stochastic gradient descent with numpy.

## Two-dimensional classification

To make things more intuitive, let’s solve a 2D classification problem with synthetic data.

## Features

As we can see, the data above isn’t linearly separable. Hence we should add features (or use a non-linear model). Note that the decision boundary between the two classes has the form of a circle, so we can add quadratic features to make the problem linearly separable. The idea is illustrated in the image below:

Here are some test results for the implemented expand function, which is used for adding quadratic features:

# simple test on random numbers

dummy_X = np.array([
[0,0],
[1,0],
[2.61,-1.28],
[-0.59,2.1]
])

# call expand function
dummy_expanded = expand(dummy_X)

# what it should have returned:   x0       x1       x0^2     x1^2     x0*x1    1
dummy_expanded_ans = np.array([[ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ],
[ 1.    ,  0.    ,  1.    ,  0.    ,  0.    ,  1.    ],
[ 2.61  , -1.28  ,  6.8121,  1.6384, -3.3408,  1.    ],
[-0.59  ,  2.1   ,  0.3481,  4.41  , -1.239 ,  1.    ]])
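The expand function itself is not shown above. A minimal sketch that reproduces the expected column order (x0, x1, x0^2, x1^2, x0*x1, 1) could look like this; it is one possible implementation, not necessarily the one used for the results in this post.

```python
import numpy as np

def expand(X):
    """Add quadratic features to a [n_samples, 2] matrix.

    Returns a [n_samples, 6] matrix with columns x0, x1, x0^2, x1^2, x0*x1, 1.
    """
    x0, x1 = X[:, 0], X[:, 1]
    return np.column_stack([x0, x1, x0**2, x1**2, x0 * x1, np.ones(X.shape[0])])

dummy_X = np.array([[0, 0], [1, 0], [2.61, -1.28], [-0.59, 2.1]])
print(expand(dummy_X))
```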

## Logistic regression

To classify objects we will estimate the probability that an object belongs to class ‘1’. To predict this probability we apply the logistic function to the output of a linear model:

def probability(X, w):
    """
    Given input features and weights
    return predicted probabilities of y==1 given x, P(y=1|x), see description above

    :param X: feature matrix X of shape [n_samples,6] (expanded)
    :param w: weight vector w of shape [6] for each of the expanded features
    :returns: an array of predicted probabilities in [0,1] interval.
    """

    return 1. / (1 + np.exp(-np.dot(X, w)))

In logistic regression the optimal parameters w are found by cross-entropy minimization:

def compute_loss(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute scalar loss function using formula above.
    """
    p = probability(X, w)
    return -np.mean(y*np.log(p) + (1-y)*np.log(1-p))

Since we train our model with gradient descent, we should compute gradients. To be specific, we need the following derivative of loss function over each weight:

Here is the derivation (can be found here too):

def compute_grad(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute vector [6] of derivatives of L over each weights.
    """

    return np.dot((probability(X, w) - y), X) / X.shape[0]
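The closed-form gradient used above is the average of (p_i − y_i) · x_i over the samples. A quick way to gain confidence in it is to compare against central finite differences; the check below is a sketch on random data, with numerical_grad as a hypothetical helper.

```python
import numpy as np

def probability(X, w):
    return 1.0 / (1.0 + np.exp(-np.dot(X, w)))

def compute_loss(X, y, w):
    p = probability(X, w)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

def compute_grad(X, y, w):
    return np.dot(probability(X, w) - y, X) / X.shape[0]

def numerical_grad(X, y, w, h=1e-6):
    """Central finite-difference approximation of the loss gradient."""
    grad = np.zeros(w.size)
    for j in range(w.size):
        step = np.zeros(w.size)
        step[j] = h
        grad[j] = (compute_loss(X, y, w + step) - compute_loss(X, y, w - step)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = rng.integers(0, 2, size=20).astype(float)
w = rng.normal(size=6)
max_gap = np.max(np.abs(compute_grad(X, y, w) - numerical_grad(X, y, w)))
print(max_gap)  # should be tiny if the analytic gradient is right
```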

## Training

In this section we’ll use the functions we wrote to train our classifier using stochastic gradient descent. We shall try to change hyper-parameters like batch size, learning rate and so on to find the best one.

## Mini-batch SGD

Mini-batch stochastic gradient descent takes a random batch of examples on each iteration, calculates the gradient of the loss on it, and makes a step:

w = np.array([0, 0, 0, 0, 0, 1]) # initialize

eta = 0.05 # learning rate
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    w = w - eta*dw

The following animation shows how the decision surface and the cross-entropy loss change over successive batches with SGD, where batch_size = 4.

## SGD with momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in the image below. It does this by adding a fraction α of the update vector of the past time step to the current update vector.

eta = 0.05 # learning rate
alpha = 0.9 # momentum
nu = np.zeros_like(w)
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    nu = alpha*nu + eta*dw
    w = w - nu

The following animation shows how the decision surface and the cross-entropy loss change over successive batches with SGD + momentum, where batch_size = 4. As can be seen, the loss drops much faster, leading to faster convergence.

## RMSprop

We also implement the RMSprop algorithm, which uses a moving average of squared gradients to adjust the learning rate as follows:

eta = 0.05 # learning rate
alpha = 0.9 # decay rate of the moving average of squared gradients
G = np.zeros_like(w)
eps = 1e-8
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    G = alpha*G + (1-alpha)*dw**2
    w = w - eta*dw / np.sqrt(G + eps)

The following animation shows how the decision surface and the cross-entropy loss change over successive batches with SGD + RMSprop, where batch_size = 4. As can be seen again, the loss drops much faster, leading to faster convergence.
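The three optimizers differ only in how a mini-batch gradient dw is turned into a weight update. Pulling the update rules out of the loops above into small functions makes the difference easy to inspect; the function names and the toy vectors are assumptions for illustration.

```python
import numpy as np

def sgd_step(w, dw, eta=0.05):
    """Plain SGD: step against the gradient."""
    return w - eta * dw

def momentum_step(w, nu, dw, eta=0.05, alpha=0.9):
    """SGD with momentum: nu accumulates a decaying sum of past steps."""
    nu = alpha * nu + eta * dw
    return w - nu, nu

def rmsprop_step(w, G, dw, eta=0.05, alpha=0.9, eps=1e-8):
    """RMSprop: scale each coordinate by a running RMS of its gradients."""
    G = alpha * G + (1 - alpha) * dw ** 2
    return w - eta * dw / np.sqrt(G + eps), G

w = np.array([1.0, -2.0])
dw = np.array([0.5, -0.5])
print(sgd_step(w, dw))
print(momentum_step(w, np.zeros_like(w), dw)[0])
print(rmsprop_step(w, np.zeros_like(w), dw)[0])
```

With zero initial state, the first momentum step equals the plain SGD step; the velocity only matters from the second step on. The first RMSprop step is larger here because the gradient is divided by its own (still small) running magnitude.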

# 2. Planar data classification with a neural network with one hidden layer, an implementation from scratch

In this assignment a neural net with a single hidden layer will be trained from scratch. We shall see a big difference between this model and the one implemented using logistic regression.

We shall learn how to:

• Implement a 2-class classification neural network with a single hidden layer
• Use units with a non-linear activation function, such as tanh
• Compute the cross entropy loss
• Implement forward and backward propagation

## Dataset

The following figure visualizes a “flower” 2-class dataset that we shall work on; the colors indicate the class labels. We have m = 400 training examples.

## Simple Logistic Regression

Before building a full neural network, let’s first see how logistic regression performs on this problem. We can use sklearn’s built-in functions to do that, by running the code below to train a logistic regression classifier on the dataset.

# Train the logistic regression classifier
import sklearn.linear_model
clf = sklearn.linear_model.LogisticRegressionCV()
clf.fit(X.T, Y.T)

We can now plot the decision boundary of the model and accuracy with the following code.

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")

# Print accuracy
LR_predictions = clf.predict(X.T)
print ('Accuracy of logistic regression: %d ' % float((np.dot(Y,LR_predictions) + np.dot(1-Y,1-LR_predictions))/float(Y.size)*100) +
'% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)

 Accuracy 47%

Interpretation: The dataset is not linearly separable, so logistic regression doesn’t perform well. Hopefully a neural network will do better. Let’s try this now!

## Neural Network model

Logistic regression did not work well on the “flower dataset”. We are going to train a Neural Network with a single hidden layer, by implementing the network with python numpy from scratch.

Here is our model:

The general methodology to build a Neural Network is to:

1. Define the neural network structure ( # of input units,  # of hidden units, etc).
2. Initialize the model's parameters
3. Loop:
- Implement forward propagation
- Compute loss
- Implement backward propagation to get the gradients


### Defining the neural network structure

Define three variables and the function layer_sizes:

- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 4)
- n_y: the size of the output layer

def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """

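The body of layer_sizes is not shown above. Given the shapes stated in the docstring, a minimal sketch could read the sizes off the arrays; layer_sizes_sketch and its n_h parameter are assumptions for illustration.

```python
import numpy as np

def layer_sizes_sketch(X, Y, n_h=4):
    """Return (n_x, n_h, n_y) from arrays shaped (features, examples)."""
    n_x = X.shape[0]   # number of input features
    n_y = Y.shape[0]   # number of output units
    return n_x, n_h, n_y

X_demo = np.zeros((2, 400))   # placeholder data with the flower dataset's shape
Y_demo = np.zeros((1, 400))
print(layer_sizes_sketch(X_demo, Y_demo))
```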
### Initialize the model’s parameters

Implement the function initialize_parameters().

Instructions:

• Make sure the parameters’ sizes are right. Refer to the neural network figure above if needed.
• We will initialize the weights matrices with random values.
• Use: np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
• We will initialize the bias vectors as zeros.
• Use: np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.

def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    params -- python dictionary containing the parameters:
    W1 -- weight matrix of shape (n_h, n_x)
    b1 -- bias vector of shape (n_h, 1)
    W2 -- weight matrix of shape (n_y, n_h)
    b2 -- bias vector of shape (n_y, 1)
    """

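Following the instructions above (small random weights, zero biases), the initialization could be sketched as follows. The seed argument is an assumption added for reproducibility and is not part of the original signature.

```python
import numpy as np

def initialize_parameters(n_x, n_h, n_y, seed=None):
    """Small random weights and zero biases for a one-hidden-layer network."""
    if seed is not None:
        np.random.seed(seed)   # hypothetical convenience for repeatable runs
    return {
        "W1": np.random.randn(n_h, n_x) * 0.01,
        "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * 0.01,
        "b2": np.zeros((n_y, 1)),
    }

params = initialize_parameters(2, 4, 1, seed=0)
print({name: value.shape for name, value in params.items()})
```

The random weights break the symmetry between hidden units; if all weights started at zero, every hidden unit would compute the same function and receive the same gradient.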

### The Loop

Implement forward_propagation().

Instructions:

• Look above at the mathematical representation of the classifier.
• We can use the function sigmoid().
• We can use the function np.tanh(). It is part of the numpy library.
• The steps we have to implement are:
1. Retrieve each parameter from the dictionary “parameters” (which is the output of initialize_parameters()) by using parameters[".."].
2. Implement Forward Propagation. Compute $Z^{[1]}, A^{[1]}, Z^{[2]}$ and $A^{[2]}$ (the vector of all the predictions on all the examples in the training set).
• Values needed in the backpropagation are stored in “cache”. The cache will be given as an input to the backpropagation function.

def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing the parameters (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """

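A sketch of the forward pass, assuming a tanh hidden layer and a sigmoid output as described in the instructions above; the sigmoid helper is defined here so that the example is self-contained.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagation(X, parameters):
    """Forward pass: tanh hidden layer followed by a sigmoid output."""
    W1, b1 = parameters["W1"], parameters["b1"]
    W2, b2 = parameters["W2"], parameters["b2"]
    Z1 = W1 @ X + b1      # shape (n_h, m)
    A1 = np.tanh(Z1)
    Z2 = W2 @ A1 + b2     # shape (n_y, m)
    A2 = sigmoid(Z2)
    cache = {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}
    return A2, cache

# With all-zero parameters every prediction is exactly 0.5
zero_params = {"W1": np.zeros((4, 2)), "b1": np.zeros((4, 1)),
               "W2": np.zeros((1, 4)), "b2": np.zeros((1, 1))}
A2, cache = forward_propagation(np.ones((2, 3)), zero_params)
print(A2)
```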

Implement compute_cost() to compute the value of the cost J.  There are many ways to implement the cross-entropy loss.

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing the parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation (13)
    """

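Equation (13) is not reproduced in this text. Assuming it is the usual binary cross-entropy averaged over the m examples, the cost could be sketched as below; the unused parameters argument of the original signature is dropped here for brevity.

```python
import numpy as np

def compute_cost(A2, Y):
    """Binary cross-entropy averaged over the examples (columns)."""
    m = Y.shape[1]
    logprobs = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(logprobs) / m)

# A network that outputs 0.5 everywhere has cost ln(2), about 0.693
A2_demo = np.full((1, 4), 0.5)
Y_demo = np.array([[1, 0, 1, 0]])
print(compute_cost(A2_demo, Y_demo))
```

This is consistent with the training log further below, where the cost at iteration 0 (small random weights, so outputs near 0.5) is close to 0.693.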

Using the cache computed during forward propagation, we can now implement backward propagation.

Implement the function backward_propagation().

Instructions: Backpropagation is usually the hardest (most mathematical) part in deep learning. The following figure is the slide from the lecture on backpropagation. We’ll want to use the six equations on the right of this slide, since we are building a vectorized implementation.

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing the gradients with respect to different parameters
    """

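The slide with the six equations is not reproduced here. Assuming a tanh hidden layer, a sigmoid output and the cross-entropy cost, the standard vectorized gradients can be sketched as follows; treat this as an illustration of those equations rather than the assignment's exact solution.

```python
import numpy as np

def backward_propagation(parameters, cache, X, Y):
    """Vectorized gradients for a tanh hidden layer and a sigmoid output."""
    m = X.shape[1]
    W2 = parameters["W2"]
    A1, A2 = cache["A1"], cache["A2"]

    dZ2 = A2 - Y                                   # sigmoid + cross-entropy
    dW2 = dZ2 @ A1.T / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = (W2.T @ dZ2) * (1.0 - A1 ** 2)           # derivative of tanh is 1 - tanh^2
    dW1 = dZ1 @ X.T / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m
    return {"dW1": dW1, "db1": db1, "dW2": dW2, "db2": db2}

# Hand-built toy inputs: 2 features, 4 hidden units, 3 examples
X_demo = np.ones((2, 3))
Y_demo = np.array([[1.0, 0.0, 1.0]])
params_demo = {"W2": np.ones((1, 4))}
cache_demo = {"A1": np.zeros((4, 3)), "A2": np.full((1, 3), 0.5)}
grads = backward_propagation(params_demo, cache_demo, X_demo, Y_demo)
print({name: value.shape for name, value in grads.items()})
```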

Implement the update rule. Use gradient descent. We have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

General gradient descent rule: $\theta = \theta - \alpha \frac{\partial J}{\partial \theta}$, where α is the learning rate and θ represents a parameter.

Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.

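def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above

    Arguments:
    parameters -- python dictionary containing our parameters
    grads -- python dictionary containing the gradients (dW1, db1, dW2, db2)
    learning_rate -- step size of the gradient descent update

    Returns:
    parameters -- python dictionary containing our updated parameters
    """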

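Applying the rule above to every parameter takes one line per entry. The sketch below assumes the gradient for parameter "W1" is stored under the key "dW1", and so on, as in the docstring.

```python
import numpy as np

def update_parameters(parameters, grads, learning_rate=1.2):
    """One gradient descent step: theta = theta - learning_rate * dtheta."""
    return {name: value - learning_rate * grads["d" + name]
            for name, value in parameters.items()}

params_demo = {"W1": np.ones((4, 2)), "b1": np.ones((4, 1)),
               "W2": np.ones((1, 4)), "b2": np.ones((1, 1))}
grads_demo = {"d" + name: np.ones_like(value) for name, value in params_demo.items()}
updated = update_parameters(params_demo, grads_demo)
print(updated["b2"])
```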

### Integrate previous parts in nn_model()

Build the neural network model in nn_model().

Instructions: The neural network model has to use the previous functions in the right order.

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """


### Predictions

Use the model to predict by building predict(). Use forward propagation to predict results.

def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    parameters -- python dictionary containing our parameters
    X -- input data of size (n_x, m)

    Returns:
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """

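Prediction amounts to a forward pass followed by thresholding. The sketch below inlines the forward pass to stay self-contained and assumes a decision threshold of 0.5, which the text above does not state explicitly.

```python
import numpy as np

def predict(parameters, X, threshold=0.5):
    """Forward pass, then label 1 where the output probability exceeds the threshold."""
    A1 = np.tanh(parameters["W1"] @ X + parameters["b1"])
    Z2 = parameters["W2"] @ A1 + parameters["b2"]
    A2 = 1.0 / (1.0 + np.exp(-Z2))
    return (A2 > threshold).astype(int)

# Hand-picked toy parameters: one input, one hidden unit
toy_params = {"W1": np.array([[5.0]]), "b1": np.array([[0.0]]),
              "W2": np.array([[5.0]]), "b2": np.array([[0.0]])}
print(predict(toy_params, np.array([[-1.0, 1.0]])))
```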

It is time to run the model and see how it performs on a planar dataset. Run the following code to test our model with a single hidden layer of n_h hidden units.

# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(X, Y, n_h = 4, num_iterations = 10000, print_cost=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.288083
Cost after iteration 2000: 0.254385
Cost after iteration 3000: 0.233864
Cost after iteration 4000: 0.226792
Cost after iteration 5000: 0.222644
Cost after iteration 6000: 0.219731
Cost after iteration 7000: 0.217504
Cost after iteration 8000: 0.219471
Cost after iteration 9000: 0.218612

 Cost after iteration 9000 0.218607
# Print accuracy
predictions = predict(parameters, X)
print ('Accuracy: %d' % float((np.dot(Y,predictions.T) + np.dot(1-Y,1-predictions.T))/float(Y.size)*100) + '%')

Accuracy: 90%

Accuracy is really high compared to Logistic Regression. The model has learnt the leaf patterns of the flower! Neural networks are able to learn even highly non-linear decision boundaries, unlike logistic regression.

Now, let’s try out several hidden layer sizes. We can observe different behaviors of the model for various hidden layer sizes. The results are shown below.

### Tuning hidden layer size

Accuracy for 1 hidden units: 67.5 %
Accuracy for 2 hidden units: 67.25 %
Accuracy for 3 hidden units: 90.75 %
Accuracy for 4 hidden units: 90.5 %
Accuracy for 5 hidden units: 91.25 %
Accuracy for 20 hidden units: 90.0 %
Accuracy for 50 hidden units: 90.25 %


Interpretation:

• The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
• The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fit the data well without incurring noticeable overfitting.
• We shall also learn later about regularization, which lets us use very large models (such as n_h = 50) without much overfitting.

# 3. Getting deeper with Keras

• Tensorflow is a powerful and flexible tool, but coding large neural architectures with it is tedious.
• There are plenty of deep learning toolkits that work on top of it like Slim, TFLearn, Sonnet, Keras.
• The choice is a matter of taste and of the particular task.
• We’ll be using Keras to predict handwritten digits with the mnist dataset.
• The following figure shows 225 sample images from the dataset.

## The pretty keras

Using only the following few lines of code we can learn a simple deep neural net with 3 dense hidden layers and with Relu activation, with dropout 0.5 after each dense layer.

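import keras
from keras.models import Sequential
import keras.layers as ll

model = Sequential(name="mlp")
model.add(ll.InputLayer(input_shape=(28, 28)))
model.add(ll.Flatten())

# network body: 3 x (Dense(128) -> Relu -> Dropout(0.5)), matching the summary below
for _ in range(3):
    model.add(ll.Dense(128))
    model.add(ll.Activation('relu'))
    model.add(ll.Dropout(0.5))

# output layer: 10 neurons for each class with softmax
model.add(ll.Dense(10, activation='softmax'))

# categorical_crossentropy is our good old crossentropy
# but applied for one-hot-encoded vectors
# (the optimizer is not stated in the text; "adam" is an assumption)
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])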

The following shows the summary of the model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
input_12 (InputLayer)        (None, 28, 28)            0
_________________________________________________________________
flatten_12 (Flatten)         (None, 784)               0
_________________________________________________________________
dense_35 (Dense)             (None, 128)               100480
_________________________________________________________________
activation_25 (Activation)   (None, 128)               0
_________________________________________________________________
dropout_22 (Dropout)         (None, 128)               0
_________________________________________________________________
dense_36 (Dense)             (None, 128)               16512
_________________________________________________________________
activation_26 (Activation)   (None, 128)               0
_________________________________________________________________
dropout_23 (Dropout)         (None, 128)               0
_________________________________________________________________
dense_37 (Dense)             (None, 128)               16512
_________________________________________________________________
activation_27 (Activation)   (None, 128)               0
_________________________________________________________________
dropout_24 (Dropout)         (None, 128)               0
_________________________________________________________________
dense_38 (Dense)             (None, 10)                1290
=================================================================
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0
_________________________________________________________________
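
The parameter counts in the summary can be checked by hand: a dense layer with `n_inputs` inputs and `n_units` units has `n_inputs * n_units` weights plus `n_units` biases, while the flatten, activation and dropout layers have no parameters. A short sketch of that arithmetic (the helper name `dense_params` is only illustrative):

```python
def dense_params(n_inputs, n_units):
    """Weights plus biases of one fully connected layer."""
    return n_inputs * n_units + n_units

# 28x28 input flattened to 784, three hidden layers of 128 units, 10 output units
layer_sizes = [28 * 28, 128, 128, 128, 10]
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes[:-1], layer_sizes[1:])]
total = sum(per_layer)

print(per_layer, total)  # [100480, 16512, 16512, 1290] 134794
```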


### Model interface

Keras models follow Scikit-learn’s interface of fit/predict with some notable extensions. Let’s take a tour.

# fit(X,y) ships with a neat automatic logging.
#          Highly customizable under the hood.
model.fit(X_train, y_train,
validation_data=(X_val, y_val), epochs=13);

Train on 50000 samples, validate on 10000 samples
Epoch 1/13
50000/50000 [==============================] - 14s - loss: 0.1489 - acc: 0.9587 - val_loss: 0.0950 - val_acc: 0.9758
Epoch 2/13
50000/50000 [==============================] - 12s - loss: 0.1543 - acc: 0.9566 - val_loss: 0.0957 - val_acc: 0.9735
Epoch 3/13
50000/50000 [==============================] - 11s - loss: 0.1509 - acc: 0.9586 - val_loss: 0.0985 - val_acc: 0.9752
Epoch 4/13
50000/50000 [==============================] - 11s - loss: 0.1515 - acc: 0.9577 - val_loss: 0.0967 - val_acc: 0.9752
Epoch 5/13
50000/50000 [==============================] - 11s - loss: 0.1471 - acc: 0.9596 - val_loss: 0.1008 - val_acc: 0.9737
Epoch 6/13
50000/50000 [==============================] - 11s - loss: 0.1488 - acc: 0.9598 - val_loss: 0.0989 - val_acc: 0.9749
Epoch 7/13
50000/50000 [==============================] - 11s - loss: 0.1495 - acc: 0.9592 - val_loss: 0.1011 - val_acc: 0.9748
Epoch 8/13
50000/50000 [==============================] - 11s - loss: 0.1434 - acc: 0.9604 - val_loss: 0.1005 - val_acc: 0.9761
Epoch 9/13
50000/50000 [==============================] - 11s - loss: 0.1514 - acc: 0.9590 - val_loss: 0.0951 - val_acc: 0.9759
Epoch 10/13
50000/50000 [==============================] - 11s - loss: 0.1424 - acc: 0.9613 - val_loss: 0.0995 - val_acc: 0.9739
Epoch 11/13
50000/50000 [==============================] - 11s - loss: 0.1408 - acc: 0.9625 - val_loss: 0.0977 - val_acc: 0.9751
Epoch 12/13
50000/50000 [==============================] - 11s - loss: 0.1413 - acc: 0.9601 - val_loss: 0.0938 - val_acc: 0.9753
Epoch 13/13
50000/50000 [==============================] - 11s - loss: 0.1430 - acc: 0.9619 - val_loss: 0.0981 - val_acc: 0.9761

As the log shows, a simple model without any convolution layers reaches roughly 97.5% accuracy on the validation dataset (97.61% after the final epoch).

The following figures show the weights learnt at different layers.

# Some Tips & tricks to improve accuracy

Here are some tips on what we can do to improve accuracy:

• Network size
• More neurons,
• More layers, (docs)
• Nonlinearities in the hidden layers
• tanh, relu, leaky relu, etc
• Larger networks may take more epochs to train, so don’t discard the net just because it didn’t beat the baseline in 5 epochs.
• Early Stopping
• Training for 100 epochs regardless of anything is probably a bad idea.
• Some networks converge over 5 epochs, others – over 500.
• Way to go: stop when validation score is 10 iterations past maximum
• Faster optimization
• Converge faster and sometimes reach better optima
• It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
• Regularize to prevent overfitting
• Data augmentation – getting a 5x larger dataset for free is a great deal
• https://keras.io/preprocessing/image/
• Zoom-in+slice = move
• Rotate+zoom(to remove black stripes)
• any other perturbations
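• Simple way to do that (if we have PIL/Image):
• from scipy.misc import imrotate,imresize (both have since been removed from SciPy; scipy.ndimage.rotate and scipy.ndimage.zoom are possible replacements)
• and a bit of slicing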
• Stay realistic. There’s usually no point in flipping dogs upside down as that is not the way we usually see them.
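
The early stopping rule from the list above ("stop when validation score is 10 iterations past maximum") can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name, the `patience` parameter and the validation scores are made up, and a real training loop would compute one score per epoch instead of reading them from a list.

```python
def early_stopping_epoch(val_scores, patience=10):
    """Return (stop_epoch, best_epoch) for a sequence of validation scores.

    Training stops at the first epoch that is `patience` epochs past the
    best validation score seen so far, or at the last epoch otherwise.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(val_scores) - 1, best_epoch

# Made-up validation accuracies: the peak is at epoch 2, then a plateau
val_scores = [0.5, 0.6, 0.7] + [0.65] * 20
stop_epoch, best_epoch = early_stopping_epoch(val_scores, patience=10)
print(stop_epoch, best_epoch)  # 12 2
```

In practice the weights from `best_epoch` would be saved and restored once training stops.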

# Unsupervised Deep learning with AutoEncoders on the MNIST dataset (with Tensorflow in Python)

• Deep learning,  although primarily used for supervised classification / regression problems, can also be used as an unsupervised ML technique, the autoencoder being a classic example. As explained here,  the aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of  dimensionality reduction.
• Deep learning can be used to learn a different representation (typically a set of input features in a  low-dimensional space) of the data that can be used for pre-training for example in transfer-learning.
• In this article, a few vanilla autoencoder implementations will be demonstrated for the mnist dataset.
• The following figures describe the theory (ref: Coursera course Neural Networks for Machine Learning, 2012, by Prof. Hinton, University of Toronto). As explained, an autoencoder trained with back-propagation can be used to generalize linear dimension-reduction techniques such as PCA, since the hidden layers can learn non-linear manifolds with non-linear activation functions (e.g., relu, sigmoid).

• The input and output units of an autoencoder are identical; the idea is to learn the input itself as a different representation with one or more hidden layers.
• The mnist images are of size 28×28, so the number of nodes in the input and the output layer are always 784 for the autoencoders shown in this article.
• The left side of an auto-encoder network is typically a mirror image of the right side and the weights are tied (weights learnt in the left hand side of the network are reused, to better reproduce the input at the output).
• The next figures and animations show the outputs for the following simple autoencoder with just 1 hidden layer, with the input mnist data. A Relu activation function is used in the hidden layer. It also uses L2 regularization on the weights learnt. As can be seen from the next figure, the inputs are approximately reproduced at the output layer, with some variation, as expected.

• The next animations visualize the hidden layer weights learnt (for the 400 hidden units) and the output of the autoencoder with the same input training dataset, with a different value of the regularization parameter.

• The next figure visualizes the hidden layer weights learnt with yet another regularization parameter value.

• The next animation visualizes the output of the autoencoder with the same input training dataset, but this time with no activation function used at the hidden layer.

• The next animations show the results with a deeper autoencoder with 3 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.

• Let’s implement an even deeper autoencoder. The next animations show the results with an autoencoder with 5 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.
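
The single-hidden-layer autoencoder with tied weights described earlier in this section can be sketched with NumPy as a forward pass plus an L2-regularized reconstruction loss. This is a hedged illustration, not the TensorFlow code behind the animations: the mean squared error as the reconstruction loss, the regularization strength, the initialization scale and the random stand-in inputs are all assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def tied_autoencoder_loss(X, W, b_hidden, b_out, lam):
    """Forward pass of a 1-hidden-layer autoencoder with tied weights.

    The decoder reuses the encoder weight matrix W (transposed), so only
    W and the two bias vectors would be learnt.
    """
    H = relu(X @ W + b_hidden)         # encoder: 784 -> 400
    X_hat = H @ W.T + b_out            # decoder: 400 -> 784, tied weights
    mse = np.mean((X - X_hat) ** 2)    # reconstruction error (assumed MSE)
    l2 = lam * np.sum(W ** 2)          # L2 penalty on the weights
    return X_hat, mse + l2

rng = np.random.default_rng(0)
X = rng.random((5, 784))               # stand-in for 5 flattened 28x28 images
W = rng.normal(scale=0.01, size=(784, 400))
X_hat, loss = tied_autoencoder_loss(X, W, np.zeros(400), np.zeros(784), lam=1e-3)
print(X_hat.shape, loss)
```

Dropping the `relu` call gives the variant with no activation function at the hidden layer.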

# Dogs vs. Cats: Image Classification with Deep Learning using TensorFlow in Python

## The problem

Given a set of labeled images of cats and dogs, a machine learning model is to be learnt and then used to classify a set of new images as cats or dogs. This problem appeared in a Kaggle competition and the images are taken from this kaggle dataset.

• The original dataset contains a huge number of images (25,000 labeled cat/dog images for training and 12,500 unlabeled images for test).
• Only a few sample images are chosen from the dataset (1100 labeled cat/dog images for training and 1000 images from the test dataset), just for a quick demonstration of how to solve this problem using deep learning (motivated by the Udacity course Deep Learning by Google). The approach and the results are described in this article.
• The sample test images chosen are manually labeled to compute model accuracy later with the model-predicted labels.
• The accuracy on the test dataset is not going to be good in general for the above-mentioned reason. In order to obtain good accuracy  on the test dataset using deep learning, we need to train the models with a large number of input images (e.g., with all the training images from the kaggle dataset).
• A few sample labeled images from the training dataset are shown below.

Dogs

Cats

• As a pre-processing step, all the images are first resized to 50×50 pixel images.

## Classification with a few off-the-shelf classifiers

• First, each image from the training dataset is flattened and represented as 2500-length vectors (one for each channel).
• Next, a few sklearn models are trained on this flattened data. Here are the results

As shown above, the test accuracy is quite poor with a few sophisticated off-the-shelf classifiers.
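
The flattening step described above can be sketched with NumPy. The array below is random stand-in data (not the cat/dog images), and the channels-last layout `(n, 50, 50, 3)` is an assumption about how the resized images are stored:

```python
import numpy as np

n_images, height, width, channels = 4, 50, 50, 3
rng = np.random.default_rng(0)
images = rng.random((n_images, height, width, channels))  # stand-in for resized images

# One 2500-length vector per channel: move the channel axis forward,
# then flatten each 50x50 plane
per_channel = images.transpose(0, 3, 1, 2).reshape(n_images, channels, height * width)

# Concatenate the channel vectors into a single feature row per image
features = per_channel.reshape(n_images, -1)
print(per_channel.shape, features.shape)  # (4, 3, 2500) (4, 7500)
```

A matrix such as `features` is the kind of input an sklearn classifier's `fit` method expects (one row per image).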

## Classifying images using Deep Learning with Tensorflow

Now let’s first train a logistic regression and then a couple of neural network models by introducing L2 regularization for both the models.

• First, all the images are converted to gray-scale images.
• The following figures visualize the weights learnt for the cat vs. the dog class during training the logistic regression  model with SGD with L2-regularization (λ=0.1, batch size=128).

Test accuracy: 53.6%
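
To make the training procedure concrete, here is a minimal NumPy sketch of mini-batch SGD for L2-regularized logistic regression. It is not the TensorFlow code used for the result above: the binary (sigmoid) formulation, the synthetic data, the learning rate and the number of steps are all assumptions chosen for illustration, while λ=0.1 and batch size 128 follow the text.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_logistic_loss(w, b, X, y, lam):
    """Mean cross-entropy plus the L2 penalty lam * ||w||^2."""
    p = sigmoid(X @ w + b)
    eps = 1e-12  # avoids log(0)
    ce = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return ce + lam * np.sum(w ** 2)

def sgd_step(w, b, X_batch, y_batch, lam, lr):
    """One gradient step on a mini-batch."""
    p = sigmoid(X_batch @ w + b)
    grad_w = X_batch.T @ (p - y_batch) / len(y_batch) + 2 * lam * w
    grad_b = np.mean(p - y_batch)
    return w - lr * grad_w, b - lr * grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 20))            # synthetic features
true_w = rng.normal(size=20)
y = (X @ true_w > 0).astype(float)        # synthetic 0/1 labels

w, b = np.zeros(20), 0.0
initial_loss = l2_logistic_loss(w, b, X, y, lam=0.1)
for step in range(200):
    idx = rng.choice(len(X), size=128, replace=False)   # batch size 128
    w, b = sgd_step(w, b, X[idx], y[idx], lam=0.1, lr=0.1)
final_loss = l2_logistic_loss(w, b, X, y, lam=0.1)
print(initial_loss, final_loss)
```

A larger λ shrinks the weights more strongly, which trades training fit for smoother weight images.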

• The following animation visualizes the weights learnt for 400 randomly selected hidden units using a neural net with a single hidden layer with 4096 hidden nodes by training the neural net model with SGD with L2-regularization (λ1=λ2=0.05, batch size=128).

Minibatch loss at step 0: 198140.156250
Minibatch accuracy: 50.0%
Validation accuracy: 50.0%

Minibatch loss at step 500: 0.542070
Minibatch accuracy: 89.8%
Validation accuracy: 57.0%

Minibatch loss at step 1000: 0.474844
Minibatch accuracy: 96.9%
Validation accuracy: 60.0%

Minibatch loss at step 1500: 0.571939
Minibatch accuracy: 85.9%
Validation accuracy: 56.0%

Minibatch loss at step 2000: 0.537061
Minibatch accuracy: 91.4%
Validation accuracy: 63.0%

Minibatch loss at step 2500: 0.751552
Minibatch accuracy: 75.8%
Validation accuracy: 57.0%

Minibatch loss at step 3000: 0.579084
Minibatch accuracy: 85.9%
Validation accuracy: 54.0%

Test accuracy: 57.8%


Clearly, the model learnt above overfits the training dataset; the test accuracy improved a bit (from 53.6% to 57.8%), but is still quite poor.

• Now, let’s train a deeper neural net with two hidden layers, the first one with 1024 nodes and the second one with 64 nodes.

Minibatch loss at step 0: 1015.947266
Minibatch accuracy: 43.0%
Validation accuracy: 50.0%

Minibatch loss at step 500: 0.734610
Minibatch accuracy: 79.7%
Validation accuracy: 55.0%

Minibatch loss at step 1000: 0.615992
Minibatch accuracy: 93.8%
Validation accuracy: 55.0%

Minibatch loss at step 1500: 0.670009
Minibatch accuracy: 82.8%
Validation accuracy: 56.0%

Minibatch loss at step 2000: 0.798796
Minibatch accuracy: 77.3%
Validation accuracy: 58.0%

Minibatch loss at step 2500: 0.717479
Minibatch accuracy: 84.4%
Validation accuracy: 55.0%

Minibatch loss at step 3000: 0.631013
Minibatch accuracy: 90.6%
Validation accuracy: 57.0%

Minibatch loss at step 3500: 0.739071
Minibatch accuracy: 75.8%
Validation accuracy: 54.0%

Minibatch loss at step 4000: 0.698650
Minibatch accuracy: 84.4%
Validation accuracy: 55.0%

Minibatch loss at step 4500: 0.666173
Minibatch accuracy: 85.2%
Validation accuracy: 51.0%

Minibatch loss at step 5000: 0.614820
Minibatch accuracy: 92.2%
Validation accuracy: 58.0%

Test accuracy: 55.2%

• The following animation visualizes the weights learnt for 400 randomly selected hidden units from the first hidden layer, by training the neural net model with SGD with L2-regularization (λ1=λ2=λ3=0.1, batch size=128, dropout rate=0.6).
• The next animation visualizes the weights learnt for all the 64 hidden units of the second hidden layer.
• Clearly, the second, deeper neural net model learnt above overfits the training dataset even more, and the test accuracy decreased a bit.

## Classifying images with Deep Convolution Network

Let’s use the conv-net shown in the next figure.

As shown above, the ConvNet uses:

• convolution layers each with
• 5×5 kernel
• 16 filters
• 1×1 stride
• Max pooling layers each with
• 2×2 kernel
• 2×2 stride
• 64 hidden nodes
• 128 batch size
• 5K iterations
• 0.7 dropout rate
• No learning rate decay

Results

Minibatch loss at step 0: 1.783917
Minibatch accuracy: 55.5%
Validation accuracy: 50.0%

Minibatch loss at step 500: 0.269719
Minibatch accuracy: 89.1%
Validation accuracy: 54.0%

Minibatch loss at step 1000: 0.045729
Minibatch accuracy: 96.9%
Validation accuracy: 61.0%

Minibatch loss at step 1500: 0.015794
Minibatch accuracy: 100.0%
Validation accuracy: 61.0%

Minibatch loss at step 2000: 0.028912
Minibatch accuracy: 98.4%
Validation accuracy: 64.0%

Minibatch loss at step 2500: 0.007787
Minibatch accuracy: 100.0%
Validation accuracy: 62.0%

Minibatch loss at step 3000: 0.001591
Minibatch accuracy: 100.0%
Validation accuracy: 63.0%

Test accuracy: 61.3%


The following animations show the features learnt (for the first 16 images for each SGD batch) at different convolution and Maxpooling layers:

• Clearly, the simple convolution neural net outperforms all the previous models in terms of test accuracy, as shown below.

• Only 1100 labeled images (randomly chosen from the training dataset) were used to train the model and predict 1000 test images (randomly chosen from the test dataset). Clearly the accuracy can be improved a lot if a large number of images are used for training with deeper / more complex networks (with more parameters to learn).

# Deep Learning with TensorFlow in Python: Convolution Neural Nets

The following problems appeared in the assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments (continued from the last post).

## Classifying the alphabets with notMNIST dataset with Deep Network

Here is what some sample images from the dataset look like:

Let’s try to get the best performance using a multi-layer model! (The best reported test accuracy using a deep network is 97.1%).

• One avenue you can explore is to add multiple layers.
• Another one is to use learning rate decay.
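
Learning rate decay, mentioned above, is commonly implemented as an exponential schedule: the rate is multiplied by a fixed factor every `decay_steps` steps. The sketch below is a plain Python illustration; the function name and the particular values (`decay_steps=1000`, `decay_rate=0.5`) are assumptions, not the settings used in the experiments that follow.

```python
def exponential_decay(initial_rate, step, decay_steps, decay_rate, staircase=False):
    """Learning rate after `step` updates: initial_rate * decay_rate ** (step / decay_steps)."""
    exponent = step / decay_steps
    if staircase:
        # decay in discrete jumps instead of continuously
        exponent = step // decay_steps
    return initial_rate * decay_rate ** exponent

# Hypothetical schedule: start at 0.1 and halve the rate every 1000 steps
for step in (0, 500, 1000, 3000):
    print(step, exponential_decay(0.1, step, decay_steps=1000, decay_rate=0.5))
```

A larger rate early on speeds up the initial progress, while the smaller rate later helps the loss settle.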

### Learning L2-Regularized  Deep Neural Network with SGD

The following figure recapitulates the neural network with 3 hidden layers, the first one with 2048 nodes, the second one with 512 nodes and the third one with 128 nodes, each one with Relu intermediate outputs. The L2 regularization parameters applied in the loss function to the weights learnt at the input and the hidden layers are λ1, λ2, λ3 and λ4, respectively.

The next 3 animations visualize the weights learnt for 400 randomly selected nodes from hidden layer 1 (out of 2048 nodes), then another 400 randomly selected nodes from hidden layer 2 (out of 512 nodes) and finally all 128 nodes from hidden layer 3, at different steps using SGD and an L2-regularized loss function (with λ1 = λ2 = λ3 = λ4 = 0.01). As can be seen below, the weights learnt gradually capture (as the SGD step increases) the different features of the alphabets at the corresponding output neurons.
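
As a rough sketch of how the four regularization terms enter the loss, the snippet below adds one L2 penalty per weight matrix to a given data loss. The factor 0.5 in front of the squared norm follows the convention of TensorFlow's `tf.nn.l2_loss`; whether the original experiments used exactly this form is an assumption, as are the random weights and the placeholder data loss.

```python
import numpy as np

def l2_regularized_loss(data_loss, weights, lambdas):
    """data_loss + sum_i lambda_i * 0.5 * ||W_i||^2 over all weight matrices."""
    penalty = sum(lam * 0.5 * np.sum(W ** 2) for W, lam in zip(weights, lambdas))
    return data_loss + penalty

# Layer sizes from the text: 784 inputs -> 2048 -> 512 -> 128 -> 10 classes
sizes = [784, 2048, 512, 128, 10]
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.01, size=(a, b)) for a, b in zip(sizes[:-1], sizes[1:])]
lambdas = [0.01, 0.01, 0.01, 0.01]

cross_entropy = 2.3  # placeholder value for the data term
print(l2_regularized_loss(cross_entropy, weights, lambdas))
```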

Results with SGD

Initialized
Validation accuracy: 27.6%
Minibatch loss at step 0: 4.638808
Minibatch accuracy: 7.8%
Validation accuracy: 27.6%

Validation accuracy: 86.3%
Minibatch loss at step 500: 1.906724
Minibatch accuracy: 86.7%
Validation accuracy: 86.3%

Validation accuracy: 86.9%
Minibatch loss at step 1000: 1.333355
Minibatch accuracy: 87.5%
Validation accuracy: 86.9%

Validation accuracy: 87.3%
Minibatch loss at step 1500: 1.056811
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%

Validation accuracy: 87.5%
Minibatch loss at step 2000: 0.633034
Minibatch accuracy: 93.8%
Validation accuracy: 87.5%

Validation accuracy: 87.5%
Minibatch loss at step 2500: 0.696114
Minibatch accuracy: 85.2%
Validation accuracy: 87.5%

Validation accuracy: 88.3%
Minibatch loss at step 3000: 0.737464
Minibatch accuracy: 86.7%
Validation accuracy: 88.3%

Test accuracy: 93.6%

The above set of experiments uses batch size = 128, 3001 iterations and a dropout rate of 0.8 on the training dataset, with learning rate decay. We can play with the hyper-parameters to get better test accuracy.

## Convolution Neural Network

Previously  we trained fully connected networks to classify notMNIST characters. The goal of this assignment is to make the neural network convolutional.

Let’s build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more expensive computationally, so we’ll limit the depth and the number of fully connected nodes. The figure below shows the simplified architecture of the convolution neural net.

As shown above, the ConvNet uses:

• 2 convolution layers each with
• 5×5 kernel
• 16 filters
• 2×2 strides
• 64 hidden nodes
• 16 batch size
• 1K iterations
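
To see how the stride-2 convolutions reduce the dimensionality, the spatial output size can be computed with the formulas TensorFlow uses for its two padding modes. The sketch assumes 28×28 input images and "SAME" padding, neither of which is stated in the list above.

```python
import math

def conv_output_size(size, kernel, stride, padding="SAME"):
    """Spatial output size of a convolution along one dimension."""
    if padding == "SAME":
        return math.ceil(size / stride)
    if padding == "VALID":
        return math.ceil((size - kernel + 1) / stride)
    raise ValueError(f"unknown padding: {padding}")

size = 28  # assumed input width/height
for _ in range(2):  # two 5x5 convolution layers with 2x2 strides
    size = conv_output_size(size, kernel=5, stride=2, padding="SAME")

flattened = size * size * 16  # 16 filters in the last convolution layer
print(size, flattened)  # 7 784
```

Under these assumptions the fully connected layer with 64 hidden nodes receives a 784-length vector per image.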

### Results

Initialized
Minibatch loss at step 0: 3.548937
Minibatch accuracy: 18.8%
Validation accuracy: 10.0%

Minibatch loss at step 50: 1.781176
Minibatch accuracy: 43.8%
Validation accuracy: 64.7%

Minibatch loss at step 100: 0.882739
Minibatch accuracy: 75.0%
Validation accuracy: 69.5%

Minibatch loss at step 150: 0.980598
Minibatch accuracy: 62.5%
Validation accuracy: 74.5%

Minibatch loss at step 200: 0.794144
Minibatch accuracy: 81.2%
Validation accuracy: 77.6%

Minibatch loss at step 250: 1.191971
Minibatch accuracy: 62.5%
Validation accuracy: 79.1%

Minibatch loss at step 300: 0.441911
Minibatch accuracy: 87.5%
Validation accuracy: 80.5%

Minibatch loss at step 350: 0.605005
Minibatch accuracy: 81.2%
Validation accuracy: 79.3%

Minibatch loss at step 400: 1.032123
Minibatch accuracy: 68.8%
Validation accuracy: 81.5%

Minibatch loss at step 450: 0.869944
Minibatch accuracy: 75.0%
Validation accuracy: 82.1%

Minibatch loss at step 500: 0.530418
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%

Minibatch loss at step 550: 0.227771
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 600: 0.697444
Minibatch accuracy: 75.0%
Validation accuracy: 82.5%

Minibatch loss at step 650: 0.862341
Minibatch accuracy: 68.8%
Validation accuracy: 83.0%

Minibatch loss at step 700: 0.336292
Minibatch accuracy: 87.5%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213392
Minibatch accuracy: 93.8%
Validation accuracy: 82.6%

Minibatch loss at step 800: 0.553639
Minibatch accuracy: 75.0%
Validation accuracy: 83.3%

Minibatch loss at step 850: 0.533049
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.415935
Minibatch accuracy: 87.5%
Validation accuracy: 83.9%

Minibatch loss at step 950: 0.290436
Minibatch accuracy: 93.8%
Validation accuracy: 84.0%

Minibatch loss at step 1000: 0.400648
Minibatch accuracy: 87.5%
Validation accuracy: 84.0%

Test accuracy: 90.3%

The following figures visualize the feature representations at different layers for the first 16 images for the last batch with SGD during training:

The next animation shows how the features learnt at convolution layer 1 change with iterations.

## Convolution Neural Network with Max Pooling

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides by a max pooling operation of stride 2 and kernel size 2. The figure below shows the simplified architecture of the convolution neural net with max pooling layers.

As shown above, the ConvNet uses:

• 2 convolution layers each with
• 5×5 kernel
• 16 filters
• 1×1 stride
• 2×2 Max-pooling
• 64 hidden nodes
• 16 batch size
• 1K iterations
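
The 2×2 max pooling with stride 2 used here keeps the largest value in each non-overlapping 2×2 block, which halves the width and the height of a feature map. A minimal NumPy sketch for a single 2-D feature map (the helper name and the toy input are illustrative only):

```python
import numpy as np

def max_pool_2x2(x):
    """2x2 max pooling with stride 2 on a 2-D feature map (odd edges are cropped)."""
    h, w = x.shape
    cropped = x[:h - h % 2, :w - w % 2]
    # Split rows and columns into blocks of 2, then take the max inside each block
    return cropped.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.arange(16).reshape(4, 4)
pooled = max_pool_2x2(feature_map)
print(pooled)
# [[ 5  7]
#  [13 15]]
```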

Results

Initialized
Minibatch loss at step 0: 4.934033
Minibatch accuracy: 6.2%
Validation accuracy: 8.9%

Minibatch loss at step 50: 2.305100
Minibatch accuracy: 6.2%
Validation accuracy: 11.7%

Minibatch loss at step 100: 2.319777
Minibatch accuracy: 0.0%
Validation accuracy: 14.8%

Minibatch loss at step 150: 2.285996
Minibatch accuracy: 18.8%
Validation accuracy: 11.5%

Minibatch loss at step 200: 1.988467
Minibatch accuracy: 25.0%
Validation accuracy: 22.9%

Minibatch loss at step 250: 2.196230
Minibatch accuracy: 12.5%
Validation accuracy: 27.8%

Minibatch loss at step 300: 0.902828
Minibatch accuracy: 68.8%
Validation accuracy: 55.4%

Minibatch loss at step 350: 1.078835
Minibatch accuracy: 62.5%
Validation accuracy: 70.1%

Minibatch loss at step 400: 1.749521
Minibatch accuracy: 62.5%
Validation accuracy: 70.3%

Minibatch loss at step 450: 0.896893
Minibatch accuracy: 75.0%
Validation accuracy: 79.5%

Minibatch loss at step 500: 0.610678
Minibatch accuracy: 81.2%
Validation accuracy: 79.5%

Minibatch loss at step 550: 0.212040
Minibatch accuracy: 93.8%
Validation accuracy: 81.0%

Minibatch loss at step 600: 0.785649
Minibatch accuracy: 75.0%
Validation accuracy: 81.8%

Minibatch loss at step 650: 0.775520
Minibatch accuracy: 68.8%
Validation accuracy: 82.2%

Minibatch loss at step 700: 0.322183
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213779
Minibatch accuracy: 100.0%
Validation accuracy: 82.9%

Minibatch loss at step 800: 0.795744
Minibatch accuracy: 62.5%
Validation accuracy: 83.7%

Minibatch loss at step 850: 0.767435
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.354712
Minibatch accuracy: 87.5%
Validation accuracy: 83.8%

Minibatch loss at step 950: 0.293992
Minibatch accuracy: 93.8%
Validation accuracy: 84.3%

Minibatch loss at step 1000: 0.384624
Minibatch accuracy: 87.5%
Validation accuracy: 84.2%

Test accuracy: 90.5%

As can be seen from the above results, with max pooling the test accuracy increased slightly (from 90.3% to 90.5%).

The following figures visualize the feature representations at different layers for the first 16 images during training with Max Pooling:

So far the convnets we have tried are small, and the accuracy obtained on the test dataset is not high enough. Next we shall make our convnet deeper to increase the test accuracy.

## Deep Convolution Neural Network with Max Pooling

Let’s try with a few convnets:

1. The following ConvNet uses:

• 2 convolution layers (with Relu) each using
• 3×3 kernel
• 16 filters
• 1×1 stride
• 2×2 Max-pooling
• all weights initialized with truncated normal distribution with sd 0.01
• single hidden layer (fully connected) with 1024 hidden nodes
• 128 batch size
• 3K iterations
• 0.01 (=λ1=λ2) for regularization
• No dropout
• No learning rate decay
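
The truncated normal initialization in the list above draws weights from a normal distribution and re-draws any value that falls more than two standard deviations from the mean, so no weight starts out unusually large. A NumPy sketch of that idea (the function name and the filter shape are illustrative assumptions; the single input channel assumes grayscale images):

```python
import numpy as np

def truncated_normal(shape, stddev, rng):
    """Normal samples with mean 0, re-drawn while they lie beyond 2 standard deviations."""
    values = rng.normal(0.0, stddev, size=shape)
    out_of_range = np.abs(values) > 2 * stddev
    while out_of_range.any():
        values[out_of_range] = rng.normal(0.0, stddev, size=out_of_range.sum())
        out_of_range = np.abs(values) > 2 * stddev
    return values

rng = np.random.default_rng(0)
# 3x3 kernel, 1 input channel (assumed), 16 filters, sd 0.01 as in the list above
conv_weights = truncated_normal((3, 3, 1, 16), stddev=0.01, rng=rng)
print(conv_weights.shape, np.abs(conv_weights).max())
```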

Results

Minibatch loss at step 0: 2.662903
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%

Minibatch loss at step 500: 2.493813
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%

Minibatch loss at step 1000: 0.848911
Minibatch accuracy: 82.8%
Validation accuracy: 79.6%

Minibatch loss at step 1500: 0.806191
Minibatch accuracy: 79.7%
Validation accuracy: 81.8%

Minibatch loss at step 2000: 0.617905
Minibatch accuracy: 85.9%
Validation accuracy: 84.5%

Minibatch loss at step 2500: 0.594710
Minibatch accuracy: 83.6%
Validation accuracy: 85.7%

Minibatch loss at step 3000: 0.435352
Minibatch accuracy: 91.4%
Validation accuracy: 87.2%

Test accuracy: 93.4%

As we can see, by introducing a couple of convolution layers, the accuracy increased from 90% (refer to the earlier blog) to 93.4% under the same settings.

Here is how the hidden layer weights (400 out of 1024, chosen randomly) change; the features don’t clearly resemble the alphabets anymore, which is to be expected.

2. The following ConvNet uses:

• 2 convolution layers (with Relu) each using
• 3×3 kernel
• 32 filters
• 1×1 stride
• 2×2 Max-pooling
• all weights initialized with truncated normal distribution with sd 0.1
• 2 hidden layers (fully connected), both with 256 hidden nodes
• 128 batch size
• 6K iterations
• 0.7 dropout
• learning rate decay starting with 0.1

Results

Minibatch loss at step 0: 9.452210
Minibatch accuracy: 10.2%
Validation accuracy: 9.7%
Minibatch loss at step 500: 0.611396
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%
Minibatch loss at step 1000: 0.442578
Minibatch accuracy: 85.9%
Validation accuracy: 83.3%
Minibatch loss at step 1500: 0.523506
Minibatch accuracy: 83.6%
Validation accuracy: 84.8%
Minibatch loss at step 2000: 0.411259
Minibatch accuracy: 89.8%
Validation accuracy: 85.8%
Minibatch loss at step 2500: 0.507267
Minibatch accuracy: 82.8%
Validation accuracy: 85.9%
Minibatch loss at step 3000: 0.414740
Minibatch accuracy: 89.1%
Validation accuracy: 86.6%
Minibatch loss at step 3500: 0.432177
Minibatch accuracy: 85.2%
Validation accuracy: 87.0%
Minibatch loss at step 4000: 0.501300
Minibatch accuracy: 85.2%
Validation accuracy: 87.1%
Minibatch loss at step 4500: 0.391587
Minibatch accuracy: 89.8%
Validation accuracy: 87.7%
Minibatch loss at step 5000: 0.347674
Minibatch accuracy: 90.6%
Validation accuracy: 88.1%
Minibatch loss at step 5500: 0.259942
Minibatch accuracy: 91.4%
Validation accuracy: 87.8%
Minibatch loss at step 6000: 0.392562
Minibatch accuracy: 85.9%
Validation accuracy: 88.4%

Test accuracy: 94.6%

3. The following ConvNet uses:

• 3 convolution layers (with Relu) each using
• 5×5 kernel
• with 16, 32 and 64 filters, respectively
• 1×1 stride
• 2×2 Max-pooling
• all weights initialized with truncated normal distribution with sd 0.1
• hidden layers (fully connected) with 256, 128 and 64 hidden nodes respectively
• 128 batch size
• 10K iterations
• 0.7 dropout
• learning rate decay starting with 0.1

Results

Minibatch loss at step 0: 6.788681
Minibatch accuracy: 12.5%
Validation accuracy: 9.8%
Minibatch loss at step 500: 0.804718
Minibatch accuracy: 75.8%
Validation accuracy: 74.9%
Minibatch loss at step 1000: 0.464696
Minibatch accuracy: 86.7%
Validation accuracy: 82.8%
Minibatch loss at step 1500: 0.684611
Minibatch accuracy: 80.5%
Validation accuracy: 85.2%
Minibatch loss at step 2000: 0.352865
Minibatch accuracy: 91.4%
Validation accuracy: 85.9%
Minibatch loss at step 2500: 0.505062
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%
Minibatch loss at step 3000: 0.352783
Minibatch accuracy: 87.5%
Validation accuracy: 87.0%
Minibatch loss at step 3500: 0.411505
Minibatch accuracy: 88.3%
Validation accuracy: 87.9%
Minibatch loss at step 4000: 0.457463
Minibatch accuracy: 84.4%
Validation accuracy: 88.1%
Minibatch loss at step 4500: 0.369346
Minibatch accuracy: 89.8%
Validation accuracy: 88.7%
Minibatch loss at step 5000: 0.323142
Minibatch accuracy: 89.8%
Validation accuracy: 88.5%
Minibatch loss at step 5500: 0.245018
Minibatch accuracy: 93.8%
Validation accuracy: 89.0%
Minibatch loss at step 6000: 0.480509
Minibatch accuracy: 85.9%
Validation accuracy: 89.2%
Minibatch loss at step 6500: 0.297886
Minibatch accuracy: 92.2%
Validation accuracy: 89.3%
Minibatch loss at step 7000: 0.309768
Minibatch accuracy: 90.6%
Validation accuracy: 89.3%
Minibatch loss at step 7500: 0.280219
Minibatch accuracy: 92.2%
Validation accuracy: 89.5%
Minibatch loss at step 8000: 0.260540
Minibatch accuracy: 93.8%
Validation accuracy: 89.7%
Minibatch loss at step 8500: 0.345161
Minibatch accuracy: 88.3%
Validation accuracy: 89.6%
Minibatch loss at step 9000: 0.343074
Minibatch accuracy: 87.5%
Validation accuracy: 89.8%
Minibatch loss at step 9500: 0.324757
Minibatch accuracy: 92.2%
Validation accuracy: 89.9%
Minibatch loss at step 10000: 0.513597
Minibatch accuracy: 83.6%
Validation accuracy: 90.0%

Test accuracy: 95.5%

To be continued…