Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the Coursera course Convolutional Neural Networks, which is part of the Deep Learning Specialization (taught by Prof. Andrew Ng. of Stanford and deeplearning.ai; the lecture videos corresponding to the YOLO algorithm can be found here). The problem description is taken straight from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

Some Theory

Let’s first clarify the concepts of classification, localization and detection, and see how the object detection problem can be transformed into a supervised machine learning problem and subsequently solved using a deep convolutional neural network. As can be seen from the next figure,

  • Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether or not there is a car in the image), but also finding a bounding box around the object, if one is found.
  • Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
  • The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
  • A typical output data vector will contain 8 entries for a 4-class classification (3 object classes plus background), as shown in the next figure: the first entry indicates whether or not an object from any of the 3 object classes is present in the image. If one is present, the next 4 entries define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. If none of the objects is present, the first entry will be 0 and the others will be ignored.

 

f1.png

  • Now, moving from localization to detection, one can proceed in two steps, as shown in the next figure: first use small, tightly cropped images to train a convolutional neural network for image classification, then run sliding windows of different sizes (smaller to larger) sequentially over the entire test image and classify the patch inside each window with the learnt convnet. This, however, is infeasibly slow computationally.
  • However, as shown in the next figure, the convolutional implementation of sliding windows, obtained by replacing the fully-connected layers with convolutional (1×1) filters, makes it possible to classify the image patches inside all possible sliding windows simultaneously in one forward pass, which is much more efficient computationally (a small sketch of this idea follows the figure below).

 

f2.png
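
To make this idea concrete, here is a small toy sketch (not part of the assignment): a classifier whose “fully-connected” layers are written as convolutions produces a single prediction on a small crop, and a whole grid of sliding-window predictions when fed a larger image. In a real pipeline the weights learnt on the small crops would simply be reused for the larger input; the layer sizes below are made up purely for illustration.

from keras.layers import Input, Conv2D, MaxPooling2D
from keras.models import Model

def build_conv_classifier(h, w, n_classes=4):
    # A toy convnet in which the former fully-connected layers are 5x5 and 1x1 convolutions
    X_in = Input(shape=(h, w, 3))
    X = Conv2D(16, (5, 5), activation='relu')(X_in)           # feature extractor
    X = MaxPooling2D((2, 2))(X)
    X = Conv2D(400, (5, 5), activation='relu')(X)             # plays the role of Dense(400)
    X = Conv2D(400, (1, 1), activation='relu')(X)             # plays the role of a second Dense(400)
    X = Conv2D(n_classes, (1, 1), activation='softmax')(X)    # per-location class probabilities
    return Model(X_in, X)

small = build_conv_classifier(14, 14)   # output shape (None, 1, 1, 4): one prediction for the crop
large = build_conv_classifier(28, 28)   # output shape (None, 8, 8, 4): a grid of sliding-window predictions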

  • Convolutional sliding windows, although computationally much more efficient, still have the problem of producing accurate bounding boxes, since the true object boxes rarely align with the sliding windows and object shapes also vary.
  • The YOLO algorithm overcomes this limitation by dividing a training image into grid cells and assigning an object to a grid cell if and only if the center of the object falls inside that cell; that way each object in a training image gets assigned to exactly one grid cell, and the corresponding bounding box is represented by coordinates relative to the grid cell. The next figure describes the details of the algorithm.
  • In the test images, multiple adjacent grid cells may each claim that an object belongs to them. To resolve this, the IoU (intersection over union) measure is used to quantify the overlap between competing boxes, and the non-maximum-suppression algorithm keeps the bounding box with the highest confidence of containing the object and discards the other, overlapping low-confidence boxes.
  • Still, there is the problem of multiple objects falling inside the same grid cell. Multiple anchor boxes (of different shapes) are used to resolve this, each anchor box of a particular shape eventually specializing in detecting objects of a similar shape.

 

f3.png

The following figure shows slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection at CVPR 2016, summarizing the algorithm:

yolo_cvpr.png

Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank drive.ai for providing this dataset! Drive.ai is a company building the brains of self-driving vehicles.

driveai.png

We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

Definition of a box
box_label.png

 

If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.
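
As a tiny illustration (not from the assignment code), switching between the two representations is straightforward; here the vector index is taken to be 0-based:

import numpy as np

num_classes = 80
c = 2                                  # integer class label (0-based here, purely for illustration)
c_onehot = np.zeros(num_classes)       # 80-dimensional one-hot representation
c_onehot[c] = 1
c_back = int(np.argmax(c_onehot))      # recovers 2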

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.

 

YOLO

YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

Model details

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3).
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO

architecture.png

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two dimensions

flatten.png

 

Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

Find the class detected by each box

probability_extraction.png

Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

proba_map.png

Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

anchor_map.png

Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. We’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
  • Select only one box when several boxes overlap with each other and detect the same object.

 

Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

  • box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
  • boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
  • box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

  • Compute box scores by doing the element-wise product as described in the above figure.
  • For each box, find:
    • the index of the class with the maximum box score.
    • the corresponding box score.
  • Create a mask by using a threshold.  The mask should be True for the boxes we want to keep.
  • Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
    We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body

 


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
 """Filters YOLO boxes by thresholding on object and class confidence.

 Arguments:
 box_confidence -- tensor of shape (19, 19, 5, 1)
 boxes -- tensor of shape (19, 19, 5, 4)
 box_class_probs -- tensor of shape (19, 19, 5, 80)
 threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box

 Returns:
 scores -- tensor of shape (None,), containing the class probability score for the selected boxes
 boxes -- tensor of shape (None, 4), containing (bx, by, bh, bw) coordinates of the selected boxes
 classes -- tensor of shape (None,), containing the index of the class detected by the selected boxes
 """

 # Step 1: Compute box scores
 # Step 2: Find the box_classes using the max box_scores, keep track of the corresponding score
 # Step 3: Create a filtering mask based on "box_class_scores" by using "threshold"
 # Step 4: Apply the mask to scores, boxes and classes

 return scores, boxes, classes
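
One possible way to fill in this skeleton (a sketch using the Keras backend together with TensorFlow’s boolean_mask; it may differ in small details from the course’s reference solution):

def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold=.6):
    # Step 1: Compute box scores: p_c times the per-class probabilities
    box_scores = box_confidence * box_class_probs                 # shape (19, 19, 5, 80)

    # Step 2: For each box, the index of the best class and the corresponding score
    box_classes = K.argmax(box_scores, axis=-1)                   # shape (19, 19, 5)
    box_class_scores = K.max(box_scores, axis=-1)                 # shape (19, 19, 5)

    # Step 3: Mask that is True for boxes whose best class score passes the threshold
    filtering_mask = box_class_scores >= threshold

    # Step 4: Apply the mask to scores, boxes and classes
    scores = tf.boolean_mask(box_class_scores, filtering_mask)
    boxes = tf.boolean_mask(boxes, filtering_mask)
    classes = tf.boolean_mask(box_classes, filtering_mask)

    return scores, boxes, classes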

 

Non-max suppression

Even after filtering by thresholding over the class scores, we still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

non-max-suppression.png

In this example, the model has predicted 3 cars, but it’s actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”

iou.png

 

Exercise: Implement iou(). Some hints:

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
  • We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    xi1 = maximum of the x1 coordinates of the two boxes
    yi1 = maximum of the y1 coordinates of the two boxes
    xi2 = minimum of the x2 coordinates of the two boxes
    yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.


def iou(box1, box2):
 """Implement the intersection over union (IoU) between box1 and box2

 Arguments:
 box1 -- first box, list object with coordinates (x1, y1, x2, y2)
 box2 -- second box, list object with coordinates (x1, y1, x2, y2)
 """

# Calculate the (xi1, yi1, xi2, yi2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou
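
Following the hints above, a straightforward implementation (a sketch; it assumes each box is given as (x1, y1, x2, y2) with x1 < x2 and y1 < y2):

def iou(box1, box2):
    # Coordinates and area of the intersection rectangle (zero if the boxes do not overlap)
    xi1 = max(box1[0], box2[0])
    yi1 = max(box1[1], box2[1])
    xi2 = min(box1[2], box2[2])
    yi2 = min(box1[3], box2[3])
    inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)

    # Union(A, B) = A + B - Inter(A, B)
    box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
    box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
    union_area = box1_area + box2_area - inter_area

    return inter_area / union_area

print(iou([2, 1, 4, 3], [1, 2, 3, 4]))   # two 2x2 boxes with a 1x1 overlap: 1/7 ≈ 0.1429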

 

We are now ready to implement non-max suppression. The key steps are:

  • Select the box that has the highest score.
  • Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  • Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so we don’t actually need to use our iou() implementation):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
 """
 Applies Non-max suppression (NMS) to set of boxes

 Arguments:
 scores -- tensor of shape (None,), output of yolo_filter_boxes()
 boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
 classes -- tensor of shape (None,), output of yolo_filter_boxes()
 max_boxes -- integer, maximum number of predicted boxes you'd like
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 Returns:
 scores -- tensor of shape (, None), predicted score for each box
 boxes -- tensor of shape (4, None), predicted box coordinates
 classes -- tensor of shape (, None), predicted class for each box

 Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
 function will transpose the shapes of scores, boxes, classes. This is made for convenience.
 """

 max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
 K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

 # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

 # Use K.gather() to select only nms_indices from scores, boxes and classes

 return scores, boxes, classes
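
The two built-ins referred to above are tf.image.non_max_suppression() and K.gather(); one way of using them to complete the skeleton (a sketch):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes=10, iou_threshold=0.5):
    max_boxes_tensor = K.variable(max_boxes, dtype='int32')            # tensor used by tf.image.non_max_suppression()
    K.get_session().run(tf.variables_initializer([max_boxes_tensor]))  # initialize variable max_boxes_tensor

    # Indices of the boxes kept by non-max suppression
    nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor, iou_threshold)

    # Keep only the selected scores, boxes and classes
    scores = K.gather(scores, nms_indices)
    boxes = K.gather(boxes, nms_indices)
    classes = K.gather(classes, nms_indices)

    return scores, boxes, classes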

 

Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing the model on images of a different size – for example, the car detection dataset has 720×1280 images – this step rescales the boxes so that they can be plotted on top of the original 720×1280 image.


def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
 """
 Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

 Arguments:
 yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
 box_confidence: tensor of shape (None, 19, 19, 5, 1)
 box_xy: tensor of shape (None, 19, 19, 5, 2)
 box_wh: tensor of shape (None, 19, 19, 5, 2)
 box_class_probs: tensor of shape (None, 19, 19, 5, 80)
 image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
 max_boxes -- integer, maximum number of predicted boxes you'd like
 score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 Returns:
 scores -- tensor of shape (None, ), predicted score for each box
 boxes -- tensor of shape (None, 4), predicted box coordinates
 classes -- tensor of shape (None,), predicted class for each box
 """

 # Retrieve outputs of the YOLO model

 # Convert boxes to be ready for filtering functions 

 # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

 # Scale boxes back to original image shape.

 # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold 

 return scores, boxes, classes
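
Putting the pieces together, a possible yolo_eval (a sketch that only wires up the helper functions defined above and the provided yolo_boxes_to_corners and scale_boxes):

def yolo_eval(yolo_outputs, image_shape=(720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
    # Retrieve the outputs of the YOLO model
    box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs

    # Convert (x, y, w, h) box coordinates to corner coordinates
    boxes = yolo_boxes_to_corners(box_xy, box_wh)

    # Score-filtering with a threshold of score_threshold
    scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs, score_threshold)

    # Scale boxes back to the original image shape
    boxes = scale_boxes(boxes, image_shape)

    # Non-max suppression with a threshold of iou_threshold
    scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes, max_boxes, iou_threshold)

    return scores, boxes, classes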

 

Summary for YOLO:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    • Each cell in a 19×19 grid over the input image gives 425 numbers.
    • 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
    • 85 = 5 + 80, where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and 80 is the number of classes we’d like to detect.
  • We then select only a few boxes based on:
    • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
    • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
  • This gives us YOLO’s final output.

 

Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to run our graph. Run the following cell.

sess = K.get_session()

Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes("coco_classes.txt")
anchors = read_anchors("yolo_anchors.txt")
image_shape = (720., 1280.)

 

Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model("yolo.h5")

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.

yolo_model.summary()

____________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
===========================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
____________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
____________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
____________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
____________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
____________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
____________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
____________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
____________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
____________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
____________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
____________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
____________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
____________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
____________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
____________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
____________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
____________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
____________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
____________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
____________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
____________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
____________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
____________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
____________________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
____________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
____________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
____________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
____________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
____________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
____________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
____________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
____________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
____________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
____________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
____________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
____________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
____________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
____________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
____________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
____________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
____________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
____________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
____________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
____________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
____________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
____________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
____________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
____________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
____________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
____________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
____________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
____________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
____________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
____________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
____________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
____________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
____________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
____________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
____________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
____________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
____________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
____________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
____________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
____________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
____________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
____________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
____________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
____________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
____________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
____________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
____________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
===========================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
____________________________________________________________________________________________

model.png

Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We have added yolo_outputs to our graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Let’s now call yolo_eval, which we implemented previously, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Run the graph on an image

Let the fun begin. We have created a graph (in sess) that can be summarized as follows:

  1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
  2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
  3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

which outputs:

  • image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
  • image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.


def predict(sess, image_file):
    """
    Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.

    Arguments:
    sess -- our tensorflow/Keras session containing the YOLO graph
    image_file -- name of an image stored in the "images" folder.

    Returns:
    out_scores -- tensor of shape (None, ), scores of the predicted boxes
    out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
    out_classes -- tensor of shape (None, ), class index of the predicted boxes

    Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
    """

    # Preprocess the image
    image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

    # Run the session to compute scores, boxes and classes; since the model uses BatchNorm,
    # we also feed K.learning_phase(): 0 to run the graph in inference mode
    out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes],
                                                  feed_dict={yolo_model.input: image_data,
                                                             K.learning_phase(): 0})

    # Print predictions info
    print('Found {} boxes for {}'.format(len(out_boxes), image_file))
    # Generate colors for drawing bounding boxes.
    colors = generate_colors(class_names)
    # Draw bounding boxes on the image file
    draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
    # Save the predicted bounding box on the image
    image.save(os.path.join("out", image_file), quality=90)
    # Display the results
    output_image = scipy.misc.imread(os.path.join("out", image_file))
    imshow(output_image)

    return out_scores, out_boxes, out_classes

Let’s run the following cell on the “test.jpg” image to verify that our function is correct.

Input: test.jpg

out_scores, out_boxes, out_classes = predict(sess, "test.jpg")

The following figure shows the output after car detection. Each bounding box has the name of the detected object at its top left, along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)

test


The following animation shows the output images with detected objects (cars) using YOLO for a set of input images.

cars.gif



What we should remember:

  • YOLO is a state-of-the-art object detection model that is fast and accurate.
  • It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
  • The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
  • We filter through all the boxes using non-max suppression. Specifically:
    • Score-thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
    • Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
  • Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise.

 


References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

  1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
  2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
  3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
  4. The official YOLO website.

Car detection dataset: Creative Commons License.

The Drive.ai Sample Dataset (provided by drive.ai) is licensed under a Creative Commons Attribution 4.0 International License.


Hand-Gesture Classification using Deep Convolutional and Residual Neural Networks (ResNet-50) with TensorFlow / Keras in Python

In this article, an application of a convolutional network to classify a set of hand-sign images is discussed first. Later, the accuracy of this classifier will be improved using a deep ResNet. These problems appeared as assignments in the Coursera course Convolutional Neural Networks (a part of the Deep Learning Specialization) taught by Prof. Andrew Ng. (deeplearning.ai). The problem descriptions are taken straight from the course itself.

1. Hand-gesture Classification with Convolution Neural Network

In this assignment, the following tasks are going to be accomplished:

  • Implement a fully functioning ConvNet using TensorFlow.
  • Build and train a ConvNet in TensorFlow for a classification problem

This assignment is going to be done using tensorflow.

First the necessary packages are loaded:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import tensorflow as tf
from tensorflow.python.framework import ops
from cnn_utils import *

%matplotlib inline
np.random.seed(1)

Next the “SIGNS” dataset is loaded that we are going to use. The SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5, as shown in the next figure. The output classes are shown with one hot encoding.

# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
signs_data_kiank.png

The next figures show a few randomly sampled images for each class label from the training dataset. There are 180 images for each class and a total of 1080 images in the training dataset.

number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

The following steps are to be executed to train a conv-net model with TensorFlow using the training dataset and then classify the images from the test dataset using the model.

Create placeholders

TensorFlow requires that we create placeholders for the input data that will be fed into the model when running the session.

Let’s implement the function below to create placeholders for the input image X and the output Y. We should not define the number of training examples for the moment. To do so, we can use “None” as the batch size; it will give us the flexibility to choose it later. Hence X should be of dimension [None, n_H0, n_W0, n_C0] and Y should be of dimension [None, n_y].

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    """
    Creates the placeholders for the tensorflow session.
    
    Arguments:
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes
        
    Returns:
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"
    """

    X = tf.placeholder(tf.float32, shape=(None, n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None, n_y))
    
    return X, Y
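
A quick sanity check of the placeholders (just an illustration; the shapes correspond to the 64×64×3 SIGNS images and the 6 classes):

X, Y = create_placeholders(64, 64, 3, 6)
print("X = " + str(X))   # placeholder tensor of shape (?, 64, 64, 3)
print("Y = " + str(Y))   # placeholder tensor of shape (?, 6)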

 

Initialize parameters

Let’s initialize the weights/filters W1 and W2 using xavier_initializer.
We don’t need to worry about bias variables, as we will soon see that the TensorFlow functions take care of the bias. Note also that we will only initialize the weights/filters for the conv2d functions; TensorFlow initializes the layers for the fully connected part automatically.

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """
    
    tf.set_random_seed(1)                              # so that the "random" numbers are reproducible
        
    W1 = tf.get_variable("W1", (4, 4, 3, 8), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 
    W2 = tf.get_variable("W2", (2, 2, 8, 16), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 

    parameters = {"W1": W1,
                  "W2": W2}
    
    return parameters

Forward propagation

Next we need to implement the forward_propagation function below to build the following model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED.

We need to use the following built-in tensorflow functions:

  • tf.nn.conv2d(X, W1, strides = [1,s,s,1], padding = ‘SAME’): given an input X and a group of filters W1, this function convolves W1’s filters on X. The third input ([1,s,s,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev). You can read the full documentation here
  • tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = ‘SAME’): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. You can read the full documentation here
  • tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape). You can read the full documentation here.
  • tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector while maintaining the batch-size. It returns a flattened tensor with shape [batch_size, k]. You can read the full documentation here.
  • tf.contrib.layers.fully_connected(F, num_outputs): given the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation here.

In detail, we will use the following parameters for all the steps:

 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
 - Flatten the previous output.
 - FULLYCONNECTED (FC) layer: Apply a fully connected layer without an non-linear activation function. Do not call the softmax here. This will result in 6 neurons in the output layer, which then get passed later to a softmax. In TensorFlow, the softmax and cost function are lumped together into a single function, which you'll call in a different function when computing the cost. 
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """    

Compute cost

The next step is to implement the compute_cost function using the following tensorflow functions:

  • tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax cross-entropy loss. This function both computes the softmax activation function as well as the resulting loss. You can check the full documentation here.
  • tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to average the losses over all the examples to get the overall cost. You can check the full documentation here.
def compute_cost(Z3, Y):
    """
    Computes the cost
    
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    
    Returns:
    cost - Tensor of the cost function
    """

Model

Finally we need to merge the helper functions we implemented above to build a model and train it on the SIGNS dataset.

The model should:

  • create placeholders
  • initialize parameters
  • forward propagate
  • compute the cost
  • create an optimizer

Finally, we need to create a session and run a for loop over num_epochs, get the mini-batches, and then optimize the cost function for each mini-batch (a complete sketch of this function is given after the skeleton below).

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    """
    Implements a three-layer ConvNet in Tensorflow:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- training set labels, of shape (None, n_y = 6)
    X_test -- test set, of shape (None, 64, 64, 3)
    Y_test -- test set labels, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    
    Returns:
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

Then let’s train the model for 100 epochs.

_, _, parameters = model(X_train, Y_train, X_test, Y_test)
 with the following output:
Cost after epoch 0: 1.918487
Cost after epoch 5: 1.875008
Cost after epoch 10: 1.813409
Cost after epoch 15: 1.667654
Cost after epoch 20: 1.444399
Cost after epoch 25: 1.203926
Cost after epoch 30: 1.028009
Cost after epoch 35: 0.887578
Cost after epoch 40: 0.791803
Cost after epoch 45: 0.712319
Cost after epoch 50: 0.655244
Cost after epoch 55: 0.597494
Cost after epoch 60: 0.556236
Cost after epoch 65: 0.525260
Cost after epoch 70: 0.484548
Cost after epoch 75: 0.477365
Cost after epoch 80: 0.451908
Cost after epoch 85: 0.415393
Cost after epoch 90: 0.386501
Cost after epoch 95: 0.373167

gr

Tensor(“Mean_1:0”, shape=(), dtype=float32)
Train Accuracy: 0.894444
Test Accuracy: 0.841667

 

2. Improving the Accuracy of the Hand-Gesture Classifier with Residual Networks

Now we shall learn how to build very deep convolutional networks, using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions; but in practice, they are hard to train. Residual Networks, introduced by He et al., allow us to train much deeper networks than was previously practically feasible.

In this assignment, the following tasks we are going to accomplish:

  • Implement the basic building blocks of ResNets.
  • Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

This assignment will be done in Keras.

Let’s first load the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot_ng as pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline
import keras.backend as K
K.set_image_data_format('channels_last')
K.set_learning_phase(1)

The problem of very deep neural networks

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow.

During training, we might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:

vanishing_grad_kiank.png

We are now going to solve this problem by building a Residual Network!

Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly back-propagated to earlier layers:

skip_connection_kiank.png

The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, we can form a very deep network.

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are the same or different. We are going to implement both of them.

1 – The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

idblock2_kiank.png

The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training, we have also added a BatchNorm step.

In this exercise, we’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:

idblock3_kiank.png

Here’re the individual steps.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

  • The shortcut and the input are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Now let’s implement the ResNet identity block.

  • To implement the Conv2D step: See reference
  • To implement BatchNorm: See reference (axis: Integer, the axis that should be normalized (typically the channels axis))
  • For the activation, use: Activation('relu')(X)
  • To add the value passed forward by the shortcut: See reference
def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 3
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    
    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value. We'll need this later to add back to the main path.
    X_shortcut = X
    
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    # Second component of main path
    X = Conv2D(filters = F2, kernel_size = (f, f), strides = (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path (no ReLU in this component)
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

2 – The convolutional block

Next, the ResNet “convolutional block” is the other type of block. We can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:

convblock_kiank.png

The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. For example, to reduce the height and width of the activations by a factor of 2, we can use a 1×1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

The details of the convolutional block are as follows.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

  • The CONV2D has F3 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
  • The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

  • The shortcut and the main path values are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Let’s now implement the convolutional block.

def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used
    
    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """
    
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path 
    X = Conv2D(F1, (1, 1), strides = (s,s), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    
    # Second component of main path
    X = Conv2D(F2, (f, f), strides = (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path
    X = Conv2D(F3, (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    ##### SHORTCUT PATH #####
    X_shortcut = Conv2D(F3, (1, 1), strides = (s,s), padding = 'valid', name = conv_name_base + '1', kernel_initializer = glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis = 3, name = bn_name_base + '1')(X_shortcut)

    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

 

3 – Building our first ResNet model (50 layers)

We now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means we should stack 3 identity blocks together.

resnet_kiank.png

The details of this ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
    • BatchNorm is applied to the channels axis of the input.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three sets of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
  • Stage 3:
    • The convolutional block uses three sets of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
    • The 3 identity blocks use three sets of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
  • Stage 4:
    • The convolutional block uses three sets of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
    • The 5 identity blocks use three sets of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
  • Stage 5:
    • The convolutional block uses three sets of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
  • The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
  • The flatten doesn’t have any hyperparameters or name.
  • The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Let’s implement the ResNet with 50 layers described in the figure above.

The implementation below uses the identity_block and convolutional_block functions defined above, together with the standard Keras layers imported earlier (ZeroPadding2D, Conv2D, BatchNormalization, MaxPooling2D, AveragePooling2D, Flatten, Dense) and the Keras Model class.

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)
    
    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # Stage 3
    X = convolutional_block(X, f = 3, filters = [128, 128, 512], stage = 3, block='a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='b')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='c')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='d')

    # Stage 4
    X = convolutional_block(X, f = 3, filters = [256, 256, 1024], stage = 4, block='a', s = 2)
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='b')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='c')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='d')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='e')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='f')

    # Stage 5
    X = convolutional_block(X, f = 3, filters = [512, 512, 2048], stage = 5, block='a', s = 2)
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='b')
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='c')

    # AVGPOOL
    X = AveragePooling2D((2, 2), name='avg_pool')(X)

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
        
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

Next, let’s build the model’s graph. We have 6 output classes for the hand-signs dataset.

model = ResNet50(input_shape = (64, 64, 3), classes = 6)

We need to configure the learning process by compiling the model.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The model is now ready to be trained; the only thing we still need is the hand-signs dataset, so let's load it.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

Now let's train our ResNet model for 20 epochs with a batch size of 32.

model.fit(X_train, Y_train, epochs = 20, batch_size = 32)
Epoch 1/20
1080/1080 [==============================] - 173s - loss: 2.0610 - acc: 0.3435   
Epoch 2/20
1080/1080 [==============================] - 149s - loss: 1.8561 - acc: 0.4259   
Epoch 3/20
1080/1080 [==============================] - 147s - loss: 2.0284 - acc: 0.4343   
Epoch 4/20
1080/1080 [==============================] - 151s - loss: 1.7140 - acc: 0.4500   
Epoch 5/20
1080/1080 [==============================] - 134s - loss: 1.4401 - acc: 0.5676   
Epoch 6/20
1080/1080 [==============================] - 128s - loss: 1.1950 - acc: 0.6481   
Epoch 7/20
1080/1080 [==============================] - 129s - loss: 0.9886 - acc: 0.7426   
Epoch 8/20
1080/1080 [==============================] - 133s - loss: 1.2155 - acc: 0.6843   
Epoch 9/20
1080/1080 [==============================] - 131s - loss: 0.8536 - acc: 0.8185   
Epoch 10/20
1080/1080 [==============================] - 132s - loss: 0.9502 - acc: 0.7565   
Epoch 11/20
1080/1080 [==============================] - 129s - loss: 0.8180 - acc: 0.8111   
Epoch 12/20
1080/1080 [==============================] - 130s - loss: 0.7060 - acc: 0.8343   
Epoch 13/20
1080/1080 [==============================] - 130s - loss: 0.8687 - acc: 0.8148   
Epoch 14/20
1080/1080 [==============================] - 130s - loss: 0.8282 - acc: 0.8509   
Epoch 15/20
1080/1080 [==============================] - 130s - loss: 0.9303 - acc: 0.7972   
Epoch 16/20
1080/1080 [==============================] - 146s - loss: 1.1211 - acc: 0.7870   
Epoch 17/20
1080/1080 [==============================] - 143s - loss: 0.9337 - acc: 0.7824   
Epoch 18/20
1080/1080 [==============================] - 150s - loss: 0.3976 - acc: 0.8870   
Epoch 19/20
1080/1080 [==============================] - 143s - loss: 0.2532 - acc: 0.9407   
Epoch 20/20
1080/1080 [==============================] - 133s - loss: 0.2528 - acc: 0.9556

Let’s see how this model performs on the test set.

preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
Loss = 0.36906948487
Test Accuracy = 0.891666662693
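Once trained, the model can also be used to classify a new hand-sign image. The snippet below is a minimal sketch of how this might be done; the file name my_image.jpg is hypothetical, and the image is resized to the 64×64×3 input shape the model expects and scaled the same way as the training data.

import numpy as np
from keras.preprocessing import image

img_path = 'my_image.jpg'                  # hypothetical test image
img = image.load_img(img_path, target_size = (64, 64))
x = image.img_to_array(img)
x = np.expand_dims(x, axis = 0) / 255.     # add a batch dimension and normalize as during training
print(model.predict(x))                    # class probabilities for the 6 hand signs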

We can also print a summary of the model by running the following code.

model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_1 (InputLayer)             (None, 64, 64, 3)     0                                            
____________________________________________________________________________________________________
zero_padding2d_1 (ZeroPadding2D) (None, 70, 70, 3)     0                                            
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, 32, 32, 64)    9472                                         
____________________________________________________________________________________________________
bn_conv1 (BatchNormalization)    (None, 32, 32, 64)    256                                          
____________________________________________________________________________________________________
activation_4 (Activation)        (None, 32, 32, 64)    0                                            
____________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch2a (Conv2D)          (None, 15, 15, 64)    4160                                         
____________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_5 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2a_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_6 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch1 (Conv2D)           (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
res2a_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2a_branch1 (BatchNormalization (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
bn2a_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_2 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_7 (Activation)        (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res2b_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
____________________________________________________________________________________________________
bn2b_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_8 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2b_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2b_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_9 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2b_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2b_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_3 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_10 (Activation)       (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res2c_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
____________________________________________________________________________________________________
bn2c_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_11 (Activation)       (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2c_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2c_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_12 (Activation)       (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2c_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2c_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_4 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_13 (Activation)       (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res3a_branch2a (Conv2D)          (None, 8, 8, 128)     32896                                        
____________________________________________________________________________________________________
bn3a_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_14 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3a_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3a_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_15 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3a_branch1 (Conv2D)           (None, 8, 8, 512)     131584                                       
____________________________________________________________________________________________________
res3a_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3a_branch1 (BatchNormalization (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
bn3a_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_5 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_16 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3b_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3b_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_17 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3b_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3b_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_18 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3b_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3b_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_6 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_19 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3c_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3c_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_20 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3c_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3c_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_21 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3c_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3c_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_7 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_22 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3d_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3d_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_23 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3d_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3d_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_24 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3d_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3d_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_8 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_25 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res4a_branch2a (Conv2D)          (None, 4, 4, 256)     131328                                       
____________________________________________________________________________________________________
bn4a_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_26 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4a_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4a_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_27 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4a_branch1 (Conv2D)           (None, 4, 4, 1024)    525312                                       
____________________________________________________________________________________________________
res4a_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4a_branch1 (BatchNormalization (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
bn4a_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_9 (Add)                      (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_28 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4b_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4b_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_29 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4b_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4b_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_30 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4b_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4b_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_10 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_31 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4c_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4c_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_32 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4c_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4c_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_33 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4c_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4c_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_11 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_34 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4d_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4d_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_35 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4d_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4d_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_36 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4d_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4d_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_12 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_37 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4e_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4e_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_38 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4e_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4e_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_39 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4e_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4e_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_13 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_40 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4f_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4f_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_41 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4f_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4f_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_42 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4f_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4f_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_14 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_43 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res5a_branch2a (Conv2D)          (None, 2, 2, 512)     524800                                       
____________________________________________________________________________________________________
bn5a_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_44 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5a_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5a_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_45 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5a_branch1 (Conv2D)           (None, 2, 2, 2048)    2099200                                      
____________________________________________________________________________________________________
res5a_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5a_branch1 (BatchNormalization (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
bn5a_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_15 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_46 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
res5b_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
____________________________________________________________________________________________________
bn5b_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_47 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5b_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5b_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_48 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5b_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5b_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_16 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_49 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
res5c_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
____________________________________________________________________________________________________
bn5c_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_50 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5c_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5c_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_51 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5c_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5c_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_17 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_52 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
avg_pool (AveragePooling2D)      (None, 1, 1, 2048)    0                                            
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 2048)          0                                            
____________________________________________________________________________________________________
fc6 (Dense)                      (None, 6)             12294                                        
====================================================================================================
Total params: 23,600,006.0
Trainable params: 23,546,886.0
Non-trainable params: 53,120.0
_____________________________

Finally, the next figure shows the visualization of our ResNet50.

archres.png

Key points

  • Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
  • The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
  • There are two main types of blocks: the identity block and the convolutional block.
  • Very deep Residual Networks are built by stacking these blocks together.

References

This article presents the ResNet algorithm due to He et al. (2015). The implementation here also took significant inspiration from, and follows the structure given in, the github repository of Francois Chollet.

Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow (using Transfer Learning with a Pre-trained VGG-19 Network) in Python

 

This problem appeared as an assignment in the online coursera course Convolutional Neural Networks by Prof Andrew Ng (deeplearning.ai). The description of the problem is taken straightaway from the assignment.

The Neural Style Transfer algorithm was created by Gatys et al. (2015); the paper can be found here.

In this assignment, we shall:

  • Implement the neural style transfer algorithm
  • Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we shall optimize a cost function to get pixel values!

Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

  1. a “content” image (C) and
  2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).

louvre_generated.png

Let’s see how we can do this.

Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low level features (at the earlier layers) and high level features (at the deeper layers). The following figure (taken from google image search results) shows what a VGG-19 convolutional neural network looks like, without the last fully-connected (FC) layers.

vg-19

We run the following code to load parameters from the pre-trained VGG-19 model serialized in a matlab file. This takes a few seconds.

model = load_vgg_model("imagenet-vgg-verydeep-19.mat")
import pprint
pprint.pprint(model)

{'avgpool1': <tf.Tensor 'AvgPool_5:0' shape=(1, 150, 200, 64) dtype=float32>,
'avgpool2': <tf.Tensor 'AvgPool_6:0' shape=(1, 75, 100, 128) dtype=float32>,
'avgpool3': <tf.Tensor 'AvgPool_7:0' shape=(1, 38, 50, 256) dtype=float32>,
'avgpool4': <tf.Tensor 'AvgPool_8:0' shape=(1, 19, 25, 512) dtype=float32>,
'avgpool5': <tf.Tensor 'AvgPool_9:0' shape=(1, 10, 13, 512) dtype=float32>,
'conv1_1': <tf.Tensor 'Relu_16:0' shape=(1, 300, 400, 64) dtype=float32>,
'conv1_2': <tf.Tensor 'Relu_17:0' shape=(1, 300, 400, 64) dtype=float32>,
'conv2_1': <tf.Tensor 'Relu_18:0' shape=(1, 150, 200, 128) dtype=float32>,
'conv2_2': <tf.Tensor 'Relu_19:0' shape=(1, 150, 200, 128) dtype=float32>,
'conv3_1': <tf.Tensor 'Relu_20:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_2': <tf.Tensor 'Relu_21:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_3': <tf.Tensor 'Relu_22:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_4': <tf.Tensor 'Relu_23:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv4_1': <tf.Tensor 'Relu_24:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_2': <tf.Tensor 'Relu_25:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_3': <tf.Tensor 'Relu_26:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_4': <tf.Tensor 'Relu_27:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv5_1': <tf.Tensor 'Relu_28:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_2': <tf.Tensor 'Relu_29:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_3': <tf.Tensor 'Relu_30:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_4': <tf.Tensor 'Relu_31:0' shape=(1, 19, 25, 512) dtype=float32>,
'input': <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.

louvre_small.jpg

For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.

1

3

2

4

5

6

7.png

 

How do we ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.

8.png

First we need to compute the “content cost” using TensorFlow.

  • The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
  • When we minimize the content cost later, this will help make sure G has similar content as C.

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost
    
    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G
    
    Returns:
    J_content -- scalar that we need to compute using equation 1 above.
    """
    
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    # Reshape a_C and a_G
    a_C_unrolled = tf.reshape(tf.transpose(a_C), (m, n_H * n_W, n_C))
    a_G_unrolled = tf.reshape(tf.transpose(a_G), (m, n_H * n_W, n_C))
    
    # compute the cost with tensorflow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2 / (4. * n_H * n_W * n_C))
    
    return J_content
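As a quick sanity check (a sketch using the TensorFlow 1.x graph API, with small random tensors standing in for the real activations), the content cost can be evaluated as follows:

import tensorflow as tf

tf.reset_default_graph()
with tf.Session() as test:
    tf.set_random_seed(1)
    # random stand-ins for the chosen layer's activations of C and G
    a_C = tf.random_normal([1, 4, 4, 3], mean = 1, stddev = 4)
    a_G = tf.random_normal([1, 4, 4, 3], mean = 1, stddev = 4)
    J_content = compute_content_cost(a_C, a_G)
    print("J_content = " + str(J_content.eval()))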

 

Computing the style cost

For our running example, we will use the following style image (S), a painting by Claude Monet in the impressionist style.

claude-monet

9.png

def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)
    
    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    
    GA = tf.matmul(A, tf.transpose(A))
    return GA

10.png

def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G
    
    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()
    
    # Reshape the images to have them of shape (n_C, n_H*n_W)
    a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
    a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))
    
    # Computing gram_matrices for both images S and G
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)
    
    # Computing the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2 / (4. * (n_H * n_W * n_C)**2))
    
    return J_style_layer

11.png

  • The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results by combining this representation from multiple different layers (a sketch of such a multi-layer style cost is given below). This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
  • Minimizing the style cost will cause the image G to follow the style of the image S.
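
The following is a minimal sketch of how the per-layer style costs could be combined across layers. The layer names and equal weights in STYLE_LAYERS are an assumption (any set of weights can be used), and the sketch assumes an active TensorFlow session sess in which the style image has already been assigned as the model's input, as done in the driver code later.

# Layers (and weights) over which the style costs are combined: an assumption,
# following the common practice of mixing shallow and deep layers.
STYLE_LAYERS = [
    ('conv1_1', 0.2),
    ('conv2_1', 0.2),
    ('conv3_1', 0.2),
    ('conv4_1', 0.2),
    ('conv5_1', 0.2)]

def compute_style_cost(model, STYLE_LAYERS):
    """
    Computes the overall style cost from several chosen layers
    """
    J_style = 0

    for layer_name, coeff in STYLE_LAYERS:
        # Output tensor of the currently selected layer
        out = model[layer_name]
        # a_S: activations of this layer for the style image (requires that the
        # style image has already been assigned as the model input in session sess)
        a_S = sess.run(out)
        # a_G references the same layer's activations; it will be evaluated later,
        # when the generated image is the model input
        a_G = out
        # Weighted style cost of this layer, accumulated into the total
        J_style += coeff * compute_layer_style_cost(a_S, a_G)

    return J_style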

 

Defining the total cost to optimize

Finally, let's define and implement a total cost function that combines both the style and the content costs. The formula is:

 

12.png

def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function
    
    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost
    
    Returns:
    J -- total cost as defined by the formula above.
    """
    
    J = alpha * J_content + beta * J_style
    return J

  • The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
  • α and β are hyperparameters that control the relative weighting between content and style.

 

Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

  • Create an Interactive Session
  • Load the content image
  • Load the style image
  • Randomly initialize the image to be generated
  • Load the VGG19 model
  • Build the TensorFlow graph:
    • Run the content image through the VGG19 model and compute the content cost.
    • Run the style image through the VGG19 model and compute the style cost.
    • Compute the total cost.
    • Define the optimizer and the learning rate.
  • Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).
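
A minimal sketch of this step is given below. The exact file names, the imageio reader, and the helper reshape_and_normalize_image (which adds a batch dimension and subtracts the VGG means, as in the assignment's utility code) are assumptions here.

import imageio

# Load the content image and bring it to the shape/range the VGG-19 model expects
content_image = imageio.imread("louvre_small.jpg")      # assumed file name
content_image = reshape_and_normalize_image(content_image)

# Load the style image and normalize it the same way
style_image = imageio.imread("claude-monet.jpg")        # assumed file name
style_image = reshape_and_normalize_image(style_image)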

Now, we initialize the “generated” image as a noisy image created from the content_image. Initializing the pixels of the generated image to be mostly noise, but still slightly correlated with the content image, helps the content of the “generated” image match the content of the “content” image more rapidly. The following figure shows the noisy image:

13.png

Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

  • Assign the content image to be the input to the VGG model.
  • Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
  • Set a_G to be the tensor giving the hidden layer activation for the same layer.
  • Compute the content cost using a_C and a_G.
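A minimal sketch of this wiring, assuming the pre-trained VGG-19 has been loaded as a dictionary model mapping layer names to tensors, an InteractiveSession sess is open, and compute_content_cost is the content-cost function defined earlier:

# Assign the content image to be the input of the VGG model
sess.run(model['input'].assign(content_image))

# Output tensor of layer conv4_2
out = model['conv4_2']

# a_C: the conv4_2 activations for the content image (a constant numpy array)
a_C = sess.run(out)

# a_G: the same tensor; it will be re-evaluated with the generated image during optimization
a_G = out

J_content = compute_content_cost(a_C, a_G)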

Next, we compute the style cost and then the total cost J as a linear combination of the two, using alpha = 10 and beta = 40.

Then we set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the TensorFlow graph, assign the input image (the initial generated image) as the input of the VGG19 model, and run the model to minimize the total cost J for a large number of iterations.
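Putting these steps together, a minimal sketch of the optimization loop might look like the following (the helper name model_nn, the number of iterations and the logging interval are assumptions):

optimizer = tf.train.AdamOptimizer(2.0)
train_step = optimizer.minimize(J)

def model_nn(sess, input_image, num_iterations=200):
    sess.run(tf.global_variables_initializer())
    sess.run(model['input'].assign(input_image))      # feed the noisy initial image

    for i in range(num_iterations):
        sess.run(train_step)                          # one step of minimizing the total cost J
        generated_image = sess.run(model['input'])    # current state of the generated image
        if i % 20 == 0:
            Jt, Jc, Js = sess.run([J, J_content, J_style])
            print("Iteration %d: total = %.1f, content = %.1f, style = %.1f" % (i, Jt, Jc, Js))
    return generated_image

generated_image = model_nn(sess, generated_image)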

Results

The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.


Content

louvre_small

Style (Claude Monet’s The Poppy Field near Argenteuil)

claude-monet

Generated

l



Content

cat.jpg

Style

paint.jpg

Generated

cat


Content

circle.jpg

Style

sea.jpg

Generated

circle


Content

content1.jpg

Style (Van Gogh’s The Starry Night)

vstyle.jpg

Generated

content1


Content

content2.jpg

Style

style2.jpg

Generated

content2.gif


Content (Victoria Memorial Hall)

vic.jpg

Style (Van Gogh’s The Starry Night)

vstyle.jpg

Generated

vic


Content (Taj Mahal)

taj.jpg

Style (Van Gogh’s Starry Night Over the Rhone)

van8.jpg

Generated

taj.gif


Content

in

Style (Claude Monet’s Sunset in Venice)

monet5.png

Generated

in.gif


Content (Visva Bharati)

biswa.jpg

Style (Abanindranath Tagore’s Rabindranath in the role of blind singer)
aban2.jpg

Generated

biswa.gif

 


Content (Howrah Bridge)

hwhbr.jpg

Style (Van Gogh’s The Starry Night)

vstyle.jpg

Generated

hwhbr.gif

 


Content (Leonardo Da Vinci’s Mona Lisa)

monalisa

Style (Van Gogh’s The Starry Night)
vstyle.jpg

Generated

monalisa


Content (My sketch: Rabindranath Tagore)

rabi.png

Style (Abanindranath Tagore’s Rabindranath in the role of blind singer)
aban2.jpg

Generated
rabi.gif


Content (me)

me.jpg

Style (Van Gogh’s Irises)

van.jpg

Generated

me.gif

 


Content

me.jpg

Style

stars.jpg

Generated

stars.gif


Content

me2

Style (Pablo Picasso’s Factory at Horta de Ebro)

picaso3.jpg

Generated

me2.gif


The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.

Content

content1.jpg

Style (Van Gogh’s The Starry Night)

vstyle.jpg

Generated

convolution layer 3_2 used

content1_32

convolution layer 4_2 used

content1_42.gif

convolution layer 5_2 used

content1_52

 

Some Deep Learning with Python, TensorFlow and Keras

 

The following problems are taken from a few assignments from the coursera courses Introduction to Deep Learning (by Higher School of Economics) and Neural Networks and Deep Learning (by Prof Andrew Ng, deeplearning.ai). The problem descriptions are taken straightaway from the assignments.

 

1. Linear models, Optimization

In this assignment a linear classifier will be implemented and it will be trained using stochastic gradient descent with numpy.

Two-dimensional classification

To make things more intuitive, let’s solve a 2D classification problem with synthetic data.

data

Features

As we can see, the data above isn’t linearly separable. Hence we should add features (or use a non-linear model). Since the decision boundary between the two classes has the form of a circle, we can add quadratic features to make the problem linearly separable in the expanded feature space. The idea is illustrated in the image below:

kernel.png

Here are some test results for the implemented expand function, which is used for adding the quadratic features:

# simple test on random numbers

dummy_X = np.array([
        [0,0],
        [1,0],
        [2.61,-1.28],
        [-0.59,2.1]
    ])

# call expand function
dummy_expanded = expand(dummy_X)

# what it should have returned:   x0       x1       x0^2     x1^2     x0*x1    1
dummy_expanded_ans = np.array([[ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ],
                               [ 1.    ,  0.    ,  1.    ,  0.    ,  0.    ,  1.    ],
                               [ 2.61  , -1.28  ,  6.8121,  1.6384, -3.3408,  1.    ],
                               [-0.59  ,  2.1   ,  0.3481,  4.41  , -1.239 ,  1.    ]])
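For completeness, here is a minimal sketch of expand itself, producing the columns x0, x1, x0^2, x1^2, x0*x1, 1 expected above:

import numpy as np

def expand(X):
    """Add quadratic features to X of shape [n_samples, 2]; returns shape [n_samples, 6]."""
    x0, x1 = X[:, 0], X[:, 1]
    return np.column_stack([x0, x1, x0**2, x1**2, x0*x1, np.ones(X.shape[0])])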

 

Logistic regression

To classify objects we will estimate the probability that an object belongs to class ‘1’. To predict this probability we will use the output of the linear model passed through the logistic (sigmoid) function:

f1.png

def probability(X, w):
    """
    Given input features and weights
    return predicted probabilities of y==1 given x, P(y=1|x), see description above
        
    :param X: feature matrix X of shape [n_samples,6] (expanded)
    :param w: weight vector w of shape [6] for each of the expanded features
    :returns: an array of predicted probabilities in [0,1] interval.
    """

    return 1. / (1 + np.exp(-np.dot(X, w)))

In logistic regression the optimal parameters w are found by cross-entropy minimization:

f2.png

def compute_loss(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute scalar loss function using formula above.
    """
    return -np.mean(y*np.log(probability(X, w)) + (1-y)*np.log(1-probability(X, w)))

Since we train our model with gradient descent, we should compute gradients. To be specific, we need the following derivative of the loss function with respect to each weight:

f31

Here is the derivation (can be found here too):

proof.png

def compute_grad(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute the vector [6] of derivatives of L with respect to each weight.
    """
    
    return np.dot((probability(X, w) - y), X) / X.shape[0]

Training

In this section we’ll use the functions we wrote to train our classifier using stochastic gradient descent. We shall try to change hyper-parameters like batch size, learning rate and so on to find the best one.

Mini-batch SGD

Stochastic gradient descent just takes a random example on each iteration, calculates a gradient of the loss on it and makes a step:

f4.png

w = np.array([0, 0, 0, 0, 0, 1]) # initialize

eta = 0.05 # learning rate
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    w = w - eta*dw

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD where batch-size=4.

gd.gif

SGD with momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations, as can be seen in the image below. It does this by adding a fraction α of the update vector of the past time step to the current update vector.

f5

sgd.png

eta = 0.05 # learning rate
alpha = 0.9 # momentum
nu = np.zeros_like(w)
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    nu = alpha*nu + eta*dw
    w = w - nu

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + momentum  where batch-size=4. As can be seen, the loss function drops much faster, leading to a faster convergence.

mgd.gif

RMSprop

We also need to implement the RMSProp algorithm, which uses squared gradients to adjust the learning rate as follows:

f6

eta = 0.05 # learning rate
alpha = 0.9 # moving-average decay rate for the squared gradients
G = np.zeros_like(w)
eps = 1e-8
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    G = alpha*G + (1-alpha)*dw**2
    w = w - eta*dw / np.sqrt(G + eps)

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + RMSProp where batch-size=4. As can be seen again, the loss function drops much faster, leading to a faster convergence.

rgd.gif

2. Planar data classification with a neural network with one hidden layer, an implementation from scratch

In this assignment a neural net with a single hidden layer will be trained from scratch. We shall see a big difference between this model and the one implemented using logistic regression.

We shall learn how to:

  • Implement a 2-class classification neural network with a single hidden layer
  • Use units with a non-linear activation function, such as tanh
  • Compute the cross entropy loss
  • Implement forward and backward propagation

Dataset

The following figure visualizes a “flower” 2-class dataset that we shall work on; the colors indicate the class labels. We have m = 400 training examples.

flower.png

Simple Logistic Regression

Before building a full neural network, let’s first see how logistic regression performs on this problem. We can use sklearn’s built-in functions to do that, by running the code below to train a logistic regression classifier on the dataset.

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV();
clf.fit(X.T, Y.T);

We can now plot the decision boundary of the model and accuracy with the following code.

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")

# Print accuracy
LR_predictions = clf.predict(X.T)
print ('Accuracy of logistic regression: %d ' % float((np.dot(Y,LR_predictions) + np.dot(1-Y,1-LR_predictions))/float(Y.size)*100) +
       '% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)
sklearn.png
Accuracy 47%

Interpretation: The dataset is not linearly separable, so logistic regression doesn’t perform well. Hopefully a neural network will do better. Let’s try this now!

 

Neural Network model

Logistic regression did not work well on the “flower dataset”. We are going to train a Neural Network with a single hidden layer, by implementing the network with python numpy from scratch.

Here is our model:

classification_kiank.png

The general methodology to build a Neural Network is to:

1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

Defining the neural network structure

Define three variables and the function layer_sizes:

- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 4) 
- n_y: the size of the output layer
def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """

Initialize the model’s parameters

Implement the function initialize_parameters().

Instructions:

  • Make sure the parameters’ sizes are right. Refer to the neural network figure above if needed.
  • We will initialize the weights matrices with random values.
    • Use: np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
  • We will initialize the bias vectors as zeros.
    • Use: np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    
    Returns:
    params -- python dictionary containing the parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

The Loop

Implement forward_propagation().

Instructions:

  • Look above at the mathematical representation of the classifier.
  • We can use the function sigmoid().
  • We can use the function np.tanh(). It is part of the numpy library.
  • The steps we have to implement are:
    1. Retrieve each parameter from the dictionary “parameters” (which is the output of initialize_parameters()) by using parameters[".."].
    2. Implement Forward Propagation. Compute Z[1], A[1], Z[2] and A[2] (the vector of all the predictions on all the examples in the training set).
  • Values needed in the backpropagation are stored in “cache“. The cache will be given as an input to the backpropagation function.
def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing the parameters (output of initialization function)
    
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """

Implement compute_cost() to compute the value of the cost J.  There are many ways to implement the cross-entropy loss.

f1.png

def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)
    
    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing the parameters W1, b1, W2 and b2
    
    Returns:
    cost -- cross-entropy cost given equation (13)
    """    

Using the cache computed during forward propagation, we can now implement backward propagation.

Implement the function backward_propagation().

Instructions: Backpropagation is usually the hardest (most mathematical) part in deep learning. The following figure is taken from the slide from the lecture on backpropagation. We’ll want to use the six equations on the right of this slide, since we are building a vectorized implementation.

grad_summary.png

def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    
    Returns:
    grads -- python dictionary containing the gradients with respect to different parameters
    """

Implement the update rule. Use gradient descent. We have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

General gradient descent rule: θ = θ - α (∂J/∂θ), where α is the learning rate and θ represents a parameter.

Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.

sgd.gif

def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    grads -- python dictionary containing our gradients 
    
    Returns:
    parameters -- python dictionary containing our updated parameters 
    """

Integrate previous parts in nn_model()

Build the neural network model in nn_model().

Instructions: The neural network model has to use the previous functions in the right order.

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

Predictions

Use the model to predict by building predict(). Use forward propagation to predict results.

f2.png

def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    
    Arguments:
    parameters -- python dictionary containing our parameters 
    X -- input data of size (n_x, m)
    
    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """

It is time to run the model and see how it performs on a planar dataset. Run the following code to test our model with a single hidden layer of n_h hidden units.

# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(X, Y, n_h = 4, num_iterations = 10000, print_cost=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.288083
Cost after iteration 2000: 0.254385
Cost after iteration 3000: 0.233864
Cost after iteration 4000: 0.226792
Cost after iteration 5000: 0.222644
Cost after iteration 6000: 0.219731
Cost after iteration 7000: 0.217504
Cost after iteration 8000: 0.219471
Cost after iteration 9000: 0.218612

nnout.png

Cost after iteration 9000 0.218607
# Print accuracy
predictions = predict(parameters, X)
print ('Accuracy: %d' % float((np.dot(Y,predictions.T) + np.dot(1-Y,1-predictions.T))/float(Y.size)*100) + '%')
 Accuracy: 90%

Accuracy is really high compared to Logistic Regression. The model has learnt the leaf patterns of the flower! Neural networks are able to learn even highly non-linear decision boundaries, unlike logistic regression.

Now, let’s try out several hidden layer sizes. We can observe different behaviors of the model for various hidden layer sizes. The results are shown below.

Tuning hidden layer size

Accuracy for 1 hidden units: 67.5 %
Accuracy for 2 hidden units: 67.25 %
Accuracy for 3 hidden units: 90.75 %
Accuracy for 4 hidden units: 90.5 %
Accuracy for 5 hidden units: 91.25 %
Accuracy for 20 hidden units: 90.0 %
Accuracy for 50 hidden units: 90.25 %

hh.png

Interpretation:

  • The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
  • The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fit the data well without incurring noticeable overfitting.
  • We shall also learn later about regularization, which lets us use very large models (such as n_h = 50) without much overfitting.

 

3. Getting deeper with Keras

 

  • Tensorflow is a powerful and flexible tool, but coding large neural architectures with it is tedious.
  • There are plenty of deep learning toolkits that work on top of it like Slim, TFLearn, Sonnet, Keras.
  • The choice is a matter of taste and the particular task.
  • We’ll be using Keras to predict handwritten digits with the mnist dataset.
  • The following figure shows 225 sample images from the dataset.

 

mnist.png

 

The pretty keras

Using only the following few lines of code we can learn a simple deep neural net with 3 dense hidden layers with ReLU activation, and a dropout of 0.5 after each dense layer.

import keras
from keras.models import Sequential
import keras.layers as ll

model = Sequential(name="mlp")
model.add(ll.InputLayer([28, 28]))
model.add(ll.Flatten())

# network body
model.add(ll.Dense(128))
model.add(ll.Activation('relu'))
model.add(ll.Dropout(0.5))

model.add(ll.Dense(128))
model.add(ll.Activation('relu'))
model.add(ll.Dropout(0.5))

model.add(ll.Dense(128))
model.add(ll.Activation('relu'))
model.add(ll.Dropout(0.5))

# output layer: 10 neurons for each class with softmax
model.add(ll.Dense(10, activation='softmax'))

# categorical_crossentropy is our good old crossentropy
# but applied for one-hot-encoded vectors
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])

The following shows the summary of the model:

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_12 (InputLayer)        (None, 28, 28)            0         
_________________________________________________________________
flatten_12 (Flatten)         (None, 784)               0         
_________________________________________________________________
dense_35 (Dense)             (None, 128)               100480    
_________________________________________________________________
activation_25 (Activation)   (None, 128)               0         
_________________________________________________________________
dropout_22 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_36 (Dense)             (None, 128)               16512     
_________________________________________________________________
activation_26 (Activation)   (None, 128)               0         
_________________________________________________________________
dropout_23 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_37 (Dense)             (None, 128)               16512     
_________________________________________________________________
activation_27 (Activation)   (None, 128)               0         
_________________________________________________________________
dropout_24 (Dropout)         (None, 128)               0         
_________________________________________________________________
dense_38 (Dense)             (None, 10)                1290      
=================================================================
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0
_________________________________________________________________

Model interface

Keras models follow Scikit-learn‘s interface of fit/predict with some notable extensions. Let’s take a tour.

# fit(X,y) ships with a neat automatic logging.
#          Highly customizable under the hood.
model.fit(X_train, y_train,
          validation_data=(X_val, y_val), epochs=13);
 Train on 50000 samples, validate on 10000 samples
Epoch 1/13
50000/50000 [==============================] - 14s - loss: 0.1489 - acc: 0.9587 - val_loss: 0.0950 - val_acc: 0.9758
Epoch 2/13
50000/50000 [==============================] - 12s - loss: 0.1543 - acc: 0.9566 - val_loss: 0.0957 - val_acc: 0.9735
Epoch 3/13
50000/50000 [==============================] - 11s - loss: 0.1509 - acc: 0.9586 - val_loss: 0.0985 - val_acc: 0.9752
Epoch 4/13
50000/50000 [==============================] - 11s - loss: 0.1515 - acc: 0.9577 - val_loss: 0.0967 - val_acc: 0.9752
Epoch 5/13
50000/50000 [==============================] - 11s - loss: 0.1471 - acc: 0.9596 - val_loss: 0.1008 - val_acc: 0.9737
Epoch 6/13
50000/50000 [==============================] - 11s - loss: 0.1488 - acc: 0.9598 - val_loss: 0.0989 - val_acc: 0.9749
Epoch 7/13
50000/50000 [==============================] - 11s - loss: 0.1495 - acc: 0.9592 - val_loss: 0.1011 - val_acc: 0.9748
Epoch 8/13
50000/50000 [==============================] - 11s - loss: 0.1434 - acc: 0.9604 - val_loss: 0.1005 - val_acc: 0.9761
Epoch 9/13
50000/50000 [==============================] - 11s - loss: 0.1514 - acc: 0.9590 - val_loss: 0.0951 - val_acc: 0.9759
Epoch 10/13
50000/50000 [==============================] - 11s - loss: 0.1424 - acc: 0.9613 - val_loss: 0.0995 - val_acc: 0.9739
Epoch 11/13
50000/50000 [==============================] - 11s - loss: 0.1408 - acc: 0.9625 - val_loss: 0.0977 - val_acc: 0.9751
Epoch 12/13
50000/50000 [==============================] - 11s - loss: 0.1413 - acc: 0.9601 - val_loss: 0.0938 - val_acc: 0.9753
Epoch 13/13
50000/50000 [==============================] - 11s - loss: 0.1430 - acc: 0.9619 - val_loss: 0.0981 - val_acc: 0.9761

As we can see, with a simple model without any convolution layers we obtain more than 97.5% accuracy on the validation dataset.
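The same scikit-learn-style interface also covers evaluation and prediction; here is a minimal sketch (X_test and y_test are assumed to be prepared the same way as the training data):

# evaluate() returns [loss, accuracy] for the metrics configured in compile()
test_loss, test_acc = model.evaluate(X_test, y_test)
print("Test accuracy:", test_acc)

# softmax probabilities and hard class predictions for a few test images
proba = model.predict(X_test[:5])        # shape (5, 10)
pred_classes = proba.argmax(axis=-1)     # predicted digit for each image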

The following figures show the weights learnt at different layers.

l1l2l3l4

Some Tips & tricks to improve accuracy

Here are some tips on what we can do to improve accuracy:

  • Network size
    • More neurons,
    • More layers, (docs)
    • Nonlinearities in the hidden layers
      • tanh, relu, leaky relu, etc
    • Larger networks may take more epochs to train, so don’t discard the net just because it didn’t beat the baseline in 5 epochs.
  • Early Stopping
    • Training for 100 epochs regardless of anything is probably a bad idea.
    • Some networks converge over 5 epochs, others – over 500.
    • Way to go: stop when the validation score has not improved for 10 iterations past its maximum
  • Faster optimization
    • rmsprop, nesterov_momentum, adam, adagrad and so on.
      • Converge faster and sometimes reach better optima
      • It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
  • Regularize to prevent overfitting
  • Data augmentation – getting a 5x larger dataset for free is a great deal (a minimal sketch follows this list)
    • https://keras.io/preprocessing/image/
    • Zoom-in+slice = move
    • Rotate+zoom(to remove black stripes)
    • any other perturbations
    • Simple way to do that (if we have PIL/Image):
      • from scipy.misc import imrotate,imresize
      • and a few slicing
    • Stay realistic. There’s usually no point in flipping dogs upside down as that is not the way we usually see them.
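A minimal sketch of such augmentation with Keras; the perturbation parameters below are illustrative only, and the MNIST arrays are assumed to be reshaped with a channel axis:

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rotation_range=10,       # small random rotations
                             width_shift_range=0.1,   # horizontal shifts
                             height_shift_range=0.1,  # vertical shifts
                             zoom_range=0.1)          # random zoom in/out

X_aug = X_train.reshape(-1, 28, 28, 1)                # the generator expects a channel axis
batches = datagen.flow(X_aug, y_train, batch_size=128)
X_batch, y_batch = next(batches)                      # one augmented mini-batch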

Unsupervised Deep learning with AutoEncoders on the MNIST dataset (with Tensorflow in Python)

  • Deep learning,  although primarily used for supervised classification / regression problems, can also be used as an unsupervised ML technique, the autoencoder being a classic example. As explained here,  the aim of an autoencoder is to learn a representation (encoding) for a set of data, typically for the purpose of  dimensionality reduction.
  • Deep learning can be used to learn a different representation (typically a set of input features in a  low-dimensional space) of the data that can be used for pre-training for example in transfer-learning.
  • In this article, a few vanilla autoencoder implementations will be demonstrated for the mnist dataset.
  • The following figures describe the theory (ref: coursera course Neural Networks for Machine Learning, 2012, by Prof. Hinton, University of Toronto). As explained, an autoencoder with a back-propagation-based implementation can be used to generalize linear dimension-reduction techniques such as PCA, since the hidden layers can learn non-linear manifolds with non-linear activation functions (e.g., relu, sigmoid).

 

p2p3

 

  • The input and output units of an autoencoder are identical; the idea is to learn the input itself as a different representation with one or multiple hidden layer(s).
  • The mnist images are of size 28×28, so the number of nodes in the input and the output layer are always 784 for the autoencoders shown in this article.
  • The left side of an auto-encoder network is typically a mirror image of the right side and the weights are tied (weights learnt in the left hand side of the network are reused, to better reproduce the input at the output).
  • The next figures and animations show the outputs for the following simple autoencoder with just 1 hidden layer, with the input mnist data. A ReLU activation function is used in the hidden layer, along with L2 regularization on the weights learnt (a minimal sketch of such a network is given after the figures below). As can be seen from the next figure, the inputs are roughly reproduced, with some variation, at the output layer, as expected.

aec1
a1i_0.001

a1o_0.001
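For reference, here is a minimal TensorFlow sketch of such a single-hidden-layer autoencoder with tied weights, a ReLU hidden layer and L2 regularization on the weights; the number of hidden units and the regularization strength below are assumptions:

import tensorflow as tf

n_input, n_hidden, lam = 784, 400, 0.001

X = tf.placeholder(tf.float32, [None, n_input])
W = tf.Variable(tf.truncated_normal([n_input, n_hidden], stddev=0.1))
b_h = tf.Variable(tf.zeros([n_hidden]))
b_o = tf.Variable(tf.zeros([n_input]))

hidden = tf.nn.relu(tf.matmul(X, W) + b_h)             # encoder
output = tf.matmul(hidden, tf.transpose(W)) + b_o      # decoder reusing (tied) weights

loss = tf.reduce_mean(tf.square(output - X)) + lam * tf.nn.l2_loss(W)
train_step = tf.train.AdamOptimizer(0.001).minimize(loss)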

  • The next animations visualize the hidden layer weights learnt (for the 400 hidden units) and the output of the autoencoder with the same input training dataset, with a different value of the regularization parameter.

 

a1h_10

a1o_10

 

  • The next figure visualizes the hidden layer weights learnt with yet another different regularization parameter value.

 

hidden

 

  • The next animation visualizes the output of the autoencoder with the same input training dataset, but this time with no activation function used at the hidden layer.

 

a1o_0.001.na

  • The next animations show the results with a deeper autoencoder with 3 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.

 

aec2

a2i_0.001a2o_0.001

 

  • Let’s implement an even deeper autoencoder. The next animations show the results with a deeper autoencoder with 5 hidden layers (the architecture shown below). As before, the weights are tied and in this case no activation function is used, with L2 regularization on the weights.

 

aec3
a2i_0.001a3o_0.00001

 

Dogs vs. Cats: Image Classification with Deep Learning using TensorFlow in Python

The problem

Given a set of labeled images of cats and dogs, a machine learning model is to be learnt and later used to classify a set of new images as cats or dogs. This problem appeared in a Kaggle competition and the images are taken from this kaggle dataset.

  • The original dataset contains a huge number of images (25,000 labeled cat/dog images for training and 12,500 unlabeled images for test).
  • Only a few sample images are chosen (1100 labeled images for cat/dog as training and 1000 images from the test dataset) from the dataset, just for the sake of  quick demonstration of how to solve this problem using deep learning (motivated by the Udacity course Deep Learning by Google), which is going to be described (along with the results) in this article.
  • The sample test images chosen are manually labeled to compute model accuracy later with the model-predicted labels.
  • The accuracy on the test dataset is not going to be good in general for the above-mentioned reason. In order to obtain good accuracy  on the test dataset using deep learning, we need to train the models with a large number of input images (e.g., with all the training images from the kaggle dataset).
  • A few sample labeled images from the training dataset are shown below.

Dogs
download.png

Cats
download (1).png

  • As a pre-processing step, all the images are first resized to 50×50 pixel images.

Classification with a few off-the-self classifiers

  • First, each image from the training dataset is flattened and represented as 2500-length vectors (one for each channel).
  • Next, a few sklearn models are trained on this flattened data. Here are the results.

m1
m2
m4m3

As shown above, the test accuracy is quite poor with a few sophisticated off-the-shelf classifiers.
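A minimal sketch of how such baselines can be fit on the flattened vectors; the particular classifiers and the variable names (X_train, y_train, X_test, y_test, with 0 = cat and 1 = dog) are assumptions:

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

for clf in [LogisticRegression(), RandomForestClassifier(n_estimators=100), SVC()]:
    clf.fit(X_train, y_train)                                   # train on flattened images
    print(type(clf).__name__, 'test accuracy:', clf.score(X_test, y_test))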

Classifying images using Deep Learning with Tensorflow

Now let’s first train a logistic regression and then a couple of neural network models by introducing L2 regularization for both the models.

  • First, all the images are converted to gray-scale images.
  • The following figures visualize the weights learnt for the cat vs. the dog class during training the logistic regression model with SGD with L2-regularization (λ=0.1, batch size=128).

cd_nn_no_hidden.png

clr.png

Test accuracy: 53.6%
  • The following animation visualizes the weights learnt for 400 randomly selected hidden units, using a neural net with a single hidden layer with 4096 hidden nodes, by training the neural net model with SGD with L2-regularization (λ1=λ2=0.05, batch size=128).

cd_nn_hidden1.png

    Minibatch loss at step 0: 198140.156250
    Minibatch accuracy: 50.0%
    Validation accuracy: 50.0%

    Minibatch loss at step 500: 0.542070
    Minibatch accuracy: 89.8%
    Validation accuracy: 57.0%

    Minibatch loss at step 1000: 0.474844
    Minibatch accuracy: 96.9%
    Validation accuracy: 60.0%

    Minibatch loss at step 1500: 0.571939
    Minibatch accuracy: 85.9%
    Validation accuracy: 56.0%

    Minibatch loss at step 2000: 0.537061
    Minibatch accuracy: 91.4%
    Validation accuracy: 63.0%

    Minibatch loss at step 2500: 0.751552
    Minibatch accuracy: 75.8%
    Validation accuracy: 57.0%

    Minibatch loss at step 3000: 0.579084
    Minibatch accuracy: 85.9%
    Validation accuracy: 54.0%

    Test accuracy: 57.8%
    

 

n1.gif

Clearly, the model learnt above overfits the training dataset; the test accuracy improved a bit, but is still quite poor.

  • Now, let’s train a deeper neural net with two hidden layers, the first one with 1024 nodes and the second one with 64 nodes.
    cd_nn_hidden2.png
  • Minibatch loss at step 0: 1015.947266
    Minibatch accuracy: 43.0%
    Validation accuracy: 50.0%
  • Minibatch loss at step 500: 0.734610
    Minibatch accuracy: 79.7%
    Validation accuracy: 55.0%
  • Minibatch loss at step 1000: 0.615992
    Minibatch accuracy: 93.8%
    Validation accuracy: 55.0%
  • Minibatch loss at step 1500: 0.670009
    Minibatch accuracy: 82.8%
    Validation accuracy: 56.0%
  • Minibatch loss at step 2000: 0.798796
    Minibatch accuracy: 77.3%
    Validation accuracy: 58.0%
  • Minibatch loss at step 2500: 0.717479
    Minibatch accuracy: 84.4%
    Validation accuracy: 55.0%
  • Minibatch loss at step 3000: 0.631013
    Minibatch accuracy: 90.6%
    Validation accuracy: 57.0%
  • Minibatch loss at step 3500: 0.739071
    Minibatch accuracy: 75.8%
    Validation accuracy: 54.0%
  • Minibatch loss at step 4000: 0.698650
    Minibatch accuracy: 84.4%
    Validation accuracy: 55.0%
  • Minibatch loss at step 4500: 0.666173
    Minibatch accuracy: 85.2%
    Validation accuracy: 51.0%
  • Minibatch loss at step 5000: 0.614820
    Minibatch accuracy: 92.2%
    Validation accuracy: 58.0%
Test accuracy: 55.2%
  • The following animation visualizes the weights learnt for 400 randomly selected hidden units from the first hidden layer, by training the neural net model with SGD with L2-regularization (λ1=λ2=λ3=0.1, batch size=128, dropout rate=0.6).

    h1c
  • The next animation visualizes the weights learnt for all the 64 hidden units of the second hidden layer.

    h2c
  • Clearly, the second, deeper neural net model learnt above overfits the training dataset even more, and the test accuracy decreased a bit.

Classifying images with Deep Convolution Network

Let’s use the following conv-net shown in the next figure.

 

f8.png

As shown above, the ConvNet uses the following configuration (a minimal sketch of one convolution + max-pooling block is given after the list):

  • convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 1×1 stride 
    • SAME padding
  • Max pooling layers each with
    • 2×2 kernel
    • 2×2 stride 
  • 64 hidden nodes
  • 128 batch size
  • 5K iterations
  • 0.7 dropout rate
  • No learning decay
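A minimal TensorFlow sketch of one convolution + max-pooling block with these settings; the variable names are assumptions, and the full network stacks such blocks before the fully connected layer:

# x: batch of gray-scale input images, shape (batch, 50, 50, 1)
W_conv = tf.Variable(tf.truncated_normal([5, 5, 1, 16], stddev=0.1))    # 5x5 kernel, 16 filters
b_conv = tf.Variable(tf.zeros([16]))

conv = tf.nn.conv2d(x, W_conv, strides=[1, 1, 1, 1], padding='SAME') + b_conv
relu = tf.nn.relu(conv)
pool = tf.nn.max_pool(relu, ksize=[1, 2, 2, 1], strides=[1, 2, 2, 1], padding='SAME')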

Results

Minibatch loss at step 0: 1.783917
Minibatch accuracy: 55.5%
Validation accuracy: 50.0%

Minibatch loss at step 500: 0.269719
Minibatch accuracy: 89.1%
Validation accuracy: 54.0%

Minibatch loss at step 1000: 0.045729
Minibatch accuracy: 96.9%
Validation accuracy: 61.0%

Minibatch loss at step 1500: 0.015794
Minibatch accuracy: 100.0%
Validation accuracy: 61.0%

Minibatch loss at step 2000: 0.028912
Minibatch accuracy: 98.4%
Validation accuracy: 64.0%

Minibatch loss at step 2500: 0.007787
Minibatch accuracy: 100.0%
Validation accuracy: 62.0%

Minibatch loss at step 3000: 0.001591
Minibatch accuracy: 100.0%
Validation accuracy: 63.0%

Test accuracy: 61.3%

The following animations show the features learnt (for the first 16 images for each SGD batch) at different convolution and Maxpooling layers:

c1n.gif

c2n.gif

c3n

c4n

  • Clearly, the simple convolution neural net outperforms all the previous models in terms of test accuracy, as shown below.

f14.png

  • Only 1100 labeled images (randomly chosen from the training dataset) were used to train the model and predict 1000 test images (randomly chosen from the test dataset). Clearly the accuracy can be improved a lot if a large number of images are used for training with deeper / more complex networks (with more parameters to learn).

 

 

 

Deep Learning with TensorFlow in Python: Convolution Neural Nets

The following problems appeared in the assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments (continued from the last post).

Classifying the alphabets with notMNIST dataset with Deep Network

Here is what some sample images from the dataset look like:

im13

Let’s try to get the best performance using a multi-layer model! (The best reported test accuracy using a deep network is 97.1%).

  • One avenue you can explore is to add multiple layers.
  • Another one is to use learning rate decay.

 

Learning L2-Regularized  Deep Neural Network with SGD

The following figure recapitulates the neural network with 3 hidden layers, the first one with 2048 nodes, the second one with 512 nodes and the third one with 128 nodes, each with Relu intermediate outputs. The L2 regularizations applied on the loss function for the weights learnt at the input and the hidden layers are λ1, λ2, λ3 and λ4, respectively.

nn_hidden3.png

 

The next 3 animations visualize the weights learnt for 400 randomly selected nodes from hidden layer 1 (out of 2048 nodes), then another 400 randomly selected nodes from hidden layer 2 (out of 512 nodes), and finally all 128 nodes from hidden layer 3, at different steps using SGD and the L2 regularized loss function (with λ1=λ2=λ3=λ4=0.01). As can be seen below, the weights learnt are gradually capturing (as the SGD step increases) the different features of the alphabets at the corresponding output neurons.

 

hidden1

 

hidden2
hidden3

Results with SGD

Initialized
Validation accuracy: 27.6%
Minibatch loss at step 0: 4.638808
Minibatch accuracy: 7.8%
Validation accuracy: 27.6%

Validation accuracy: 86.3%
Minibatch loss at step 500: 1.906724
Minibatch accuracy: 86.7%
Validation accuracy: 86.3%

Validation accuracy: 86.9%
Minibatch loss at step 1000: 1.333355
Minibatch accuracy: 87.5%
Validation accuracy: 86.9%

Validation accuracy: 87.3%
Minibatch loss at step 1500: 1.056811
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%

Validation accuracy: 87.5%
Minibatch loss at step 2000: 0.633034
Minibatch accuracy: 93.8%
Validation accuracy: 87.5%

Validation accuracy: 87.5%
Minibatch loss at step 2500: 0.696114
Minibatch accuracy: 85.2%
Validation accuracy: 87.5%

Validation accuracy: 88.3%
Minibatch loss at step 3000: 0.737464
Minibatch accuracy: 86.7%
Validation accuracy: 88.3%

Test accuracy: 93.6%

f2

 

Batch size = 128, number of iterations = 3001 and Drop-out rate = 0.8 for training dataset are used for the above set of experiments, with learning decay. We can play with the hyper-parameters to get better test accuracy.

Convolution Neural Network

Previously  we trained fully connected networks to classify notMNIST characters. The goal of this assignment is to make the neural network convolutional.

Let’s build a small network with two convolutional layers, followed by one fully connected layer. Convolutional networks are more expensive computationally, so we’ll limit its depth and number of fully connected nodes. The below figure shows the simplified architecture of the convolution neural net.

f2.png

As shown above, the ConvNet uses:

  • 2 convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 2×2 strides 
    • SAME padding
  • 64 hidden nodes
  • 16 batch size
  • 1K iterations


Results

Initialized
Minibatch loss at step 0: 3.548937
Minibatch accuracy: 18.8%
Validation accuracy: 10.0%

Minibatch loss at step 50: 1.781176
Minibatch accuracy: 43.8%
Validation accuracy: 64.7%

Minibatch loss at step 100: 0.882739
Minibatch accuracy: 75.0%
Validation accuracy: 69.5%

Minibatch loss at step 150: 0.980598
Minibatch accuracy: 62.5%
Validation accuracy: 74.5%

Minibatch loss at step 200: 0.794144
Minibatch accuracy: 81.2%
Validation accuracy: 77.6%

Minibatch loss at step 250: 1.191971
Minibatch accuracy: 62.5%
Validation accuracy: 79.1%

Minibatch loss at step 300: 0.441911
Minibatch accuracy: 87.5%
Validation accuracy: 80.5%

Minibatch loss at step 350: 0.605005
Minibatch accuracy: 81.2%
Validation accuracy: 79.3%

Minibatch loss at step 400: 1.032123
Minibatch accuracy: 68.8%
Validation accuracy: 81.5%

Minibatch loss at step 450: 0.869944
Minibatch accuracy: 75.0%
Validation accuracy: 82.1%

Minibatch loss at step 500: 0.530418
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%

Minibatch loss at step 550: 0.227771
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 600: 0.697444
Minibatch accuracy: 75.0%
Validation accuracy: 82.5%

Minibatch loss at step 650: 0.862341
Minibatch accuracy: 68.8%
Validation accuracy: 83.0%

Minibatch loss at step 700: 0.336292
Minibatch accuracy: 87.5%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213392
Minibatch accuracy: 93.8%
Validation accuracy: 82.6%

Minibatch loss at step 800: 0.553639
Minibatch accuracy: 75.0%
Validation accuracy: 83.3%

Minibatch loss at step 850: 0.533049
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.415935
Minibatch accuracy: 87.5%
Validation accuracy: 83.9%

Minibatch loss at step 950: 0.290436
Minibatch accuracy: 93.8%
Validation accuracy: 84.0%

Minibatch loss at step 1000: 0.400648
Minibatch accuracy: 87.5%
Validation accuracy: 84.0%

Test accuracy: 90.3%

f3.png

 

The following figures visualize the feature representations at different layers for the first 16 images for the last batch with SGD during training:

f4f5f6f7

The next animation shows how the features learnt at convolution layer 1 change with iterations.

c1.gif

 


Convolution Neural Network with Max Pooling

The convolutional model above uses convolutions with stride 2 to reduce the dimensionality. Replace the strides by a max pooling operation of stride 2 and kernel size 2. The below figure shows the simplified architecture of the convolution neural net with MAX Pooling layers.

f8.png

As shown above, the ConvNet uses:

  • 2 convolution layers each with
    • 5×5 kernel
    • 16 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • 64 hidden nodes
  • 16 batch size
  • 1K iterations


Results

Initialized
Minibatch loss at step 0: 4.934033
Minibatch accuracy: 6.2%
Validation accuracy: 8.9%

Minibatch loss at step 50: 2.305100
Minibatch accuracy: 6.2%
Validation accuracy: 11.7%

Minibatch loss at step 100: 2.319777
Minibatch accuracy: 0.0%
Validation accuracy: 14.8%

Minibatch loss at step 150: 2.285996
Minibatch accuracy: 18.8%
Validation accuracy: 11.5%

Minibatch loss at step 200: 1.988467
Minibatch accuracy: 25.0%
Validation accuracy: 22.9%

Minibatch loss at step 250: 2.196230
Minibatch accuracy: 12.5%
Validation accuracy: 27.8%

Minibatch loss at step 300: 0.902828
Minibatch accuracy: 68.8%
Validation accuracy: 55.4%

Minibatch loss at step 350: 1.078835
Minibatch accuracy: 62.5%
Validation accuracy: 70.1%

Minibatch loss at step 400: 1.749521
Minibatch accuracy: 62.5%
Validation accuracy: 70.3%

Minibatch loss at step 450: 0.896893
Minibatch accuracy: 75.0%
Validation accuracy: 79.5%

Minibatch loss at step 500: 0.610678
Minibatch accuracy: 81.2%
Validation accuracy: 79.5%

Minibatch loss at step 550: 0.212040
Minibatch accuracy: 93.8%
Validation accuracy: 81.0%

Minibatch loss at step 600: 0.785649
Minibatch accuracy: 75.0%
Validation accuracy: 81.8%

Minibatch loss at step 650: 0.775520
Minibatch accuracy: 68.8%
Validation accuracy: 82.2%

Minibatch loss at step 700: 0.322183
Minibatch accuracy: 93.8%
Validation accuracy: 81.8%

Minibatch loss at step 750: 0.213779
Minibatch accuracy: 100.0%
Validation accuracy: 82.9%

Minibatch loss at step 800: 0.795744
Minibatch accuracy: 62.5%
Validation accuracy: 83.7%

Minibatch loss at step 850: 0.767435
Minibatch accuracy: 87.5%
Validation accuracy: 81.7%

Minibatch loss at step 900: 0.354712
Minibatch accuracy: 87.5%
Validation accuracy: 83.8%

Minibatch loss at step 950: 0.293992
Minibatch accuracy: 93.8%
Validation accuracy: 84.3%

Minibatch loss at step 1000: 0.384624
Minibatch accuracy: 87.5%
Validation accuracy: 84.2%

Test accuracy: 90.5%

f9.png

As can be seen from the above results, with MAX POOLING, the test accuracy increased slightly.

The following figures visualize the feature representations at different layers for the first 16 images during training with Max Pooling:

f10f11f12f13

So far the convnets we have tried are quite small and we did not obtain high enough accuracy on the test dataset. Next we shall make our convnet deeper to increase the test accuracy.

Deep Convolution Neural Network with Max Pooling 

Let’s try to get the best performance we can using a convolutional net. Look for example at the classic LeNet5 architecture, adding Dropout, and/or adding learning rate decay.

Let’s try with a few convnets:

1. The following ConvNet uses:

  • 2 convolution layers (with Relu) each using
    • 3×3 kernel
    • 16 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.01
  • single hidden layer (fully connected) with 1024 hidden nodes
  • 128 batch size
  • 3K iterations
  • 0.01 (=λ1=λ2) for regularization
  • No dropout
  • No learning decay

Results

Minibatch loss at step 0: 2.662903
Minibatch accuracy: 7.8%
Validation accuracy: 10.0%

Minibatch loss at step 500: 2.493813
Minibatch accuracy: 11.7%
Validation accuracy: 10.0%

Minibatch loss at step 1000: 0.848911
Minibatch accuracy: 82.8%
Validation accuracy: 79.6%

Minibatch loss at step 1500: 0.806191
Minibatch accuracy: 79.7%
Validation accuracy: 81.8%

Minibatch loss at step 2000: 0.617905
Minibatch accuracy: 85.9%
Validation accuracy: 84.5%

Minibatch loss at step 2500: 0.594710
Minibatch accuracy: 83.6%
Validation accuracy: 85.7%

Minibatch loss at step 3000: 0.435352
Minibatch accuracy: 91.4%
Validation accuracy: 87.2%

Test accuracy: 93.4%

 a1.png


As we can see, by introducing a couple of convolution layers, the accuracy increased from 90% (refer to the earlier blog) to 93.4% under the same settings.

Here is how the hidden layer weights (400 out of 1024 chosen randomly) change, although the features don’t clearly resemble the alphabets anymore, which is quite expected.

a1.gif

2. The following ConvNet uses:

  • 2 convolution layers (with Relu) each using
    • 3×3 kernel
    • 32 filters
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.1
  • hidden layers (fully connected) both with 256 hidden nodes
  • 128 batch size
  • 6K iterations
  • 0.7 dropout
  • learning decay starting with 0.1

Results

Minibatch loss at step 0: 9.452210
Minibatch accuracy: 10.2%
Validation accuracy: 9.7%
Minibatch loss at step 500: 0.611396
Minibatch accuracy: 81.2%
Validation accuracy: 81.2%
Minibatch loss at step 1000: 0.442578
Minibatch accuracy: 85.9%
Validation accuracy: 83.3%
Minibatch loss at step 1500: 0.523506
Minibatch accuracy: 83.6%
Validation accuracy: 84.8%
Minibatch loss at step 2000: 0.411259
Minibatch accuracy: 89.8%
Validation accuracy: 85.8%
Minibatch loss at step 2500: 0.507267
Minibatch accuracy: 82.8%
Validation accuracy: 85.9%
Minibatch loss at step 3000: 0.414740
Minibatch accuracy: 89.1%
Validation accuracy: 86.6%
Minibatch loss at step 3500: 0.432177
Minibatch accuracy: 85.2%
Validation accuracy: 87.0%
Minibatch loss at step 4000: 0.501300
Minibatch accuracy: 85.2%
Validation accuracy: 87.1%
Minibatch loss at step 4500: 0.391587
Minibatch accuracy: 89.8%
Validation accuracy: 87.7%
Minibatch loss at step 5000: 0.347674
Minibatch accuracy: 90.6%
Validation accuracy: 88.1%
Minibatch loss at step 5500: 0.259942
Minibatch accuracy: 91.4%
Validation accuracy: 87.8%
Minibatch loss at step 6000: 0.392562
Minibatch accuracy: 85.9%
Validation accuracy: 88.4%

Test accuracy: 94.6%

p1

 

3. The following ConvNet uses:

  • 3 convolution layers (with Relu) each using
    • 5×5 kernel
    • with 16, 32 and 64 filters, respectively
    • 1×1 stride
    • 2×2 Max-pooling 
    • SAME padding
  • all weights initialized with truncated normal distribution with sd 0.1
  • hidden layers (fully connected) with 256, 128 and 64 hidden nodes respectively
  • 128 batch size
  • 10K iterations
  • 0.7 dropout
  • learning decay starting with 0.1

Results

Minibatch loss at step 0: 6.788681
Minibatch accuracy: 12.5%
Validation accuracy: 9.8%
Minibatch loss at step 500: 0.804718
Minibatch accuracy: 75.8%
Validation accuracy: 74.9%
Minibatch loss at step 1000: 0.464696
Minibatch accuracy: 86.7%
Validation accuracy: 82.8%
Minibatch loss at step 1500: 0.684611
Minibatch accuracy: 80.5%
Validation accuracy: 85.2%
Minibatch loss at step 2000: 0.352865
Minibatch accuracy: 91.4%
Validation accuracy: 85.9%
Minibatch loss at step 2500: 0.505062
Minibatch accuracy: 84.4%
Validation accuracy: 87.3%
Minibatch loss at step 3000: 0.352783
Minibatch accuracy: 87.5%
Validation accuracy: 87.0%
Minibatch loss at step 3500: 0.411505
Minibatch accuracy: 88.3%
Validation accuracy: 87.9%
Minibatch loss at step 4000: 0.457463
Minibatch accuracy: 84.4%
Validation accuracy: 88.1%
Minibatch loss at step 4500: 0.369346
Minibatch accuracy: 89.8%
Validation accuracy: 88.7%
Minibatch loss at step 5000: 0.323142
Minibatch accuracy: 89.8%
Validation accuracy: 88.5%
Minibatch loss at step 5500: 0.245018
Minibatch accuracy: 93.8%
Validation accuracy: 89.0%
Minibatch loss at step 6000: 0.480509
Minibatch accuracy: 85.9%
Validation accuracy: 89.2%
Minibatch loss at step 6500: 0.297886
Minibatch accuracy: 92.2%
Validation accuracy: 89.3%
Minibatch loss at step 7000: 0.309768
Minibatch accuracy: 90.6%
Validation accuracy: 89.3%
Minibatch loss at step 7500: 0.280219
Minibatch accuracy: 92.2%
Validation accuracy: 89.5%
Minibatch loss at step 8000: 0.260540
Minibatch accuracy: 93.8%
Validation accuracy: 89.7%
Minibatch loss at step 8500: 0.345161
Minibatch accuracy: 88.3%
Validation accuracy: 89.6%
Minibatch loss at step 9000: 0.343074
Minibatch accuracy: 87.5%
Validation accuracy: 89.8%
Minibatch loss at step 9500: 0.324757
Minibatch accuracy: 92.2%
Validation accuracy: 89.9%
Minibatch loss at step 10000: 0.513597
Minibatch accuracy: 83.6%
Validation accuracy: 90.0%

Test accuracy: 95.5%

To be continued…

Deep Learning with TensorFlow in Python

The following problems appeared in the first few assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments.

Classifying the letters with notMNIST dataset

Let’s first learn about simple data curation practices, and familiarize ourselves with some of the data that are going to be used for deep learning with tensorflow. The notMNIST dataset is to be used for the python experiments. This dataset is designed to look like the classic MNIST dataset, while looking a little more like real data: it’s a harder task, and the data is a lot less ‘clean’ than MNIST.

Preprocessing

First the dataset needs to be downloaded and extracted to a local machine. The data consists of characters rendered in a variety of fonts on a 28×28 image. The labels are limited to ‘A’ through ‘J’ (10 classes). The training set has about 500k labelled examples and the test set about 19000. Given these sizes, it should be possible to train models quickly on any machine.

Let’s take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images downloaded.

im1.pngim2im3im4im5im6im7im8im9im10

Now let’s load the data in a more manageable format. Let’s convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road. A few images might not be readable, we’ll just skip them. Also the data is expected to be balanced across classes. Let’s verify that. The following output shows the dimension of the ndarray for each class.

(52909, 28, 28)
(52911, 28, 28)
(52912, 28, 28)
(52911, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52911, 28, 28)

Now some more preprocessing is needed:

  1. First the training data (for different classes) needs to be merged and pruned.
  2. The labels will be stored into a separate array of integers from 0 to 9.
  3. Also let’s create a validation dataset for hyperparameter tuning.
  4. Finally the data needs to be randomized. It’s important to have the labels well shuffled for the training and test distributions to match.

After preprocessing, let’s peek a few samples from the training dataset and the next figure shows how it looks.

im11.png

Here is how the validation dataset looks (a few samples chosen).

im12.png

Here is how the test dataset looks (a few samples chosen).

im13.png

Trying a few off-the-shelf classifiers

Let’s get an idea of what an off-the-shelf classifier can give us on this data. It’s always good to check that there is something to learn, and that it’s a problem that is not so trivial that a canned solution solves it. Let’s first train a simple LogisticRegression model from sklearn (using default parameters) on this data using 5000 training samples.

The following figure shows the output from the logistic regression model trained, its accuracy on the test dataset (also the confusion matrix) and a few test instances classified wrongly (predicted labels along with the true labels) by the model.

im14.png

 

Deep Learning with TensorFlow

Now let’s progressively train deeper and more accurate models using TensorFlow. Again, we need to do the following preprocessing:

  1. Reformat into a shape that’s more adapted to the models we’re going to train: data as a flat matrix,
  2. labels as 1-hot encodings.

Now let’s train a multinomial logistic regression using simple stochastic gradient descent.

TensorFlow works like this:

First we need to describe the computation that is to be performed: what the inputs, the variables, and the operations look like. These get created as nodes over a computation graph.

Then we can run the operations on this graph as many times as we want by calling session.run(), providing it outputs to fetch from the graph that get returned.

  1. Let’s load all the data into TensorFlow and build the computation graph corresponding to our training.
  2. Let’s use stochastic gradient descent training (with ~3k steps), which is much faster. The graph will be similar to batch gradient descent, except that instead of holding all the training data into a constant node, we create a Placeholder node which will be fed actual data at every call of session.run().
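A minimal sketch of such a graph and training loop; the variable names, the 784-feature / 10-class dimensions, and the preprocessed arrays train_dataset and train_labels are assumptions consistent with the dataset described above:

batch_size = 128

graph = tf.Graph()
with graph.as_default():
    # placeholders fed with a new mini-batch at every call of session.run()
    tf_train = tf.placeholder(tf.float32, shape=(batch_size, 784))
    tf_labels = tf.placeholder(tf.float32, shape=(batch_size, 10))

    weights = tf.Variable(tf.truncated_normal([784, 10]))
    biases = tf.Variable(tf.zeros([10]))

    logits = tf.matmul(tf_train, weights) + biases
    loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_labels, logits=logits))
    optimizer = tf.train.GradientDescentOptimizer(0.5).minimize(loss)

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    for step in range(3001):
        offset = (step * batch_size) % (train_labels.shape[0] - batch_size)
        feed = {tf_train: train_dataset[offset:offset + batch_size, :],
                tf_labels: train_labels[offset:offset + batch_size, :]}
        _, l = session.run([optimizer, loss], feed_dict=feed)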

Logistic Regression with SGD

The following shows the fully connected computation graph and the results obtained.

nn_no_hidden.png

im14_5im15

Neural Network with SGD

Now let’s turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units nn.relu() and 1024 hidden nodes. As can be seen from the below results, this model improves the validation / test accuracy.

nn_hidden.png

im16im17

Regularization with Tensorflow

Previously we trained a logistic regression and a neural network model with Tensorflow. Let’s now explore the regularization techniques.

Let’s introduce and tune L2 regularization for both the logistic and the neural network models. L2 regularization amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, we can compute the L2 loss for a tensor t using nn.l2_loss(t).
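A minimal sketch of adding the penalty to the loss; the variable names and the value of λ below are assumptions:

beta = 0.005   # regularization strength (lambda)
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=tf_labels, logits=logits))
loss = loss + beta * tf.nn.l2_loss(weights)   # L2 penalty on the weight matrix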

The right amount of regularization improves the validation / test accuracy, as can be seen from the following results.

L2 Regularized Logistic Regression with SGD

The following figure recapitulates the simple network without any hidden layer, with softmax outputs.

nn_no_hidden.png

The next figures visualize the weights learnt for the 10 output neurons at different steps using SGD and L2 regularized loss function (with λ=0.005). As can be seen below, the weights learnt are gradually capturing the different features of the letters at the corresponding output neurons.

im18.png

As can be seen, the test accuracy also gets improved to 88.3% with L2 regularized loss function (with λ=0.005).

im19.png

The following results show the accuracy and the weights learnt for a couple of different values of λ (0.01 and 0.05 respectively). As can be seen, with higher values of λ, the logistic regression model tends to underfit and the test accuracy decreases.

 

im20im21


L2 Regularized Neural Network with SGD

The following figure recapitulates the neural network with a single hidden layer with 1024 nodes, with relu intermediate outputs. The L2 regularizations applied on the loss function for the weights learnt at the input and the hidden layers are λ1 and λ2, respectively.

 

nn_hidden.png

The next figures visualize the weights learnt for 225 randomly selected hidden neurons (out of 1024) at different steps using SGD and the L2 regularized loss function (with λ1=λ2=0.01). As can be seen below, the weights learnt are gradually capturing (as the SGD steps increase) the different features of the letters at the corresponding output neurons.
im22
im23.png
im24im26im27im28im29
a1.gif
im30
If the regularization parameters used are λ1=λ2=0.005, the test accuracy increases to 91.1%, as shown below.
im25.png

Overfitting

Let’s demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

Let’s restrict the training data to n=5 batches. The following figures show how this increases the training accuracy to 100% (along with a visualization of a few randomly-picked input weights learnt) and decreases the validation and the test accuracy, since the model does not generalize well enough.

 

im31im32

im33im34im35im36im37

im38

Dropouts

Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise our evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it’s only inserted during training.
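A minimal sketch using a keep-probability placeholder so that dropout is active only during training; the variable names below are assumptions:

keep_prob = tf.placeholder(tf.float32)

hidden = tf.nn.relu(tf.matmul(tf_train, weights1) + biases1)
hidden = tf.nn.dropout(hidden, keep_prob)          # dropout on the hidden layer only
logits = tf.matmul(hidden, weights2) + biases2

# training step: keep 60% of the hidden activations (dropout rate 0.4)
# session.run(optimizer, feed_dict={..., keep_prob: 0.6})
# evaluation: keep everything
# session.run(prediction, feed_dict={..., keep_prob: 1.0})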

What happens to our extreme overfitting case?

As we can see from the below results, introducing a dropout rate of 0.4 increases the validation and test accuracy by reducing the overfitting.

 

im39im40

im41im42im43im44im45
im46

Till this point the highest accuracy on the test dataset using a single hidden-layer neural network is 91.1%. More hidden layers and/or other techniques (e.g., exponential decay of the learning rate) can be used to improve the accuracy obtained (to be continued…).