Solving Some Image Processing and Computer Vision Problems with Python libraries

In this article, a few image processing / computer vision problems and their solutions with python libraries (scikit-image, cv2) will be discussed. Some of the problems are taken from the exercises in this book (available on Amazon). This blog will be continued here.

Removing Gaussian Noise from images by computing mean and median images with scikit-image

  1. Start with an input image.
  2. Create n (e.g., n = 100) noisy images by adding i.i.d. Gaussian noise (with zero mean) to the original image, using scikit-image.
  3. Compute the mean (median) of the noisy images.
  4. Compare PSNR with the original image.
  5. Vary n and compare the results.
from skimage import img_as_float
from skimage.util import random_noise
from skimage.measure import compare_psnr
from skimage.io import imread
import matplotlib.pylab as plt
import numpy as np

im = img_as_float(imread('../new images/parrot.jpg')) # original image
np.random.seed(0)
# generate n noisy images from the original image by adding Gaussian noise
n = 25
images = np.zeros((n, im.shape[0], im.shape[1], im.shape[2]))
sigma = 0.2
for i in range(n):
    images[i,...] = random_noise(im, var=sigma**2)

im_mean = images.mean(axis=0)
im_median = np.median(images, axis=0)
plt.figure(figsize=(20,16))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
plt.subplot(221), plt.imshow(im), plt.axis('off'), plt.title('Original image', size=20)
plt.subplot(222), plt.imshow(images[0]), plt.axis('off'), plt.title('Noisy PSNR: ' + str(compare_psnr(im, images[0])), size=20)
plt.subplot(223), plt.imshow(im_mean), plt.axis('off'), plt.title('Mean PSNR: ' + str(compare_psnr(im, im_mean)), size=20)
plt.subplot(224), plt.imshow(im_median), plt.axis('off'), plt.title('Median PSNR: ' + str(compare_psnr(im, im_median)), size=20)
plt.show()

The next figure shows the original image, one noisy image generated from it by adding zero-mean Gaussian noise, and the images obtained by taking the mean / median over all n noisy images. As can be seen, the Gaussian noise largely cancels out when the mean / median is taken.

with n = 25

p25.png

with n=100
p100.png


plt.hist(images[:,100,100,0], color='red', alpha=0.2, label='red')
plt.hist(images[:,100,100,1], color='green', alpha=0.2, label='green')
plt.hist(images[:,100,100,2], color='blue', alpha=0.2, label='blue')
plt.legend()
plt.grid()
plt.show()

The next figure shows how the value of a pixel at a particular location (which can be considered a random variable) across the different noisy images approximately follows a Gaussian distribution.

Distribution of a pixel value at location (100,100) in the noisy images

pg.png


ns = [25, 50, 100, 200]
# mean_psnrs / median_psnrs contain the PSNR values for different n
mean_psnrs, median_psnrs = [], []
for n in ns:
    ims = np.array([random_noise(im, var=sigma**2) for _ in range(n)])
    mean_psnrs.append(compare_psnr(im, ims.mean(axis=0)))
    median_psnrs.append(compare_psnr(im, np.median(ims, axis=0)))
plt.plot(ns, mean_psnrs, '.--', label='PSNR (mean)')
plt.plot(ns, median_psnrs, '.--', label='PSNR (median)')
plt.legend()
plt.xlabel('n'), plt.ylabel('PSNR')
plt.show()

The following figure shows that the PSNR improves with larger n (since, by the SLLN / WLLN, the sample mean converges to the population mean, which is 0 for the Gaussian noise). Also, the improvement in image quality with increasing n is more pronounced for the median.

psnr

Tracking Pedestrians with HOG-SVM with OpenCV / scikit-image

  1. Start with a video with pedestrians.
  2. Capture the video / extract frames from the video.
  3. For each frame
    1. Create a scale pyramid of the frame image.
    2. At each scale, use a sliding window to extract the corresponding block from the frame and compute the HOG descriptor features.
    3. Use cv2's HOGDescriptor_getDefaultPeopleDetector(), an SVM classifier pre-trained on HOG descriptors, to classify whether or not the corresponding block contains a pedestrian.
    4. Run non-max suppression to get rid of multiple detections of the same person.
    5. cv2's detectMultiScale() function implements the scale-pyramid / sliding-window / classification steps above; non-max suppression is then run on the returned boxes.

The code is adapted from the code here and here.


import cv2
import numpy as np
from imutils.object_detection import non_max_suppression

# HOG descriptor using the default people (pedestrian) detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# run detection on a frame, using a spatial stride of 4 pixels,
# a scale stride of 1.02, and zero grouping of rectangles
# (to demonstrate that HOG may detect the same person at
# multiple places in the scale pyramid)
(foundBoundingBoxes, weights) = hog.detectMultiScale(frame, winStride=(4, 4), padding=(8, 8), scale=1.02, finalThreshold=0, useMeanshiftGrouping=False)

# convert bounding boxes from format (x1, y1, w, h) to (x1, y1, x2, y2)
rects = np.array([[x, y, x + w, y + h] for (x, y, w, h) in foundBoundingBoxes])

# run non-max suppression on the boxes with an overlap threshold of 65%
nmsBoundingBoxes = non_max_suppression(rects, probs=None, overlapThresh=0.65)

cv2 functions are used to extract the HOG descriptor features and to detect pedestrians with the SVM, whereas scikit-image functions are used to visualize the HOG features. The animations below display the original video, what HOG sees and the detected pedestrians after non-max suppression. Notice that there are a few false positive detections.
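The per-frame HOG visualization (the "what HOG sees" animation) can be generated with scikit-image; a minimal sketch is shown below. The cell / block parameters here are typical defaults, not necessarily the ones used to produce the animation, and older scikit-image versions spell the keyword visualise instead of visualize.

from skimage.feature import hog
import matplotlib.pylab as plt

# compute the HOG descriptor of a (grayscale) frame and its visualization image
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
fd, hog_image = hog(gray, orientations=9, pixels_per_cell=(8, 8),
                    cells_per_block=(2, 2), visualize=True)
plt.imshow(hog_image, cmap='gray'), plt.axis('off'), plt.show()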

Original Video
ped1

HOG-descriptor features video (what HOG sees)
pedh1

Original Video with detected Pedestrians
ped1o

Face Detection with HaarCascade pre-trained AdaBoost classifiers with OpenCV

  1. Capture video with webcam with cv2.VideoCapture().
  2. For each frame, use the pre-trained AdaBoost cascade classifiers (the haarcascade_frontalface_default classifier for face detection and the haarcascade_eye_tree_eyeglasses classifier for better detection of eyes behind glasses, loaded from the corresponding xml files that come with cv2's installation), which are based on Haar-like features, with cv2.CascadeClassifier().
  3. First detect the face(s) with the detectMultiScale() function and draw a bounding box. Then detect the eyes inside a detected face with the same function.
  4. The following python code snippet shows how to detect faces and eyes with cv2. The code is adapted from here.

 


import cv2

# read the cascade classifiers from the xml files (shipped with cv2)
# into face_cascade and eye_cascade -- adjust the paths if needed
face_cascade = cv2.CascadeClassifier('haarcascade_frontalface_default.xml')
eye_cascade = cv2.CascadeClassifier('haarcascade_eye_tree_eyeglasses.xml')

gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# return the bounding box(es) of the face(s), if any are detected
faces = face_cascade.detectMultiScale(gray, 1.03, 5)
for (x,y,w,h) in faces:
   frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2)
   roi_gray = gray[y:y+h, x:x+w]
   roi_color = frame[y:y+h, x:x+w]
   # detect the eyes inside the detected face region
   eyes = eye_cascade.detectMultiScale(roi_gray)
   for (ex,ey,ew,eh) in eyes:
      cv2.rectangle(roi_color,(ex,ey),(ex+ew,ey+eh),(0,255,0),2)
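The frames themselves can be grabbed from the webcam with cv2.VideoCapture(), as mentioned in step 1; a minimal capture loop might look like the sketch below (the window name and the quit key are arbitrary choices of mine).

cap = cv2.VideoCapture(0)                    # open the default webcam
while True:
    ret, frame = cap.read()                  # grab the next (BGR) frame
    if not ret:
        break
    # ... run the face / eye detection snippet above on `frame` here ...
    # (convert back to BGR with cv2.cvtColor before imshow if the snippet left it in RGB)
    cv2.imshow('faces', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):    # press 'q' to quit
        break
cap.release()
cv2.destroyAllWindows()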

The next animation shows the results of face detection when a scale factor of 1.03 was used to create the scale pyramid. As can be seen, the eyes behind the glasses and some small faces in the photos are not detected at this scale.

fcr.gif

The next animation shows the results of face detection when a scale factor of 1.3 was used to create the scale pyramid. As can be seen, the eyes (with or without glasses) as well as most of the small faces in the photos are detected most of the time at this scale.

me

kiff

Object Tracking with OpenCV trackers

ftfttmilfttcrt

Object Saliency Detection with OpenCV

saliency

Linear / QR Barcode Generation and Detection with OpenCV

me_barcode

me_barcode_detected

bard

Image Segmentation with Random Walk with scikit-image

rw_segmentation1.png

 

Image Segmentation with Grab-Cut with OpenCV

gr_whale.png

Segmentation with SLIC + RAG in scikit-image

me_seg

Homography with scikit-image / OpenCV

books

me10.jpg
homo1

shutterstock

me12.jpg

me_shutter

 

boimela

 

 

homo.gif

Face Morphing (Beier-Neely morphing) with Pystasm

me14     sheldon

fk.png

ad

Object Detection with YOLO DarkNet / Keras / OpenCV (Deep Learning model)

yolo_me

Semantic Segmentation with ENet / DeepLab (Deep Learning  model)

Input video and the segmented Output video
result

Input video and the segmented Output video
cycle

Neural Style Transfer with OpenCV / Torch (Deep Learning Model)

Input & Output with style (Starry night)
me_style

Text Detection with EAST (Deep Learning Model)

st

result

 

Image Colorization with Deep Learning (OpenCV / Caffe)

pcolor

result

OCR + Text Recognition with EAST + Tesseract

tes

 


Detection of a Human Object with HOG Descriptor Features using SVM (Primal QuadProg implementation using CVXOPT) in Python

In this article, we will first discuss how to extract the HOG descriptor features from an image. Then we will describe how a support vector machine (SVM) binary classifier can be trained on a dataset of labeled images (using the extracted HOG descriptor features), and how the trained SVM model can later be used (along with a sliding window) to predict whether or not a test image contains a human. How the SVM can be represented as a primal quadratic programming problem and solved with CVXOPT will also be discussed. This problem appeared as an assignment problem in this Computer Vision course from UCF.

Problem 1: Compute HOG features

Let's first implement Histogram of Oriented Gradients (HOG). The dataset to be used is the INRIA Person Dataset from here. The dataset consists of positive and negative examples for training as well as testing. Let us do the following:

i. Take 2003 positive training images of size 96×160
ii. Take 997 negative training images of size 96×160
iii. Compute HOG for positive and negative examples.
iv. Show the visualization of HOG for some positive and negative examples.

Histograms of Oriented Gradients for Human Detection (HOG) is a very heavily cited paper by N. Dalal and B. Triggs from CVPR 2005. The following figure shows how the algorithm they proposed can be used to compute the HOG features for a 96×160 image:

f1.png

The next python code snippet shows some helper functions to compute the HOG features:


import numpy as np
from scipy import signal

def s_x(img):
    # horizontal gradient: convolve with the [-1, 0, 1] kernel
    kernel = np.array([[-1, 0, 1]])
    imgx = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgx

def s_y(img):
    # vertical gradient: convolve with the [-1, 0, 1] kernel transposed
    kernel = np.array([[-1, 0, 1]]).T
    imgy = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgy

def grad(img):
    # gradient magnitude and unsigned orientation in [0, pi)
    imgx = s_x(img)
    imgy = s_y(img)
    s = np.sqrt(imgx**2 + imgy**2)
    theta = np.arctan2(imgy, imgx)
    theta[theta < 0] += np.pi
    return (s, theta)
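The gradient magnitudes and orientations are then binned into per-cell orientation histograms. The sketch below (continuing with the imports above) uses the standard Dalal-Triggs choices of 8×8-pixel cells and 9 orientation bins over [0, π); it is deliberately un-normalized, i.e. the block normalization step of the full pipeline is omitted, and the parameter values are assumptions rather than the exact ones used for the figures.

def cell_histograms(mag, theta, cell_size=8, n_bins=9):
    """Bin gradient magnitudes into per-cell orientation histograms.
    mag, theta -- outputs of grad(); returns an array of shape
    (H // cell_size, W // cell_size, n_bins)."""
    H, W = mag.shape
    n_cells_y, n_cells_x = H // cell_size, W // cell_size
    hist = np.zeros((n_cells_y, n_cells_x, n_bins))
    bin_width = np.pi / n_bins
    for i in range(n_cells_y):
        for j in range(n_cells_x):
            m = mag[i*cell_size:(i+1)*cell_size, j*cell_size:(j+1)*cell_size]
            t = theta[i*cell_size:(i+1)*cell_size, j*cell_size:(j+1)*cell_size]
            # orientation bin index of each pixel (theta == pi falls into the last bin)
            bins = np.minimum((t // bin_width).astype(int), n_bins - 1)
            for b in range(n_bins):
                hist[i, j, b] = m[bins == b].sum()  # magnitude-weighted votes
    return hist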

 

The following figures and animations show some positive and negative training examples along with the HOG features computed using the above algorithm.

 

Positive Example 1

grad_phog_p
The next animation shows how the HOG features are computed using the above algorithm.

p.gif

Positive Example 2

grad_p1
hog_p1

The next animation shows how the HOG features are computed using the above algorithm.

p1

p1_.gif

Positive Example 3

grad_me
hog_me

The next animation shows how the HOG features are computed using the above algorithm.
me

Negative Example 1

grad_c.png
hog_c.png

The next animation shows how the HOG features are computed using the above algorithm.

c

Problem 2: Use sklearn's SVC and 80-20 validation to compute the accuracy on the held-out images using the extracted HOG features.

Before implementing the SVM on our own with a primal quadratic programming solver, let's use the scikit-learn SVC implementation (with a linear kernel) to train a support vector classifier on the positive and negative training examples, using the HOG features extracted from the training images, with an 80-20 train / validation split, and compute the classification accuracy on the held-out images.

The following python code does exactly that, with the X matrix containing the 1620 HOG features extracted from each image and y containing the corresponding labels (pos / neg, depending on whether or not a human is present). It achieves 98.5% accuracy on the held-out dataset.


import time
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=123)
timestamp1 = time.time()
clf = SVC(C=1, kernel='linear')
clf.fit(Xtrain, ytrain)
print("%d support vectors out of %d points" % (len(clf.support_vectors_), len(Xtrain)))
timestamp2 = time.time()
print "sklearn LinearSVC took %.2f seconds" % (timestamp2 - timestamp1)
ypred = clf.predict(Xtest)
print('accuracy', accuracy_score(ytest, ypred))

430 support vectors out of 2400 points
sklearn LinearSVC took 3.40 seconds
accuracy 0.985

The next figures show the confusion matrices for the prediction on the held-out dataset with the SVC model learnt.

f2

f3

 

Problem 3: Implement SVM by solving the Primal form of the problem using Quadratic Programming

Let's implement a Support Vector Machine (SVM) using Quadratic Programming. We shall use python's CVXOPT package for this purpose. Let's do the following:

i. Try to understand each input term of cvxopt.solvers.qp.
ii. Formulate the soft-margin primal SVM in terms of the inputs of cvxopt.solvers.qp.
iii. Show the 'P', 'q', 'G', 'h', 'A' and 'b' matrices.
iv. Obtain the parameter vector 'w' and the bias term 'b' using cvxopt.solvers.qp.

 

To be done
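Since this part is left to be done, the following is a rough sketch of my own (not the assignment's reference formulation) of how the soft-margin primal SVM could be set up for cvxopt.solvers.qp, with the optimization variable ordered as z = [w, b, ξ], labels in {-1, +1} and a user-chosen penalty C.

import numpy as np
from cvxopt import matrix, solvers

def svm_primal_qp(X, y, C=1.0):
    # X: (n, d) feature matrix, y: (n,) labels in {-1, +1}
    n, d = X.shape
    m = d + 1 + n                                 # variable z = [w (d), b (1), xi (n)]
    P = np.zeros((m, m))
    P[:d, :d] = np.eye(d)                         # (1/2) w'w term of the objective
    P += 1e-8 * np.eye(m)                         # tiny ridge to keep the KKT system non-singular
    q = np.hstack([np.zeros(d + 1), C * np.ones(n)])   # C * sum(xi)
    # constraints y_i (w.x_i + b) >= 1 - xi_i and xi_i >= 0, written as G z <= h
    G = np.zeros((2 * n, m))
    h = np.zeros(2 * n)
    G[:n, :d] = -y[:, None] * X                   # -y_i x_i . w
    G[:n, d] = -y                                 # -y_i b
    G[:n, d + 1:] = -np.eye(n)                    # -xi_i
    h[:n] = -1.0
    G[n:, d + 1:] = -np.eye(n)                    # -xi_i <= 0
    sol = solvers.qp(matrix(P), matrix(q), matrix(G), matrix(h))  # no equality constraints (A, b)
    z = np.ravel(sol['x'])
    return z[:d], z[d]                            # (w, b)

Note that this dense formulation is fine for a moderate number of features and training points, but grows quickly with both.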

 

Problem 4: Detect Human in testing images using trained model (‘w’, ‘b’) from the last problem

Let’s use the coefficients learnt by the SVM model from the training dataset and do the following:

i. Take at least 5 testing images from Test/pos.
ii. Test the trained model on the testing images; a window is scored with w · (feature vector) + b.
iii. Use a sliding window approach to obtain detections at each location in the image.
iv. Perform non-maximal suppression and choose the highest-scored location.
v. Display the bounding box at the final detection.

 

To be done
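Since this part is also left to be done, the sketch below shows one way the learnt (w, b) could be used with a sliding window. The helper compute_hog() (returning the 1620-dimensional HOG feature vector of a window), the window size and the stride are my own illustrative assumptions, not part of the assignment code.

import numpy as np

def detect_humans(image, w, b, compute_hog, win_size=(160, 96), step=16, thresh=0.0):
    """Slide a window over the image and score each window with the linear SVM (w, b)."""
    H, W = image.shape[:2]
    win_h, win_w = win_size
    detections = []
    for y in range(0, H - win_h + 1, step):
        for x in range(0, W - win_w + 1, step):
            feat = compute_hog(image[y:y + win_h, x:x + win_w])  # assumed HOG helper
            score = np.dot(w, feat) + b                          # SVM decision value
            if score > thresh:
                detections.append((x, y, x + win_w, y + win_h, score))
    # non-max suppression (e.g. imutils.object_detection.non_max_suppression)
    # would then keep only the highest-scored overlapping boxes
    return detections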

 

 

 

Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the Coursera course Convolutional Neural Networks, which is part of the Deep Learning Specialization (taught by Prof. Andrew Ng of Stanford and deeplearning.ai; the lecture videos corresponding to the YOLO algorithm can be found here). The problem description is taken straight from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

Some Theory

Let’s first clear the concepts regarding classification, localization, detection and how the object detection problem can be transformed to supervised machine learning problem and subsequently can be solved using a deep convolution neural network. As can be seen from the next figure,

  • Image classification with localization aims to find the location of an object in an image, by not only classifying the image (e.g., a binary classification problem: whether there is a car in the image or not), but also finding a bounding box around the object, if one is found.
  • Detection goes a level further by aiming to identify multiple instances of the same / different types of objects, marking their locations (the localization problem usually tries to find a single object location).
  • The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to the image size) of the object.
  • A typical output vector will contain 8 entries for a 4-class classification, as shown in the next figure: the first entry indicates whether or not an object from any of the 3 object classes is present in the image. If one is present, the next 4 entries define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. If none of the objects is present, the first entry will be 0 and the rest will be ignored.

 

f1.png

  • Now, moving from localization to detection, one can proceed in two steps, as shown in the next figure: first use small, tightly cropped images to train a convolutional neural network for image classification, and then slide windows of different sizes (smaller to larger) over a test image, classifying the content of each window with the learnt convnet. Running the windows sequentially over the entire image is, however, infeasibly slow computationally.
  • As shown in the next figure, the convolutional implementation of sliding windows (replacing the fully-connected layers with 1×1 filters) makes it possible to classify the image subsets inside all possible sliding windows simultaneously and in parallel, making it much more efficient computationally.

 

f2.png

  • The convolutional sliding-window approach, although computationally much more efficient, still has the problem of detecting accurate bounding boxes, since the boxes don't align with the sliding windows and the object shapes also tend to differ.
  • The YOLO algorithm overcomes this limitation by dividing a training image into grid cells and assigning an object to a cell if and only if the center of the object falls inside the cell. That way each object in a training image gets assigned to exactly one cell, and the corresponding bounding box is represented by coordinates relative to the cell. The next figure describes the details of the algorithm.
  • In the test images, multiple adjacent cells may each claim that an object belongs to them. To resolve this, the IoU (intersection over union) measure is used to find the maximum overlap, and the non-max-suppression algorithm discards all competing bounding boxes with a lower confidence of containing the object, keeping only the one with the highest confidence.
  • Still, multiple objects may fall in the same cell. Multiple anchor boxes (of different shapes) are used to resolve this problem, each anchor box of a particular shape being likely to eventually detect an object of a similar shape.

 

f3.png

The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection at CVPR 2016, summarizing the algorithm:

yolo_cvpr.png

Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank drive.ai for providing this dataset! Drive.ai is a company building the brains of self-driving vehicles.

driveai.png

We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

Definition of a box
box_label.png

 

If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.

 

YOLO

YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

Model details

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3).
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO

architecture.png

If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two dimensions

flatten.png

 

Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

Find the class detected by each box

probability_extraction.png

Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:

proba_map.png

Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:

anchor_map.png

Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
  • Select only one box when several boxes overlap with each other and detect the same object.

 

Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

  • box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
  • boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
  • box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

  • Compute box scores by doing the element-wise product as described in the above figure.
  • For each box, find:
    • the index of the class with the maximum box score.
    • the corresponding box score.
  • Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
  • Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
    We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.io
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body

 


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
 """Filters YOLO boxes by thresholding on object and class confidence.

 Arguments:
 box_confidence -- tensor of shape (19, 19, 5, 1)
 boxes -- tensor of shape (19, 19, 5, 4)
 box_class_probs -- tensor of shape (19, 19, 5, 80)
 threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box

 Returns:
 scores -- tensor of shape (None,), class probability score for each selected box
 boxes -- tensor of shape (None, 4), (bx, by, bh, bw) coordinates of each selected box
 classes -- tensor of shape (None,), index of the class detected by each selected box
 """

 # Step 1: Compute box scores (element-wise product of box_confidence and box_class_probs)

 # Step 2: Find the box_classes (argmax of the box scores) and the corresponding box_class_scores (max)

 # Step 3: Create a filtering mask based on box_class_scores by using threshold
 # (the mask should be True for the boxes we want to keep)

 # Step 4: Apply the mask to scores, boxes and classes
 return scores, boxes, classes
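For reference, one possible way to fill in those steps with the Keras backend and TensorFlow is sketched below; this follows the exercise's hints and is my own sketch, not the official solution.

 # one possible body of yolo_filter_boxes(), following the step comments above:
 box_scores = box_confidence * box_class_probs               # Step 1: element-wise product, (19, 19, 5, 80)
 box_classes = K.argmax(box_scores, axis=-1)                 # Step 2: index of the best class per box
 box_class_scores = K.max(box_scores, axis=-1)               #         and the corresponding score
 filtering_mask = box_class_scores >= threshold              # Step 3: boolean mask of boxes to keep
 scores = tf.boolean_mask(box_class_scores, filtering_mask)  # Step 4: apply the mask
 boxes = tf.boolean_mask(boxes, filtering_mask)
 classes = tf.boolean_mask(box_classes, filtering_mask)
 return scores, boxes, classes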

 

Non-max suppression

Even after filtering by thresholding on the class scores, we still end up with a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).

non-max-suppression.png

In this example, the model has predicted 3 cars, but they are actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”

iou.png

 

Exercise: Implement iou(). Some hints:

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
  • We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    xi1 = maximum of the x1 coordinates of the two boxes
    yi1 = maximum of the y1 coordinates of the two boxes
    xi2 = minimum of the x2 coordinates of the two boxes
    yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.


def iou(box1, box2):
 """Implement the intersection over union (IoU) between box1 and box2

 Arguments:
 box1 -- first box, list object with coordinates (x1, y1, x2, y2)
 box2 -- second box, list object with coordinates (x1, y1, x2, y2)
 """

# Calculate the (y1, x1, y2, x2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou
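One possible way to fill in those comments in plain python, following the hints and the corner convention described above (again a sketch of mine, not the official solution):

 # one possible body of iou():
 xi1, yi1 = max(box1[0], box2[0]), max(box1[1], box2[1])   # intersection corners
 xi2, yi2 = min(box1[2], box2[2]), min(box1[3], box2[3])
 inter_area = max(xi2 - xi1, 0) * max(yi2 - yi1, 0)        # 0 if the boxes don't overlap
 box1_area = (box1[2] - box1[0]) * (box1[3] - box1[1])
 box2_area = (box2[2] - box2[0]) * (box2[3] - box2[1])
 union_area = box1_area + box2_area - inter_area           # Union(A,B) = A + B - Inter(A,B)
 iou = inter_area / union_area
 return iou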

 

We are now ready to implement non-max suppression. The key steps are:

  • Select the box that has the highest score.
  • Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  • Go back to step 1 and iterate until there are no more boxes with a lower score than the currently selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that can be used to implement non-max suppression, tf.image.non_max_suppression() and K.gather(), so we don't actually need to use our own iou() implementation:

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
 """
 Applies Non-max suppression (NMS) to set of boxes

 Arguments:
 scores -- tensor of shape (None,), output of yolo_filter_boxes()
 boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
 classes -- tensor of shape (None,), output of yolo_filter_boxes()
 max_boxes -- integer, maximum number of predicted boxes you'd like
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 Returns:
 scores -- tensor of shape (, None), predicted score for each box
 boxes -- tensor of shape (4, None), predicted box coordinates
 classes -- tensor of shape (, None), predicted class for each box

 Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
 function will transpose the shapes of scores, boxes, classes. This is made for convenience.
 """

 max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
 K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

 # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

 # Use K.gather() to select only nms_indices from scores, boxes and classes

 return scores, boxes, classes
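Those two calls could be filled in roughly as follows (a sketch; tf.image.non_max_suppression returns the indices of the boxes to keep and K.gather selects the corresponding entries):

 # inside yolo_non_max_suppression(), after max_boxes_tensor has been initialized:
 nms_indices = tf.image.non_max_suppression(boxes, scores, max_boxes_tensor,
                                            iou_threshold=iou_threshold)
 scores = K.gather(scores, nms_indices)
 boxes = K.gather(boxes, nms_indices)
 classes = K.gather(classes, nms_indices)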

 

Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval(), which takes the output of the YOLO encoding and filters the boxes using score thresholding and NMS. There's just one last implementation detail we have to know: there are a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO's network was trained to run on 608×608 images. If we are testing the model on images of a different size (for example, the car detection dataset has 720×1280 images), this step rescales the boxes so that they can be plotted on top of the original 720×1280 image.


def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
 """
 Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

 Arguments:
 yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
 box_confidence: tensor of shape (None, 19, 19, 5, 1)
 box_xy: tensor of shape (None, 19, 19, 5, 2)
 box_wh: tensor of shape (None, 19, 19, 5, 2)
 box_class_probs: tensor of shape (None, 19, 19, 5, 80)
 image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
 max_boxes -- integer, maximum number of predicted boxes you'd like
 score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 Returns:
 scores -- tensor of shape (None, ), predicted score for each box
 boxes -- tensor of shape (None, 4), predicted box coordinates
 classes -- tensor of shape (None,), predicted class for each box
 """

 # Retrieve outputs of the YOLO model

 # Convert boxes to be ready for filtering functions 

 # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

 # Scale boxes back to original image shape.

 # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold 

 return scores, boxes, classes
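Putting the pieces together, the body of yolo_eval() could look roughly like the sketch below, wiring up the functions above in the order described:

 # inside yolo_eval():
 box_confidence, box_xy, box_wh, box_class_probs = yolo_outputs   # retrieve the model outputs
 boxes = yolo_boxes_to_corners(box_xy, box_wh)                    # convert to corner coordinates
 scores, boxes, classes = yolo_filter_boxes(box_confidence, boxes, box_class_probs,
                                            threshold=score_threshold)   # score filtering
 boxes = scale_boxes(boxes, image_shape)                          # rescale to the original image size
 scores, boxes, classes = yolo_non_max_suppression(scores, boxes, classes,
                                                   max_boxes=max_boxes,
                                                   iou_threshold=iou_threshold)
 return scores, boxes, classes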

 

Summary for YOLO:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    • Each cell in a 19×19 grid over the input image gives 425 numbers.
    • 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in the lecture.
    • 85 = 5 + 80, where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and 80 is the number of classes we'd like to detect.
  • We then select only a few boxes based on:
    • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
    • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
  • This gives us YOLO’s final output.

 

Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to run our graph. Run the following cell.

sess = K.get_session()

Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes("coco_classes.txt")
anchors = read_anchors("yolo_anchors.txt")
image_shape = (720., 1280.)

 

Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model("yolo.h5")

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.

yolo_model.summary()

____________________________________________________________________________________________
Layer (type) Output Shape Param # Connected to
===========================================================================
input_1 (InputLayer) (None, 608, 608, 3) 0
____________________________________________________________________________________________
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
____________________________________________________________________________________________
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
____________________________________________________________________________________________
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
____________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
____________________________________________________________________________________________
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
____________________________________________________________________________________________
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
____________________________________________________________________________________________
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
____________________________________________________________________________________________
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
____________________________________________________________________________________________
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
____________________________________________________________________________________________
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
____________________________________________________________________________________________
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
____________________________________________________________________________________________
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
____________________________________________________________________________________________
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
____________________________________________________________________________________________
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
____________________________________________________________________________________________
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
____________________________________________________________________________________________
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
____________________________________________________________________________________________
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
____________________________________________________________________________________________
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
____________________________________________________________________________________________
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
____________________________________________________________________________________________
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
____________________________________________________________________________________________
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
____________________________________________________________________________________________
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
____________________________________________________________________________________________
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
____________________________________________________________________________________________
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
____________________________________________________________________________________________
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
____________________________________________________________________________________________
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
____________________________________________________________________________________________
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
____________________________________________________________________________________________
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
____________________________________________________________________________________________
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
____________________________________________________________________________________________
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
____________________________________________________________________________________________
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
____________________________________________________________________________________________
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
____________________________________________________________________________________________
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
____________________________________________________________________________________________
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
____________________________________________________________________________________________
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
____________________________________________________________________________________________
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
____________________________________________________________________________________________
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
____________________________________________________________________________________________
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
____________________________________________________________________________________________
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
____________________________________________________________________________________________
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
____________________________________________________________________________________________
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
____________________________________________________________________________________________
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
____________________________________________________________________________________________
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
____________________________________________________________________________________________
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
____________________________________________________________________________________________
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
____________________________________________________________________________________________
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
____________________________________________________________________________________________
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
____________________________________________________________________________________________
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
____________________________________________________________________________________________
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
____________________________________________________________________________________________
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
____________________________________________________________________________________________
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
____________________________________________________________________________________________
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
____________________________________________________________________________________________
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
____________________________________________________________________________________________
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
____________________________________________________________________________________________
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
____________________________________________________________________________________________
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
____________________________________________________________________________________________
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
____________________________________________________________________________________________
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
____________________________________________________________________________________________
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
____________________________________________________________________________________________
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
____________________________________________________________________________________________
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
____________________________________________________________________________________________
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
____________________________________________________________________________________________
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
____________________________________________________________________________________________
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
____________________________________________________________________________________________
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
____________________________________________________________________________________________
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
____________________________________________________________________________________________
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
____________________________________________________________________________________________
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
leaky_re_lu_20[0][0]
____________________________________________________________________________________________
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
____________________________________________________________________________________________
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
____________________________________________________________________________________________
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
____________________________________________________________________________________________
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
===========================================================================
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672
____________________________________________________________________________________________

model.png

Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We added yolo_outputs to our graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

Filtering boxes

yolo_outputs gives us all the predicted boxes of yolo_model in the correct format. We're now ready to perform filtering and select only the best boxes. Let's now call yolo_eval, which we implemented earlier, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Run the graph on an image

Let the fun begin. We have created a graph (sess) that can be summarized as follows:

  1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
  2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
  3. yolo_outputs goes through a filtering function, yolo_eval. It outputs our predictions: scores, boxes, classes

Exercise: Implement predict(), which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image("images/" + image_file, model_image_size = (608, 608))

which outputs:

  • image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
  • image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.


def predict(sess, image_file):
"""
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the predictions.

Arguments:
sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.

Returns:
out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes

Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.
"""

 # Preprocess your image

 # Run the session with the correct tensors and choose the correct placeholders in the
 # feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

 # Print predictions info
 print('Found {} boxes for {}'.format(len(out_boxes), image_file))
 # Generate colors for drawing bounding boxes
 colors = generate_colors(class_names)
 # Draw bounding boxes on the image file
 draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
 # Save the predicted bounding box on the image
 image.save(os.path.join("out", image_file), quality=90)
 # Display the results
 output_image = scipy.misc.imread(os.path.join("out", image_file))
 imshow(output_image)

 return out_scores, out_boxes, out_classes
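The two commented steps could be filled in along these lines (a sketch of mine; image_data is the preprocessed numpy array, and K.learning_phase(): 0 tells the BatchNorm layers to run in inference mode):

 # inside predict():
 image, image_data = preprocess_image("images/" + image_file, model_image_size=(608, 608))
 out_scores, out_boxes, out_classes = sess.run([scores, boxes, classes],
                                               feed_dict={yolo_model.input: image_data,
                                                          K.learning_phase(): 0})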

Let's run the following cell on the "test.jpg" image below to verify that our function is correct.

Input
test.jpg

out_scores, out_boxes, out_classes = predict(sess, "test.jpg")

The following figure shows the output after car detection. Each of the bounding boxes has the name of the detected object at the top left, along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)

test


The following animation shows the output Images with detected objects (cars) using YOLO for a set of input images.

cars.gif



What we should remember:

  • YOLO is a state-of-the-art object detection model that is fast and accurate.
  • It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
  • The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
  • We then filter through all the boxes using score thresholding and non-max suppression. Specifically:
    • Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
    • Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
  • Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as a lot of computation, we used previously trained model parameters in this exercise.

 


References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

  1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
  2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
  3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
  4. The official YOLO website.

Car detection dataset: Creative Commons License.

The Drive.ai Sample Dataset (provided by drive.ai) is licensed under a Creative Commons Attribution 4.0 International License.

Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm will be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video), and the algorithm will output the optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.

 


Problem Statement



Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a threshold τ should be added, such that if τ is larger than the smallest eigenvalue of A'A, then the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

f18.png
has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers an n×n (n ≥ 3) window around each pixel and solves a least-squares problem to find the best flow vector for the pixel.

f19.png

  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal

def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])  # *.25
    w = window_size // 2  # window_size is odd; all pixels with offsets in [-w, w] are inside the window
    I1g = I1g / 255.  # normalize pixels
    I2g = I2g / 255.  # normalize pixels
    # Implement Lucas-Kanade: for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) + \
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            # brightness constancy: A . nu = b, with A = [Ix Iy] and b = -It
            A = np.vstack((Ix, Iy)).T
            b = -It
            # skip the pixel if the smallest eigenvalue of A'A is below the threshold tau
            if np.min(np.linalg.eigvalsh(A.T @ A)) < tau:
                continue
            nu = np.linalg.pinv(A) @ b  # least-squares flow vector
            u[i, j] = nu[0]
            v[i, j] = nu[1]

    return (u, v)
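A minimal usage sketch is given below; the frame file names are placeholders, the frames are converted to gray level (scaled back to [0, 255] since the function normalizes internally), and the flow field is visualized with a quiver plot.

from skimage.io import imread
from skimage.color import rgb2gray
import matplotlib.pylab as plt

I1 = 255 * rgb2gray(imread('frame1.png'))   # hypothetical consecutive frames
I2 = 255 * rgb2gray(imread('frame2.png'))
u, v = optical_flow(I1, I2, window_size=15, tau=1e-2)

# show the flow field on top of the first frame (subsampled for readability)
plt.imshow(I1, cmap='gray')
plt.quiver(range(0, I1.shape[1], 10), range(0, I1.shape[0], 10),
           u[::10, ::10], -v[::10, ::10], color='red')
plt.show()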

 


Some Results


  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) between consecutive frames is slow. On the contrary, if the motion is large, the algorithm fails, and we should then implement / use the multi-scale (pyramidal) version of Lucas-Kanade with image pyramids.
  • Finally, with a small window size the algorithm captures subtle motions but misses large motions; with a large window size it is the other way around.


Input Sequences

sphere

shpere_cmap_15

Output Optical Flow with different window sizes

window size = 15

shpere_opt_15

window size = 21

shpere_opt_21

 



Input Sequences
rubic

Output Optical Flow
rubic_opt

rubic_cmap



Input Sequences (hamburg taxi)
taxi

taxi_cmap

Output Optical Flow
taxi_opt

 


Input Sequences
box

box_cmap

Output Optical Flow
box_opt


Input Sequences
seq

seq_cmap

Output Optical Flow
seq_opt


Input Sequences
fount3.gif

fount_cmap

Output Optical Flow
fount_opt


Input Sequences
corridor

Output Optical Flow
corridor_optc


Input Sequences
synth

synth'_cmap

Output Optical Flow
synth_opt


Input Sequences
cars1

Output Optical Flow
cars1_opt
cars1_cmap


Input Sequences
cars2

Output Optical Flow
cars2_opt

Output Optical Flow
cars2_opt2
cars2_cmap



Input Sequences

carsh.gif

cars3_cmap

Output Optical Flow with window size 45
cars3_opt.gif

Output Optical Flow with window size 10

cars3_opt2_10
Output Optical Flow with window size 25
cars3_opt2_25
Output Optical Flow with window size 45
cars3_opt2_45


 

Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described. The algorithm was proposed by Felzenszwalb et al. from MIT in this paper. The slides on this paper from the Stanford Vision Lab can be found at this link too. As stated by the authors, the algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, and hence can be implemented to run in O(m log m) time, where m is the number of edges in the graph.


Problem Definition and the basic idea (from the paper)



  • Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

  • In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).


  • In particular, the implementation described here uses an edge weight function based on the absolute intensity difference (in the YIQ space) between the pixels connected by an edge: w((vi, vj)) = |I(pi) − I(pj)|.


  • In the graph-based approach, a segmentation S is a partition of V into components
    such that each component (or region) C ∈ S corresponds to a connected component
    in a graph G0 = (V, E0), where E0 ⊆ E.


  • In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.


  • This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.



  • The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.


f17.png


  • The threshold function τ controls the degree to which the difference between two
    components must be greater than their internal differences in order for there to be
    evidence of a boundary between them.


  • For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ(C) = k/|C|, needs to be used, where |C| denotes the size of C and k is some constant parameter.


  • That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.


  • In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.


  • The following python code shows how to create the graph.


import numpy as np
from scipy import signal
import matplotlib.image as mpimg
from math import exp, sqrt, pi
from skimage.color import rgb2yiq # rgb2yiq can also be a small user-defined helper

def gaussian_kernel(k, s = 0.5):
    # generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
    probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
    return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
    # create the pixel graph with edge weights as dissimilarities
    rgb = mpimg.imread(imfile)[:,:,:3]
    # smooth the image slightly with a gaussian before computing the edge weights
    gauss_kernel = gaussian_kernel(sz, sigma)
    for i in range(3):
        rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
    yuv = rgb2yiq(rgb)
    (w, h) = yuv.shape[:2]
    edges = {}
    for i in range(yuv.shape[0]):
        for j in range(yuv.shape[1]):
            # compute edge weights for the neighboring pixel nodes of the node i,j
            for i1 in range(i-1, i+2):
                for j1 in range(j-1, j+2):
                    if i1 == i and j1 == j: continue
                    if i1 >= 0 and i1 < w and j1 >= 0 and j1 < h:
                        # weight = absolute difference of the Y (intensity) channel
                        wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
                        # ij2id(i,j,w,h) is a helper that maps the pixel (i,j) to a linear node id
                        n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
                        edges[n1, n2] = edges[n2, n1] = wt
    return edges
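
The merging step itself is not shown above; the following is a minimal sketch of it (not the exact implementation used for the results below), assuming the edges dictionary returned by create_graph, num_nodes = width x height, and a simple union-find structure that maintains, for every component, its size |C| and its internal difference Int(C) as described in the paper.

def segment_graph(edges, num_nodes, k=0.001):
    parent = list(range(num_nodes))   # union-find parent pointers
    size = [1] * num_nodes            # component sizes |C|
    int_diff = [0.] * num_nodes       # internal differences Int(C)

    def find(x):                      # find the representative of x's component (with path compression)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # process the edges in non-decreasing order of weight, as in Kruskal's algorithm
    for (n1, n2), wt in sorted(edges.items(), key=lambda e: e[1]):
        c1, c2 = find(n1), find(n2)
        if c1 == c2:
            continue
        # merge only if the edge weight is small compared to both internal differences plus τ(C) = k/|C|
        if wt <= min(int_diff[c1] + k / size[c1], int_diff[c2] + k / size[c2]):
            parent[c2] = c1
            size[c1] += size[c2]
            int_diff[c1] = max(int_diff[c1], int_diff[c2], wt)

    return [find(i) for i in range(num_nodes)] # component label for every pixel node

Since the edges are processed in non-decreasing order of weight, every accepted merge uses a minimum-spanning-forest edge, which is what the red edges in the figures below correspond to.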

 


Some Results


  • The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

  • Although in the paper the authors report the best value of the parameter k to be around 300, in this implementation the pixel RGB values are normalized (to values in between 0 and 1) and then converted to YIQ, and the YIQ intensities are used for computing the weights (which are typically very small); hence the value of k that works best in this scenario is 0.001 – 0.01.

  • As we can see from the results below, the higher the value of the parameter k, the larger the size of the final components and the fewer the number of components in the result.

  • The minimum spanning tree creation is also shown; the red edges in the figures are the edges chosen by the algorithm to merge the components.


Input Image



            player.png

Output Images for two different values of the parameter k

player_k_0.001.gif

player_k_0.01.gif

out_ 0.001player.png

out_ 0.010player.png


The forest created after a few iterations


mst1


Input Image


hill

Output Images for two different values of the parameter k

hill_k_0.01

out_ 0.010hill.png

hill_k_0.001

out_ 0.001hill.png


The forest created after a few iterations


mst3.png


Input Image


parrot

Output Segmented Images

parrot_k_0.001out_ 0.001parrot_Dark2.pngout_ 0.001parrot_hot.png



Input Image


road2

Segmented Output Images

road2_k_0.01
out_ 0.001road2_Set1.png



Input Image


road


Output Images for two different values of the parameter k


road_k_0.001

out_ 0.001road.png
road_k_0.01out_ 0.010road.png


The forest created after a few iterations


mst2.png

 


  Input Image (UMBC)


             umbc.png


Segmented Output Images


umbc_k_0.001out_ 0.001umbc.png


    Input Image


nj.png



Segmented Output Image with k=0.001



out_ 0.001nj.png


    Input Image


hand


Segmented Output Image with k=0.001


ring_k_0.001



    Input Image


flowers2

Segmented Output Image with k=0.001

out_ 0.001flowers2.png


    Input Image (liver)


liver

Segmented Output Images 

out_ 0.001liver.pngout_ 0.010liver.png



    Input Image


building

Segmented Output Images with different values of k

building_k_0.01

out_ 0.001building.pngout_ 0.010building.png


    Input Image


frame056

Segmented Output Image

out_ 0.001frame056.png


    Input Image


vic

Segmented Output Images for different k

vic


 

Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed, and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boykov and Jolly in this paper. This problem appeared as a homework assignment here, and also in this lecture video from the Coursera image processing course by Duke University.

Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can sometimes be difficult to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines, using which our algorithm will identify the foreground and the background; after that, it will be the algorithm’s job to obtain a complete segmentation of the foreground from the background.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image       bj

 


The Graph-Cut Algorithm


 

The following describes how the segmentation problem is transformed into a graph-cut problem:


  1. Let’s first define the Directed Graph G = (V, E) as follows:

    1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

    2. Next, let’s define the edges of the graph. Obviously, there are going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).



    3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.


    4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.



  2. Now let’s describe how to compute the edge weights in this graph.



    1. In order to compute the terminal edge weights, we need to estimate the feature distributions first. Starting with the assumption that each of the nodes corresponding to the scribbled pixels has probability 1.0 of belonging to the foreground or background object (distinguished by the scribble color), since we want the solution to respect the regional hard constraints marked by the user seeds / scribbles, we have to compute the probability of belonging to the foreground (or background) for all the other non-terminal nodes.

    2. The simplest way to compute P_F and P_B is to first fit a couple of Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
      with MLE from the scribbled pixel intensities, and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels, as shown in the next figures. The following code fragment shows how the pdfs are computed.


      import numpy as np
      import matplotlib.image as mpimg
      from collections import defaultdict
      
      def compute_pdfs(imfile, imfile_scrib):
          '''
          # Compute foreground and background pdfs
          # input image and the image with user scribbles
          '''
          rgb = mpimg.imread(imfile)[:,:,:3]
          yuv = rgb2yiq(rgb)           # convert to YIQ space (e.g., with skimage.color.rgb2yiq or a helper)
          rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
          yuv_s = rgb2yiq(rgb_s)
          # find the scribble pixels (helper function comparing the original and the scribbled image)
          scribbles = find_marked_locations(rgb, rgb_s)
          imageo = np.zeros(yuv.shape)
          # separately store background and foreground scribble pixels in the dictionary comps
          comps = defaultdict(lambda:np.array([]).reshape(0,3))
          for (i, j) in scribbles:
               imageo[i,j,:] = rgb_s[i,j,:]
               # scribble color as key of comps
               comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
          mu, Sigma = {}, {}
          # compute MLE parameters for the Gaussians
          for c in comps:
               mu[c] = np.mean(comps[c], axis=0)
               Sigma[c] = np.cov(comps[c].T)
          return (mu, Sigma)
      


      3. In order to compute the non-terminal edge weights, we need to compute the similarities between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (measuring how similar neighboring pixels are in RGB / YIQ space).


  3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.


  4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).


  5. There are several standard algorithms to solve the max-flow problem, e.g., Ford-Fulkerson (by finding augmenting paths, with O(E max|f|) time complexity) or Edmonds-Karp (by using BFS to find the shortest augmenting path, with O(VE^2) time complexity); typical implementations of these algorithms run pretty fast, in time polynomial in V and E. Here we are going to use a different implementation (with pymaxflow), based on Vladimir Kolmogorov's algorithm, which is empirically shown to run faster on many images in this paper; a minimal sketch with pymaxflow is shown after the figures below.


 

f13

f15f14

f16
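
The following is a minimal sketch (not the exact implementation used for the results below) of how the flow graph can be built and the min-cut computed with PyMaxflow. It assumes that prob_fg and prob_bg are per-pixel foreground / background probability maps obtained from the Gaussians fitted on the scribbles, and that the non-terminal capacities simply penalize intensity differences between 4-neighbors (a simplification of the formulas shown in the figures above); the names lam and sigma are illustrative parameters.

import numpy as np
import maxflow

def graphcut_segment(im_gray, prob_fg, prob_bg, lam=1., sigma=0.1, eps=1e-10):
    g = maxflow.Graph[float]()
    nodeids = g.add_grid_nodes(im_gray.shape)                 # one non-terminal node per pixel
    # non-terminal edges: larger capacity between similar vertical neighbors
    wts = lam * np.exp(-(im_gray - np.roll(im_gray, 1, axis=0))**2 / (2 * sigma**2))
    g.add_grid_edges(nodeids, weights=wts, structure=np.array([[0,1,0],[0,0,0],[0,0,0]]), symmetric=True)
    # non-terminal edges: larger capacity between similar horizontal neighbors
    wts = lam * np.exp(-(im_gray - np.roll(im_gray, 1, axis=1))**2 / (2 * sigma**2))
    g.add_grid_edges(nodeids, weights=wts, structure=np.array([[0,0,0],[1,0,0],[0,0,0]]), symmetric=True)
    # terminal edges: negative log-probabilities as source / sink capacities
    # (which probability goes to the source vs. the sink side depends on the chosen convention)
    g.add_grid_tedges(nodeids, -np.log(prob_bg + eps), -np.log(prob_fg + eps))
    g.maxflow()                                               # compute the max-flow / min-cut
    return g.get_grid_segments(nodeids)                       # boolean mask separating the two sides of the cut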


Results


 

The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.


Input Imagedeer

Input Image with Scribblesdeer_scribed

Fitted Densities from Color Scribblesgaussian_deer.png

maxflow_deerdeer_seg

A Tiny Sub-graph with Min-Cut

flow_deer.png


Input Image
eagle

Input Image with Scribbleseagle_scribed

Fitted Densities from Color Scribblesgaussian_eagle

maxflow_eagle

eagle.gif

A Tiny Sub-graph with Min-Cut flow_eagle


Input Image (liver)                                             
liver

Input Image with Scribblesliver_scribed

Fitted Densities from Color Scribbles
gaussian_liver

maxflow_liver



Input Imagefishes2

Input Image with Scribbles
fishes2_scribed

Fitted Densities from Color Scribblesgaussian_fishes2

maxflow_fishes2

A Tiny Sub-graph of the flow-graph with Min-Cut
flow_fishes2


Input Image
panda

Input Image with Scribblespanda_scribed

maxflow_panda


Input Imagebunny

Input Image with Scribblesbunny_scribed

Fitted Densities from Color Scribbles
gaussian_bunny

maxflow_bunny

seg_bunny

A Tiny Sub-graph of the flow-graph with Min-Cut
flow


Input Image
flowers

Input Image with Scribbles
flowers_scribedmaxflow_flowers

A Tiny Sub-graph of the flow-graph with Min-Cut

flow


Input Image                                                         Input Image with Scribbles
me3    me3_scribed

maxflow_me3


Input Imageinput1

Input Image with Scribblesinput1_scribed

maxflow_input1

A Tiny Sub-graph of the flow-graph with Min-Cut

flow.png


Input Imagerose

Input Image with Scribbles
rose_scribed

Fitted Densities from Color Scribbles
gaussian_rose

maxflow_rose



Input Image
whale

Input Image with Scribbleswhale_scribed

Fitted Densities from Color Scribblesgaussian_whale

A Tiny Sub-graph of the flow-graph with Min-Cut

flow_whalemaxflow_whalebgc_whale


Input Image (UMBC Campus Map)umbc_campus

Input Image with Scribbles
umbc_campus_scribed

maxflow_umbc_campus


Input Imagehorses

Input Image with Scribbles
horses_scribed

maxflow_horses

A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)

flow_horses


Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste


The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Imagehorses

New Backgroundsky.png

bg_1.0000horses.png

bgc_horses


 

Image Colorization Using Optimization in Python

 

This article is inspired by this SIGGRAPH paper by Levin et al., for which they took this patent; the paper was referred to in the course CS1114 from Cornell. This method is also discussed in the Coursera online image processing course by Northwestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab at the following link, and to the following python implementation in github.

 

The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.

 

The Algorithm

The YUV/YIQ color space, commonly used in video, is chosen to work in: Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation, boldface letters (e.g., r, s) are used to denote (x,y,t) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, one needs to impose the constraint that two neighboring pixels r, s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem, which aims to minimize the difference between the color U(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different.

f12

When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input
baby

Gray-scale image Input Marked with Color-Scribbles 
baby_marked

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) from RGB to the YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ are going to be pinned and will appear in the output; they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving a system of linear equations of the form WU = b, which has the closed-form least-squares solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. The same thing is to be repeated for the V channel too.
  5. However, the W matrix is going to be huge and sparse; hence sparse-matrix based implementations must be used to obtain an efficient solution. In this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow; we can instead construct the more efficient scipy csc matrix right away, by using a dictionary to store the weights initially, which is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.
rgbyiq.png

import numpy as np
import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j is not in any color scribble (not pinned)
                (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j); get_nbrs is a helper returning the neighbors and the local sigma
                Z = 0.
                id1 = ij2id(i,j,w,h) # linearized id
                for (i1, j1) in N: #s = (i1, j1)
                    id2 = ij2id(i1,j1,w,h)
                    W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                    Z += W[id1,id2]
                if Z > 0: # normalize the weights so that they sum to one, with a negative sign (off-diagonal entries)
                    for (i1, j1) in N: #s = (i1, j1)
                        id2 = ij2id(i1,j1,w,h)
                        W[id1,id2] /= -Z
    for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1. # diagonal entries are 1

    rows, cols = zip(*(W.keys())) #keys
    data = list(W.values()) #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W
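
Once W is available, the pinned (scribbled) chrominance values are placed in the right-hand side vector b and the sparse system is solved. The following is a minimal sketch of this solve step using scipy's sparse LU solver, assuming (for illustration) that colored is a dictionary mapping each scribbled pixel location (i, j) to its (U, V) values and that ij2id is the same linearized-index helper as above.

import numpy as np
from scipy.sparse.linalg import splu

def colorize_channel(W, colored, channel, w, h):
    # right-hand side: the pinned chrominance value for scribbled pixels, 0 elsewhere
    b = np.zeros(w * h)
    for (i, j), (u_val, v_val) in colored.items():
        b[ij2id(i, j, w, h)] = u_val if channel == 'U' else v_val
    # solve the sparse linear system W x = b with a sparse LU factorization
    return splu(W.tocsc()).solve(b).reshape(w, h)

# hypothetical usage: solve for both chrominance channels and stack them with the intensity Y
# U = colorize_channel(W, colored, 'U', w, h)
# V = colorize_channel(W, colored, 'V', w, h)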

 

Results

The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output
baby_col.png


The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 

monaco_gray

As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output

house_col


Original Gray-scale Image Input 

waterfall_gray

Gray-scale image Input Marked
water_marked

Color image Output
water_col


Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally
me_marked

Color image Output
me_colored


Original Gray-scale Image Input
roses_gray

Gray-scale image Input Marked
roses_gray_marked
Color image Output
roses_col


Original Gray-scale Image Input

madurga_gray

Gray-scale image Input Marked

madurga_gray_marked

Color image Output

madurga_gray_col


 

Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126. The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive pattern of your own design.

sierpinski9.png

Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.

sierpinski2.png

Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

from math import sqrt
import numpy as np

class Sierpinski:
    # Height of an equilateral triangle whose sides are of the specified length.
    def height(self, length):
        return sqrt(3) * length / 2.
    # Draws a filled equilateral triangle whose bottom vertex is (x, y)
    # of the specified side length; draw() is a small plotting helper (a possible version is sketched below).
    def filledTriangle(self, x, y, length):
        h = self.height(length)
        draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
    # Draws an empty equilateral triangle whose bottom vertex is (x, y)
    # of the specified side length.
    def emptyTriangle(self, x, y, length):
        h = self.height(length)
        draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
    # Draws a Sierpinski triangle of order n, such that the largest filled
    # triangle has bottom vertex (x, y) and sides of the specified length.
    def sierpinski(self, n, x, y, length):
        self.filledTriangle(x, y, length)
        if n > 1:
            self.sierpinski(n-1, x-length/2., y, length/2.)
            self.sierpinski(n-1, x+length/2., y, length/2.)
            self.sierpinski(n-1, x, y+self.height(length), length/2.)
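
The draw() helper used above is not shown in the original snippet; the following is a minimal sketch of one possible version with matplotlib, which fills the triangle in black when alpha=1 and only outlines it otherwise.

import matplotlib.pylab as plt

def draw(xs, ys, alpha=1):
    # fill the triangle in black for alpha=1, otherwise just draw its outline
    if alpha == 1:
        plt.fill(xs, ys, color='black')
    else:
        plt.fill(xs, ys, facecolor='none', edgecolor='black')

# hypothetical usage: draw a Sierpinski triangle of order 5
# Sierpinski().sierpinski(5, 0.5, 0., 1.)
# plt.axis('off'), plt.show()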

The following animation shows how such a triangle of order 5 is drawn recursively.
sier.gif

The following animation shows how such a triangle of order 6 is drawn recursively.

sier6.gif

A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.

 

Part 2.

Drawing a tree recursively, as described here:

treed.png

The following shows how a tree of order 10 is drawn:

tree.gif


 

The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolate, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.

f6.png

In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).
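
The following is a minimal sketch of such linterp / bilerp helpers (the names are chosen to match the trilerp code shown later); it assumes the query point (x, y) is given as (column, row) coordinates inside the image grid.

from math import floor, ceil

def linterp(x, x1, x2, q1, q2):
    # 1D linear interpolation between the values q1 (at x1) and q2 (at x2)
    return q1 if x1 == x2 else q1 + (q2 - q1) * (x - x1) / float(x2 - x1)

def bilerp(img, x, y):
    # interpolate first along x (between the two columns), then along y (between the two rows)
    x1, x2 = int(floor(x)), int(ceil(x))
    y1, y2 = int(floor(y)), int(ceil(y))
    top = linterp(x, x1, x2, img[y1, x1], img[y1, x2])
    bottom = linterp(x, x1, x2, img[y2, x1], img[y2, x2])
    return linterp(y, y1, y2, top, bottom)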

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:

f4.png

Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:

f5

The following figure shows a few such transformation matrices:

f8.png

To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.

f7.png
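
The following is a minimal sketch of such an inverse-mapping transform_image function (the name is illustrative), reusing the bilerp helper sketched earlier; it assumes T maps input pixel coordinates (x, y, 1) to output coordinates in homogeneous form.

import numpy as np

def transform_image(img, T):
    # warp each output pixel back to the input image with the inverse transformation
    T_inv = np.linalg.inv(T)
    h, w = img.shape[:2]
    out = np.zeros(img.shape)
    for y in range(h):
        for x in range(w):
            xs, ys, s = T_inv.dot(np.array([x, y, 1.]))   # homogeneous coordinates of the output pixel
            xs, ys = xs / s, ys / s
            if 0 <= xs <= w - 1 and 0 <= ys <= h - 1:     # sample only inside the input image
                out[y, x] = bilerp(img, xs, ys)
    return out

# hypothetical usage: horizontal flip (x' = width - 1 - x)
# T_flip = np.array([[-1., 0., img.shape[1] - 1.], [0., 1., 0.], [0., 0., 1.]])
# flipped = transform_image(img, T_flip)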

To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping

out_flip

2. Scaling by a factor of 0.5

out_scale
3. Rotation by 45 degrees around the center of the image

out_rot

The next animations show rotation and sheer transformations with the Lena image:

rot.gif

wave1_.gif
sheer.gif

Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.

duck.gif


Some non-linear transformations

The next figure shows the transform functions from here:

f9.png

The next figures and animations show the application of the above non-linear transforms on the Lena image.

Wave1
out_f1
wave1.gif
     Wave2
out_f2

wave2.gif           

Swirl
out_swirl

swirl.gif

me.gif

         Warp

out_warp

warp.gif

Some more non-linear transforms:

syncsync1

 

Anti-aliasing

There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if we try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image

bricks_gray.png

Down-sampled Image with Bilinear Interpolation
out_scale_b.png

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:

f10.png

Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal
import numpy as np

def create_image_stack(img, max_level):
    K = np.ones((3,3)) / 9. # 3x3 box blur kernel
    image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
    image_stack[:,:,0] = img # the first level is the original image
    for l in range(1, max_level):
        # every subsequent level is a blurred version of the previous level
        image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
                                               boundary='symm', mode='same')
    return image_stack

The next animation shows the image stack created from the bricks image.

ims.gif

 

Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row, col, k) of the image stack, where all three coordinates are fractional. We therefore need something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

from math import floor, ceil

def trilerp(img_stack, x, y, k):

    if k < 1: k = 1
    if k == int(k): # integer stack level: plain bilinear interpolation at that level
        return bilerp(img_stack[:,:,int(k)-1], x, y)
    else: # fractional level: bilinear interpolation at the two nearest levels + linear interpolation across levels
        f_k, c_k = int(floor(k)), int(ceil(k))
        v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
        v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
        return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    f11.png
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.
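
A minimal sketch putting these steps together is shown below; it reuses create_image_stack, trilerp and the inverse-mapping loop from the earlier transform sketch, and it assumes (this is an assumption, the exact expression is in the figure referenced above) that the downscale factor k can be approximated by the inverse square root of the absolute determinant of the 2×2 linear part of T.

import numpy as np

def transform_image_antialias(img, T, max_level=10):
    stack = create_image_stack(img, max_level)              # blur stack for a grayscale image
    # approximate downscale factor from the 2x2 linear part of T (assumption, see the figure above)
    k = 1. / np.sqrt(abs(np.linalg.det(T[:2, :2])))
    k = min(max(k, 1.), max_level)                          # k < 1 (scaling up) reverts to bilinear interpolation
    T_inv = np.linalg.inv(T)
    h, w = img.shape[:2]
    out = np.zeros(img.shape)
    for y in range(h):
        for x in range(w):
            xs, ys, s = T_inv.dot(np.array([x, y, 1.]))
            xs, ys = xs / s, ys / s
            if 0 <= xs <= w - 1 and 0 <= ys <= h - 1:
                out[y, x] = trilerp(stack, xs, ys, k)       # sample the stack at the fractional level k
    return out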

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolation
out_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image
bricks_rgb

Down-sampled Image with Bilinear Interpolation
out_scale_b

Down-sampled Image with Anti-aliasing
out_scale_t

The following animation shows the banding artifacts created when using bilinear interpolation for different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation

a

Down-sampled Images with Anti-aliasing

aa

Hand-Gesture Classification using Deep Convolution and Residual Neural Network (ResNet-50) with Tensorflow / Keras in Python

In this article, first an application of a convolution net to classify a set of hand-sign images is going to be discussed. Later, the accuracy of this classifier will be improved using a deep ResNet. These problems appeared as assignments in the Coursera course Convolutional Neural Networks (a part of the deep learning specialization) by Prof. Andrew Ng of Stanford (deeplearning.ai). The problem descriptions are taken straight from the course itself.

1. Hand-gesture Classification with Convolution Neural Network

In this assignment, the following tasks are going to be accomplished:

  • Implement a fully functioning ConvNet using TensorFlow.
  • Build and train a ConvNet in TensorFlow for a classification problem

This assignment is going to be done using tensorflow.

First the necessary packages are loaded:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import tensorflow as tf
from tensorflow.python.framework import ops
from cnn_utils import *

%matplotlib inline
np.random.seed(1)

Next the “SIGNS” dataset is loaded that we are going to use. The SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5, as shown in the next figure. The output classes are shown with one hot encoding.

# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
signs_data_kiank.png

The next figures show a few randomly sampled images for each class label from the training dataset. There are 180 images for each class and a total of 1080 images in the training dataset.

012345
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

The following steps are to be executed to train a conv-net model with tensorflow using the training dataset and then classify the images from the test dataset using the model.

Create placeholders

TensorFlow requires that we create placeholders for the input data that will be fed into the model when running the session.

Let’s implement the function below to create placeholders for the input image X and the output Y. We should not define the number of training examples for the moment; to do so, we use “None” as the batch size, which will give us the flexibility to choose it later. Hence X should be of dimension [None, n_H0, n_W0, n_C0] and Y should be of dimension [None, n_y].

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    """
    Creates the placeholders for the tensorflow session.
    
    Arguments:
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes
        
    Returns:
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"
    """

    X = tf.placeholder(tf.float32, shape=(None, n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None, n_y))
    
    return X, Y

 

Initialize parameters

Let’s initialize the weights/filters W1 and W2 using xavier_initializer.
We don’t need to worry about bias variables, as we will soon see that the TensorFlow functions take care of the bias. Note also that we will only initialize the weights/filters for the conv2d functions; TensorFlow initializes the layers for the fully connected part automatically.

def initialize_parameters():
    """
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    Returns:
    parameters -- a dictionary of tensors containing W1, W2
    """
    
    tf.set_random_seed(1)                              # so that our "random" numbers match ours
        
    W1 = tf.get_variable("W1", (4, 4, 3, 8), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 
    W2 = tf.get_variable("W2", (2, 2, 8, 16), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 

    parameters = {"W1": W1,
                  "W2": W2}
    
    return parameters

Forward propagation

Next we need to implement the forward_propagation function below to build the following model:
CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED.

We need to use the following built-in tensorflow functions:

  • tf.nn.conv2d(X, W1, strides = [1,s,s,1], padding = ‘SAME’): given an input X and a group of filters W1, this function convolves W1’s filters on X. The third input ([1,s,s,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev). You can read the full documentation here
  • tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = ‘SAME’): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. You can read the full documentation here
  • tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape). You can read the full documentation here.
  • tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector while maintaining the batch-size. It returns a flattened tensor with shape [batch_size, k]. You can read the full documentation here.
  • tf.contrib.layers.fully_connected(F, num_outputs): given the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation here.

In detail, we will use the following parameters for all the steps:

 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
 - Flatten the previous output.
 - FULLYCONNECTED (FC) layer: Apply a fully connected layer without an non-linear activation function. Do not call the softmax here. This will result in 6 neurons in the output layer, which then get passed later to a softmax. In TensorFlow, the softmax and cost function are lumped together into a single function, which you'll call in a different function when computing the cost. 
def forward_propagation(X, parameters):
    """
    Implements the forward propagation for the model:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Returns:
    Z3 -- the output of the last LINEAR unit
    """    

Compute cost

Next step is to implement the compute cost function using the following tensorflow functions:

  • tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax entropy loss. This function both computes the softmax activation function as well as the resulting loss. You can check the full documentation here.
  • tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to sum the losses over all the examples to get the overall cost. You can check the full documentation here.
def compute_cost(Z3, Y):
    """
    Computes the cost
    
    Arguments:
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    
    Returns:
    cost - Tensor of the cost function
    """

Model

Finally we need to merge the helper functions we implemented above to build a model and train it on the SIGNS dataset.

The model should:

  • create placeholders
  • initialize parameters
  • forward propagate
  • compute the cost
  • create an optimizer

Finally we need to create a session and run a for loop for num_epochs, get the mini-batches, and then for each mini-batch you will optimize the function.

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    """
    Implements a three-layer ConvNet in Tensorflow:
    CONV2D -> RELU -> MAXPOOL -> CONV2D -> RELU -> MAXPOOL -> FLATTEN -> FULLYCONNECTED
    
    Arguments:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- test set, of shape (None, n_y = 6)
    X_test -- training set, of shape (None, 64, 64, 3)
    Y_test -- test set, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    
    Returns:
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

Then let’s train the model for 100 epochs.

_, _, parameters = model(X_train, Y_train, X_test, Y_test)
 with the following output:
Cost after epoch 0: 1.918487
Cost after epoch 5: 1.875008
Cost after epoch 10: 1.813409
Cost after epoch 15: 1.667654
Cost after epoch 20: 1.444399
Cost after epoch 25: 1.203926
Cost after epoch 30: 1.028009
Cost after epoch 35: 0.887578
Cost after epoch 40: 0.791803
Cost after epoch 45: 0.712319
Cost after epoch 50: 0.655244
Cost after epoch 55: 0.597494
Cost after epoch 60: 0.556236
Cost after epoch 65: 0.525260
Cost after epoch 70: 0.484548
Cost after epoch 75: 0.477365
Cost after epoch 80: 0.451908
Cost after epoch 85: 0.415393
Cost after epoch 90: 0.386501
Cost after epoch 95: 0.373167

gr

Tensor(“Mean_1:0”, shape=(), dtype=float32)
Train Accuracy: 0.894444
Test Accuracy: 0.841667

 

2. Improving the Accuracy of the Hand-Gesture Classifier with Residual Networks

Now we shall learn how to build very deep convolutional networks, using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions; but in practice, they are hard to train. Residual Networks, introduced by He et al., allow us to train much deeper networks than were previously practically feasible.

In this assignment, the following tasks we are going to accomplish:

  • Implement the basic building blocks of ResNets.
  • Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

This assignment will be done in Keras.

Let’s first load the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot_ng as pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline
import keras.backend as K
K.set_image_data_format('channels_last')
K.set_learning_phase(1)

The problem of very deep neural networks

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow.

During training, we might therefore see the magnitude (or norm) of the gradient for the earlier layers decrease to zero very rapidly as training proceeds:

vanishing_grad_kiank.png

We are now going to solve this problem by building a Residual Network!

Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly back-propagated to earlier layers:

skip_connection_kiank.png

The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, we can form a very deep network.

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different. We are going to implement both of them.

1 – The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

idblock2_kiank.png

The upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step.

In this exercise, we’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:

idblock3_kiank.png

Here’re the individual steps.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of shape (f,f) and a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

  • The shortcut and the input are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Now let’s implement the ResNet identity block.

  • To implement the Conv2D step: See reference
  • To implement BatchNorm: See reference (axis: Integer, the axis that should be normalized (typically the channels axis))
  • For the activation, use: Activation('relu')(X)
  • To add the value passed forward by the shortcut: See reference
def identity_block(X, f, filters, stage, block):
    """
    Implementation of the identity block as defined in Figure 3
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    
    Returns:
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    """
    ### The first Component ###
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value. You'll need this later to add back to the main path. 
    X_shortcut = X
    
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### The second Component: f x f CONV (same padding, stride 1) -> BatchNorm -> ReLU ###
    X = Conv2D(filters = F2, kernel_size = (f, f), strides = (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    ### The third Component: 1 x 1 CONV (valid padding, stride 1) -> BatchNorm, no ReLU ###
    X = Conv2D(filters = F3, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    ### Final step: add the shortcut to the main path and apply the ReLU activation ###
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)

    return X

2 – The convolutional block

Next, the ResNet “convolutional block” is the other type of block. We can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:

convblock_kiank.png

The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. For example, to reduce the activation dimensions’s height and width by a factor of 2, we can use a 1×1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

The details of the convolutional block are as follows.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of (f,f) and a stride of (1,1). Its padding is “same” and it’s name should be conv_name_base + '2b'.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of (1,1) and a stride of (1,1). Its padding is “valid” and it’s name should be conv_name_base + '2c'.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

  • The CONV2D has F3 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
  • The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

  • The shortcut and the main path values are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Let’s now implement the convolutional block.

def convolutional_block(X, f, filters, stage, block, s = 2):
    """
    Implementation of the convolutional block as defined in Figure 4
    
    Arguments:
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used
    
    Returns:
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    """
    
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    
    # Retrieve Filters
    F1, F2, F3 = filters
    
    # Save the input value
    X_shortcut = X


    ##### MAIN PATH #####
    # First component of main path 
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    
    # Second component of main path
    # ...
    # Third component of main path
    # ...
    ##### SHORTCUT PATH ####
    # ...
    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    # ...
    return X
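
Filled in per the specification above, the remaining pieces of the block might look like the following sketch (again, a sketch rather than the course's exact reference code):

    # Second component of main path: F2 filters of shape (f,f), stride (1,1), 'same' padding
    X = Conv2D(F2, (f, f), strides = (1,1), padding = 'same', name = conv_name_base + '2b', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2b')(X)
    X = Activation('relu')(X)

    # Third component of main path: F3 filters of shape (1,1), stride (1,1), 'valid' padding, no ReLU
    X = Conv2D(F3, (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2c', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2c')(X)

    # Shortcut path: F3 filters of shape (1,1), stride (s,s), 'valid' padding
    X_shortcut = Conv2D(F3, (1, 1), strides = (s,s), padding = 'valid', name = conv_name_base + '1', kernel_initializer = glorot_uniform(seed=0))(X_shortcut)
    X_shortcut = BatchNormalization(axis = 3, name = bn_name_base + '1')(X_shortcut)

    # Final step: add the shortcut to the main path, then apply ReLU
    X = Add()([X, X_shortcut])
    X = Activation('relu')(X)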

 

3 – Building our first ResNet model (50 layers)

We now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means we should stack 3 identity blocks together.

resnet_kiank.png

The details of this ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
    • BatchNorm is applied to the channels axis of the input.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three sets of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
  • Stage 3:
    • The convolutional block uses three sets of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
    • The 3 identity blocks use three sets of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
  • Stage 4:
    • The convolutional block uses three sets of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
    • The 5 identity blocks use three sets of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
  • Stage 5:
    • The convolutional block uses three sets of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
    • The 2 identity blocks use three sets of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
  • The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
  • The flatten doesn’t have any hyperparameters or name.
  • The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Let’s implement the ResNet with 50 layers described in the figure above.

We'll use the identity_block and convolutional_block functions defined above, along with a couple of helper functions (load_dataset and convert_to_one_hot) that come with the course material, in the code below.

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    """
    Implementation of the popular ResNet50 with the following architecture:
    CONV2D -> BATCHNORM -> RELU -> MAXPOOL -> CONVBLOCK -> IDBLOCK*2 -> CONVBLOCK -> IDBLOCK*3
    -> CONVBLOCK -> IDBLOCK*5 -> CONVBLOCK -> IDBLOCK*2 -> AVGPOOL -> TOPLAYER

    Arguments:
    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    Returns:
    model -- a Model() instance in Keras
    """
    
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    
    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # ...

    # ...

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
        
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model
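
The stages elided above ("# ...") follow directly from the ResNet-50 specification listed earlier. Filled in per that list (a sketch, to be inserted between Stage 2 and the output layer), they would read:

    # Stage 3
    X = convolutional_block(X, f = 3, filters = [128, 128, 512], stage = 3, block='a', s = 2)
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='b')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='c')
    X = identity_block(X, 3, [128, 128, 512], stage=3, block='d')

    # Stage 4
    X = convolutional_block(X, f = 3, filters = [256, 256, 1024], stage = 4, block='a', s = 2)
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='b')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='c')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='d')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='e')
    X = identity_block(X, 3, [256, 256, 1024], stage=4, block='f')

    # Stage 5
    X = convolutional_block(X, f = 3, filters = [512, 512, 2048], stage = 5, block='a', s = 2)
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='b')
    X = identity_block(X, 3, [512, 512, 2048], stage=5, block='c')

    # AVGPOOL: (2,2) window, named 'avg_pool'
    X = AveragePooling2D((2, 2), name='avg_pool')(X)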

Next, let’s build the model’s graph. We have 6 output classes for the hand-signs dataset.

model = ResNet50(input_shape = (64, 64, 3), classes = 6)

We need to configure the learning process by compiling the model.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The model is now ready to be trained. The only thing we need is the same hand-signs dataset that we used earlier, so let's load it.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)
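
load_dataset() and convert_to_one_hot() are small helpers shipped with the course material. As an illustration, a minimal convert_to_one_hot (an assumption about its exact implementation, not the course's code) can be written with np.eye:

def convert_to_one_hot(Y, C):
    # Y: integer label vector of shape (1, m); C: number of classes
    # returns a (C, m) one-hot matrix; the .T calls above then give shape (m, C)
    return np.eye(C)[Y.reshape(-1)].T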

Now let's train our ResNet model for 20 epochs with a batch size of 32.

model.fit(X_train, Y_train, epochs = 20, batch_size = 32)
Epoch 1/20
1080/1080 [==============================] - 173s - loss: 2.0610 - acc: 0.3435   
Epoch 2/20
1080/1080 [==============================] - 149s - loss: 1.8561 - acc: 0.4259   
Epoch 3/20
1080/1080 [==============================] - 147s - loss: 2.0284 - acc: 0.4343   
Epoch 4/20
1080/1080 [==============================] - 151s - loss: 1.7140 - acc: 0.4500   
Epoch 5/20
1080/1080 [==============================] - 134s - loss: 1.4401 - acc: 0.5676   
Epoch 6/20
1080/1080 [==============================] - 128s - loss: 1.1950 - acc: 0.6481   
Epoch 7/20
1080/1080 [==============================] - 129s - loss: 0.9886 - acc: 0.7426   
Epoch 8/20
1080/1080 [==============================] - 133s - loss: 1.2155 - acc: 0.6843   
Epoch 9/20
1080/1080 [==============================] - 131s - loss: 0.8536 - acc: 0.8185   
Epoch 10/20
1080/1080 [==============================] - 132s - loss: 0.9502 - acc: 0.7565   
Epoch 11/20
1080/1080 [==============================] - 129s - loss: 0.8180 - acc: 0.8111   
Epoch 12/20
1080/1080 [==============================] - 130s - loss: 0.7060 - acc: 0.8343   
Epoch 13/20
1080/1080 [==============================] - 130s - loss: 0.8687 - acc: 0.8148   
Epoch 14/20
1080/1080 [==============================] - 130s - loss: 0.8282 - acc: 0.8509   
Epoch 15/20
1080/1080 [==============================] - 130s - loss: 0.9303 - acc: 0.7972   
Epoch 16/20
1080/1080 [==============================] - 146s - loss: 1.1211 - acc: 0.7870   
Epoch 17/20
1080/1080 [==============================] - 143s - loss: 0.9337 - acc: 0.7824   
Epoch 18/20
1080/1080 [==============================] - 150s - loss: 0.3976 - acc: 0.8870   
Epoch 19/20
1080/1080 [==============================] - 143s - loss: 0.2532 - acc: 0.9407   
Epoch 20/20
1080/1080 [==============================] - 133s - loss: 0.2528 - acc: 0.9556

Let’s see how this model performs on the test set.

preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
Loss = 0.36906948487
Test Accuracy = 0.891666662693

We can also print a summary of the model by running the following code.

model.summary()
____________________________________________________________________________________________________
Layer (type)                     Output Shape          Param #     Connected to                     
====================================================================================================
input_1 (InputLayer)             (None, 64, 64, 3)     0                                            
____________________________________________________________________________________________________
zero_padding2d_1 (ZeroPadding2D) (None, 70, 70, 3)     0                                            
____________________________________________________________________________________________________
conv1 (Conv2D)                   (None, 32, 32, 64)    9472                                         
____________________________________________________________________________________________________
bn_conv1 (BatchNormalization)    (None, 32, 32, 64)    256                                          
____________________________________________________________________________________________________
activation_4 (Activation)        (None, 32, 32, 64)    0                                            
____________________________________________________________________________________________________
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch2a (Conv2D)          (None, 15, 15, 64)    4160                                         
____________________________________________________________________________________________________
bn2a_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_5 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2a_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_6 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2a_branch1 (Conv2D)           (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
res2a_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2a_branch1 (BatchNormalization (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
bn2a_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_2 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_7 (Activation)        (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res2b_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
____________________________________________________________________________________________________
bn2b_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_8 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2b_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2b_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_9 (Activation)        (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2b_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2b_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_3 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_10 (Activation)       (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res2c_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
____________________________________________________________________________________________________
bn2c_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_11 (Activation)       (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2c_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
____________________________________________________________________________________________________
bn2c_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
____________________________________________________________________________________________________
activation_12 (Activation)       (None, 15, 15, 64)    0                                            
____________________________________________________________________________________________________
res2c_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
____________________________________________________________________________________________________
bn2c_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
____________________________________________________________________________________________________
add_4 (Add)                      (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
activation_13 (Activation)       (None, 15, 15, 256)   0                                            
____________________________________________________________________________________________________
res3a_branch2a (Conv2D)          (None, 8, 8, 128)     32896                                        
____________________________________________________________________________________________________
bn3a_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_14 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3a_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3a_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_15 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3a_branch1 (Conv2D)           (None, 8, 8, 512)     131584                                       
____________________________________________________________________________________________________
res3a_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3a_branch1 (BatchNormalization (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
bn3a_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_5 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_16 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3b_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3b_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_17 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3b_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3b_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_18 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3b_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3b_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_6 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_19 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3c_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3c_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_20 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3c_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3c_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_21 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3c_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3c_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_7 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_22 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res3d_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
____________________________________________________________________________________________________
bn3d_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_23 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3d_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
____________________________________________________________________________________________________
bn3d_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
____________________________________________________________________________________________________
activation_24 (Activation)       (None, 8, 8, 128)     0                                            
____________________________________________________________________________________________________
res3d_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
____________________________________________________________________________________________________
bn3d_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
____________________________________________________________________________________________________
add_8 (Add)                      (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
activation_25 (Activation)       (None, 8, 8, 512)     0                                            
____________________________________________________________________________________________________
res4a_branch2a (Conv2D)          (None, 4, 4, 256)     131328                                       
____________________________________________________________________________________________________
bn4a_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_26 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4a_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4a_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_27 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4a_branch1 (Conv2D)           (None, 4, 4, 1024)    525312                                       
____________________________________________________________________________________________________
res4a_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4a_branch1 (BatchNormalization (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
bn4a_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_9 (Add)                      (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_28 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4b_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4b_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_29 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4b_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4b_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_30 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4b_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4b_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_10 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_31 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4c_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4c_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_32 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4c_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4c_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_33 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4c_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4c_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_11 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_34 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4d_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4d_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_35 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4d_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4d_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_36 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4d_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4d_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_12 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_37 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4e_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4e_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_38 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4e_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4e_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_39 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4e_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4e_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_13 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_40 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res4f_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
____________________________________________________________________________________________________
bn4f_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_41 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4f_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
____________________________________________________________________________________________________
bn4f_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
____________________________________________________________________________________________________
activation_42 (Activation)       (None, 4, 4, 256)     0                                            
____________________________________________________________________________________________________
res4f_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
____________________________________________________________________________________________________
bn4f_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
____________________________________________________________________________________________________
add_14 (Add)                     (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
activation_43 (Activation)       (None, 4, 4, 1024)    0                                            
____________________________________________________________________________________________________
res5a_branch2a (Conv2D)          (None, 2, 2, 512)     524800                                       
____________________________________________________________________________________________________
bn5a_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_44 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5a_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5a_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_45 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5a_branch1 (Conv2D)           (None, 2, 2, 2048)    2099200                                      
____________________________________________________________________________________________________
res5a_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5a_branch1 (BatchNormalization (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
bn5a_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_15 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_46 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
res5b_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
____________________________________________________________________________________________________
bn5b_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_47 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5b_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5b_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_48 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5b_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5b_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_16 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_49 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
res5c_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
____________________________________________________________________________________________________
bn5c_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_50 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5c_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
____________________________________________________________________________________________________
bn5c_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
____________________________________________________________________________________________________
activation_51 (Activation)       (None, 2, 2, 512)     0                                            
____________________________________________________________________________________________________
res5c_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
____________________________________________________________________________________________________
bn5c_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
____________________________________________________________________________________________________
add_17 (Add)                     (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
activation_52 (Activation)       (None, 2, 2, 2048)    0                                            
____________________________________________________________________________________________________
avg_pool (AveragePooling2D)      (None, 1, 1, 2048)    0                                            
____________________________________________________________________________________________________
flatten_1 (Flatten)              (None, 2048)          0                                            
____________________________________________________________________________________________________
fc6 (Dense)                      (None, 6)             12294                                        
====================================================================================================
Total params: 23,600,006.0
Trainable params: 23,546,886.0
Non-trainable params: 53,120.0
_____________________________

Finally, the next figure shows the visualization of our ResNet50.

archres.png

Key points

  • Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
  • The skip connections help address the vanishing gradient problem. They also make it easy for a ResNet block to learn an identity function.
  • There are two main types of blocks: the identity block and the convolutional block.
  • Very deep Residual Networks are built by stacking these blocks together.

References

This article presents the ResNet algorithm due to He et al. (2015). The implementation here also took significant inspiration from, and follows the structure given in, the GitHub repository of Francois Chollet.

Classifying a Face as Happy/Unhappy and Face Recognition using a Pre-trained Deep Inception Network with Keras in Python

In this article, a couple of problems will be discussed. Both problems appeared as assignments in the Coursera course Convolutional Neural Networks (part of the deep learning specialization) by the Stanford Prof. Andrew Ng (deeplearning.ai). The problem descriptions are taken from the course itself.

1. Classifying a Face Image as Happy/Unhappy

  • Given:
    • 600 RGB (labeled) training images, each of size 64×64, with labels 0 (not happy) and 1 (happy).
    • 150 test images (with the ground-truth labels provided separately).
  • Train a deep convolutional neural network model for binary classification.
  • Use the model to predict the labels of the test images and evaluate the model using the ground truth.

 

Details of the “Happy” dataset:

  • Images are of shape (64, 64, 3)
  • Training: 600 pictures
  • Test: 150 pictures

labels.png

It is now time to solve the “Happy” Challenge.

We need to start by loading the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
import keras.backend as K
K.set_image_data_format('channels_last')
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

Then let’s normalize and load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))

number of training examples = 600
number of test examples = 150
X_train shape: (600, 64, 64, 3)
Y_train shape: (600, 1)
X_test shape: (150, 64, 64, 3)
Y_test shape: (150, 1)

Now let’s find the number of labeled happy and unhappy faces in the training dataset.

print(X_train[Y_train.ravel()==1].shape, X_train[Y_train.ravel()==0].shape)

(300, 64, 64, 3) (300, 64, 64, 3)

As can be seen, there are equal numbers of positive and negative examples in the training dataset. The following figures show a few samples drawn from each class.

happy.png

unhappy.png

Building a model in Keras

Keras is very good for rapid prototyping. In just a short time we shall be able to build a model that achieves outstanding results.

Let's implement a HappyModel() with the following architecture:

def HappyModel(input_shape):
    """
    Implementation of the HappyModel.

    Arguments:
    input_shape -- shape of the images of the dataset

    Returns:
    model -- a Model() instance in Keras
    """

    # Define the input placeholder as a tensor with shape input_shape. Think of
    # this as our input image!
    X_input = Input(input_shape)

    # Zero-Padding: pads the border of X_input with zeroes
    X = ZeroPadding2D((3, 3))(X_input)

    # CONV -> BN -> RELU Block applied to X
    X = Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0')(X)
    X = BatchNormalization(axis = 3, name = 'bn0')(X)
    X = Activation('relu')(X)

    # MAXPOOL
    X = MaxPooling2D((2, 2), name='max_pool')(X)

    # FLATTEN X (means convert it to a vector) + FULLYCONNECTED
    X = Flatten()(X)
    X = Dense(1, activation='sigmoid', name='fc')(X)

    # Create model. This creates our Keras model instance; we'll use this instance
    # to train/test the model.
    model = Model(inputs = X_input, outputs = X, name='HappyModel')

    return model

arch.png

Step 1: Let’s first create the model.

happyModel = HappyModel((64,64,3))

Step 2:  Compile the model to configure the learning process, keeping in view that the Happy Challenge is a binary classification problem.

happyModel.compile(optimizer = "Adam", loss = "binary_crossentropy", metrics = ["accuracy"])

Step 3: Train the model. Choose the number of epochs and the batch size.

happyModel.fit(x = X_train, y = Y_train, epochs = 20, batch_size = 32)

Epoch 1/20
600/600 [==============================] - 6s - loss: 1.0961 - acc: 0.6750
Epoch 2/20
600/600 [==============================] - 7s - loss: 0.4198 - acc: 0.8250
Epoch 3/20
600/600 [==============================] - 8s - loss: 0.1933 - acc: 0.9250
Epoch 4/20
600/600 [==============================] - 7s - loss: 0.1165 - acc: 0.9567
Epoch 5/20
600/600 [==============================] - 6s - loss: 0.1224 - acc: 0.9500
Epoch 6/20
600/600 [==============================] - 6s - loss: 0.0970 - acc: 0.9667
Epoch 7/20
600/600 [==============================] - 7s - loss: 0.0639 - acc: 0.9850
Epoch 8/20
600/600 [==============================] - 7s - loss: 0.0841 - acc: 0.9700
Epoch 9/20
600/600 [==============================] - 8s - loss: 0.0934 - acc: 0.9733
Epoch 10/20
600/600 [==============================] - 7s - loss: 0.0677 - acc: 0.9767
Epoch 11/20
600/600 [==============================] - 6s - loss: 0.0705 - acc: 0.9650
Epoch 12/20
600/600 [==============================] - 7s - loss: 0.0548 - acc: 0.9783
Epoch 13/20
600/600 [==============================] - 7s - loss: 0.0533 - acc: 0.9800
Epoch 14/20
600/600 [==============================] - 7s - loss: 0.0517 - acc: 0.9850
Epoch 15/20
600/600 [==============================] - 7s - loss: 0.0665 - acc: 0.9750
Epoch 16/20
600/600 [==============================] - 7s - loss: 0.0273 - acc: 0.9917
Epoch 17/20
600/600 [==============================] - 7s - loss: 0.0291 - acc: 0.9933
Epoch 18/20
600/600 [==============================] - 6s - loss: 0.0245 - acc: 0.9917
Epoch 19/20
600/600 [==============================] - 7s - loss: 0.0376 - acc: 0.9883
Epoch 20/20
600/600 [==============================] - 7s - loss: 0.0440 - acc: 0.9917

Note that if we run fit() again, the model will continue to train with the parameters it has already learnt instead of re-initializing them.

Step 4: Test/evaluate the model.

preds = happyModel.evaluate(x = X_test, y = Y_test)
print()
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))

150/150 [==============================] - 0s

Loss = 0.167731122573
Test Accuracy = 0.94666667064

As can be seen, our model gets around 95% test accuracy in 20 epochs (and 99% train accuracy).

Test with my own image

Let’s test on my own image to see how well the model generalizes on unseen face images.

img_path = 'me_happy.png'
img = image.load_img(img_path, target_size=(64, 64))
imshow(img)

me3.png

x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
print(happyModel.predict(x))

[[ 1.]]             # Happy !

Model Summary

happyModel.summary()

summary.png
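
The architecture diagrams shown in this article can be generated with the visualization utilities imported earlier (plot_model and model_to_dot). For example, the following sketch assumes pydot and graphviz are installed locally, and uses SVG from IPython.display:

from IPython.display import SVG

# save the model graph to a PNG file
plot_model(happyModel, to_file='HappyModel.png')
# render the graph inline in a notebook
SVG(model_to_dot(happyModel).create(prog='dot', format='svg'))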

 

2.  Face Recognition with Deep Neural Net

Face recognition problems commonly fall into two categories:

  1. Face Verification – “is this the claimed person?”. For example, at some airports, you can pass through customs by letting a system scan your passport and then verify that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.
  2. Face Recognition – “who is this person?”. For example, this video of Baidu employees entering the office without needing to otherwise identify themselves is an example of face recognition. This is a 1:K matching problem.

FaceNet learns a neural network that encodes a face image into a vector of 128 numbers. By comparing two such vectors, we can then determine if two pictures are of the same person.

In this assignment, we shall:

  • Implement the triplet loss function
  • Use a pretrained model to map face images into 128-dimensional encodings
  • Use these encodings to perform face verification and face recognition

In this exercise, we will be using a pre-trained model which represents ConvNet activations using a “channels first” convention, as opposed to the “channels last” convention.

In other words, a batch of images will be of shape (m,n_C,n_H,n_W) instead of (m,n_H,n_W,n_C). Both of these conventions have a reasonable amount of traction among open-source implementations; there isn’t a uniform standard yet within the deep learning community.

Naive Face Verification

In Face Verification, we're given two images and we have to tell whether they are of the same person. The simplest way to do this is to compare the two images pixel by pixel: if the distance between the raw images is less than a chosen threshold, it may be the same person!

pixel_comparison.png

Of course, this algorithm performs really poorly, since the pixel values change dramatically due to variations in lighting, orientation of the person’s face, even minor changes in head position, and so on.

We'll see that, rather than using the raw image, we can learn an encoding f(img) so that element-wise comparisons of this encoding give more accurate judgments as to whether two pictures are of the same person.

Encoding face images into a 128-dimensional vector

Using a ConvNet to compute encodings

The FaceNet model takes a lot of data and a long time to train. So, following common practice in applied deep learning, let's just load weights that someone else has already trained. The network architecture follows the Inception model from Szegedy et al. We are going to use an Inception network implementation.

This network uses 96×96 RGB images as its input. Specifically, it takes a face image (or a batch of m face images) as a tensor of shape (m, n_C, n_H, n_W) = (m, 3, 96, 96).
It outputs a matrix of shape (m, 128) that encodes each input face image as a 128-dimensional vector.

Let’s create the model for face images.

FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print("Total Params:", FRmodel.count_params())

Total Params: 3743280

By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. We then use the encodings to compare two face images as follows:

distance_kiank.png

By computing a distance between two encodings and thresholding, we can determine if the two pictures represent the same person.

So, an encoding is a good one if:

  • The encodings of two images of the same person are quite similar to each other
  • The encodings of two images of different persons are very different

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.

triplet_comparison.png

In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N).

triplet
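For a triplet (A, P, N), the loss is max(‖f(A) − f(P)‖² − ‖f(A) − f(N)‖² + α, 0), summed over the training triplets, where α is a margin hyperparameter. The following is a minimal sketch of such a loss in TensorFlow / Keras, assuming y_pred arrives as a list of the anchor, positive and negative encodings and using an illustrative margin of α = 0.2; the loss actually used to train the weights we load below may differ in details:

import tensorflow as tf

def triplet_loss(y_true, y_pred, alpha=0.2):
    # y_pred is assumed to be a list/tuple of three (m, 128) encoding tensors
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # squared L2 distances between the anchor and the positive / negative encodings
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)
    # hinge: push pos_dist to be at least alpha smaller than neg_dist
    basic_loss = pos_dist - neg_dist + alpha
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
    return loss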

 

Loading the trained model

FaceNet is trained by minimizing the triplet loss. But since training requires a lot of data and a lot of computation, we won’t train it from scratch here. Instead, we load a previously trained model using the following code; this might take a couple of minutes to run.

FRmodel.compile(optimizer='adam', loss=triplet_loss, metrics=['accuracy'])
load_weights_from_FaceNet(FRmodel)

Here is the summary of the very deep inception network:

inception.png

The next figure shows what the pre-trained deep Inception network looks like:

FRmodel.png

Here are some examples of distances between the encodings of three individuals:

dist.png
Let’s now use this model to perform face verification and face recognition!

Applying the model

Face Verification

Let’s build a database mapping each person’s name to a 128-dimensional encoding of their face, with one encoding vector per person. To generate the encodings we use img_to_encoding(image_path, model), which runs the forward propagation of the model on the specified image.
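For instance, the database can be built with something like the following sketch (the image file paths here are placeholders for one reference photo per person):

database = {}
# one reference image per person; the file paths are illustrative placeholders
for name in ['younes', 'kian', 'bertrand']:
    database[name] = img_to_encoding('images/' + name + '.jpg', FRmodel)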

This database can now be used for automated employee verification at an office gate: when someone shows up at the front door and swipes their ID card (thus giving us their name), we look up their encoding in the database and use it to check whether the person standing at the front door matches the name on the ID.

Let’s implement the verify() function, which checks if the front-door camera picture (image_path) is actually the person called “identity”. We shall have to go through the following steps:

  • Compute the encoding of the image from image_path
  • Compute the distance in between this encoding and the encoding of the identity image stored in the database
  • Open the door if the distance is less than the threshold of 0.7; otherwise, do not open it.

As presented above, we are going to use the L2 distance (np.linalg.norm); a sketch of one possible implementation is given after the skeleton below.

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".

    Arguments:
    image_path -- path to an image
    identity -- string, name of the person whose identity you'd like to verify. Has to be a resident of the Happy House.
    database -- python dictionary mapping names of allowed people (strings) to their encodings (vectors).
    model -- your Inception model instance in Keras

    Returns:
    dist -- distance between the image_path image and the image of "identity" in the database.
    door_open -- True if the door should open, False otherwise.
    """

    ### CODE HERE ###

    return dist, door_open
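One possible way to fill in the elided body, following the three steps above and assuming each database entry stores a 128-dimensional NumPy encoding, is sketched below (the printed messages mirror the outputs shown next):

import numpy as np

def verify(image_path, identity, database, model):
    # 1. compute the encoding of the camera image
    encoding = img_to_encoding(image_path, model)
    # 2. L2 distance between this encoding and the stored encoding of the claimed identity
    dist = np.linalg.norm(encoding - database[identity])
    # 3. open the door only if the distance is below the 0.7 threshold
    if dist < 0.7:
        print("It's " + str(identity) + ", welcome home!")
        door_open = True
    else:
        print("It's not " + str(identity) + ", please go away")
        door_open = False
    return dist, door_open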

Younes is trying to enter the office and the camera takes a picture of him (“camera_0.jpg”). Let’s run the above verification algorithm on this picture (passed as image_path) and compare it against the encoding stored in the database:

camera_0

verify("camera_0.jpg", "younes", database, FRmodel)

# output
It's younes, welcome home!
(0.67291224, True)

Benoit has been banned from the office and removed from the database. He stole Kian’s ID card and came back to try to present himself as Kian. The front-door camera took a picture of Benoit (“camera_2.jpg”). Let’s run the verification algorithm to check whether Benoit can enter.

benoit

verify("camera_2.jpg", "kian", database, FRmodel)

# output
It's not kian, please go away
(0.86543155, False)

Face Recognition

In this case, we need to implement a face recognition system that takes as input an image, and figures out if it is one of the authorized persons (and if so, who). Unlike the previous face verification system, we will no longer get a person’s name as another input.

Let’s implement who_is_it(); a sketch of one possible implementation is given after the skeleton below. We shall have to go through the following steps:

  • Compute the target encoding of the image from image_path.
  • Find the encoding from the database that has the smallest distance to the target encoding:
      • Initialize the min_dist variable to a large enough number (100); it keeps track of the closest encoding to the input’s encoding found so far.
      • Loop over the database dictionary’s names and encodings with for (name, db_enc) in database.items().
      • Compute the L2 distance between the target encoding and the current encoding from the database.
      • If this distance is less than min_dist, set min_dist to dist and identity to name.

 

def who_is_it(image_path, database, model):
    """
    Implements face recognition for the Happy House by finding who the person on the image_path image is.

    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of the person on the image
    model -- your Inception model instance in Keras

    Returns:
    min_dist -- the minimum distance between the image_path encoding and the encodings from the database
    identity -- string, the name prediction for the person on image_path
    """

    ### CODE HERE ###

    return min_dist, identity
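Similarly, one way to fill in who_is_it() is the sketch below; it follows the listed steps and reuses the 0.7 threshold from verify() to reject faces that are not in the database:

def who_is_it(image_path, database, model):
    # compute the target encoding of the camera image
    encoding = img_to_encoding(image_path, model)
    # keep track of the closest encoding seen so far
    min_dist = 100
    identity = None
    for (name, db_enc) in database.items():
        dist = np.linalg.norm(encoding - db_enc)  # L2 distance to this stored encoding
        if dist < min_dist:
            min_dist = dist
            identity = name
    if min_dist > 0.7:
        print("Not in the database.")
    else:
        print("it's " + str(identity) + ", the distance is " + str(min_dist))
    return min_dist, identity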

Younes is at the front door and the camera takes a picture of him (“camera_0.jpg”). Let’s see if our who_is_it() algorithm identifies Younes.

who_is_it("camera_0.jpg", database, FRmodel)

# output
it's younes, the distance is 0.672912
(0.67291224, 'younes')

We can change “camera_0.jpg” (a picture of Younes) to “camera_1.jpg” (a picture of Bertrand) and see the result.

who_is_it("camera_1.jpg", database, FRmodel)

# output
it's bertrand, the distance is 0.474829
(0.47482917, 'bertrand')

Here are the takeaways:

  • Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
  • The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
  • The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.

 

References:

  • Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering
  • Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf (2014). DeepFace: Closing the gap to human-level performance in face verification
  • The pretrained model we use is inspired by Victor Sy Wang’s implementation and was loaded using his code: https://github.com/iwantooxxoox/Keras-OpenFace.
  • The implementation by Ng et al. also took a lot of inspiration from the official FaceNet GitHub repository: https://github.com/davidsandberg/facenet