Interactive Image Segmentation with Graph-Cut in Python

In this article, interactive image segmentation with graph-cut is going to be discussed. and it will be used to segment the source object from the background in an image. This segmentation technique was proposed by Boycov and Jolli in this paper. This problem appeared as a homework assignment here., and also in this lecture video from the Coursera image processing course by Duke university.

Problem Statement: Interactive graph-cut segmentation

Let’s implement “intelligent paint” interactive segmentation tool using graph cuts algorithm on a weighted image grid. Our task will be to separate the foreground object from the background in an image.

Since it can be difficult sometimes to automatically define what’s foreground and what’s background for an image, the user is going to help us with a few interactive scribble lines using which our algorithm is going to identify the foreground and the background, after that it will be the algorithms job to obtain a complete segmentation of the foreground from the background image.

The following figures show how an input image gets scribbling from a user with two different colors (which is also going to be input to our algorithm) and the ideal segmented image output.

Scribbled Input Image                                       Expected Segmented Output Image       bj


The Graph-Cut Algorithm

The following describes how the segmentation problem is transformed into a graph-cut problem:

  1. Let’s first define the Directed Graph G = (V, E) as follows:

    1. Each of the pixels in the image is going to be a vertex in the graph. There will be another couple of special terminal vertices: a source vertex (corresponds to the foreground object) and a sink vertex (corresponds to the background object in the image). Hence, |V(G)| = width x height + 2.

    2. Next, let’s defines the edges of the graph. As obvious, there is going to be two types of edges: terminal (edges that connect the terminal nodes to the non-terminal nodes) and non-terminal (edges that connect the non-terminal nodes only).

    3. There will be a directed edge from the terminal node source to each of non-terminal nodes in the graph. Similarly,  a directed edge will be there from each non-terminal node (pixel) to the other terminal node sink. These are going to be all the terminal edges and hence, |E_T(G)| =  2 x width x height.

    4. Each of the non-terminal nodes (pixels) are going to be connected by edges with the nodes corresponding to the neighboring pixels (defined by 4 or 8 neighborhood of a pixel).  Hence, |E_N(G)| =  |Nbd| x width x height.

  2. Now let’s describe how to compute the edge weights in this graph.

    1. In order to compute the terminal edge weights, we need to estimate the feature distributions first, i.e., starting with the assumption that each of the nodes corresponding to the scribbled pixels have the probability 1.0 (since we want the solution to respect the regional hard constraints marked by the user-seeds / scribbles) to be in foreground or background object in the image (distinguished by the scribble color, e.g.), we have to compute the probability that a node belongs to the foreground (or background) for all the other non-terminal nodes.

    2. The simplest way to compute P_F and P_B is to first fit a couple of  Gaussian distributions on the scribbles by computing the parameters (μ, ∑)
      with MLE from the scribbled pixel intensities and then computing the (class-conditional) probabilities from the individual pdfs (followed by a normalization) for each of the pixels as shown in the next figures. The following code fragment show how the pdfs are being computed.

      import numpy as np
      from collections import defaultdict
      def compute_pdfs(imfile, imfile_scrib):
          # Compute foreground and background pdfs
          # input image and the image with user scribbles
           rgb = mpimg.imread(imfile)[:,:,:3]
           yuv = rgb2yiq(rgb)
           rgb_s = mpimg.imread(imfile_scrib)[:,:,:3]
           yuv_s = rgb2yiq(rgb_s)
           # find the scribble pixels
           scribbles = find_marked_locations(rgb, rgb_s)
           imageo = np.zeros(yuv.shape)
           # separately store background and foreground scribble pixels in the dictionary comps
           comps = defaultdict(lambda:np.array([]).reshape(0,3))
           for (i, j) in scribbles:
                imageo[i,j,:] = rgbs[i,j,:]
                # scribble color as key of comps
                comps[tuple(imageo[i,j,:])] = np.vstack([comps[tuple(imageo[i,j,:])], yuv[i,j,:]])
                mu, Sigma = {}, {}
           # compute MLE parameters for Gaussians
           for c in comps:
                mu[c] = np.mean(comps[c], axis=0)
                Sigma[c] = np.cov(comps[c].T)
           return (mu, Sigma)

      3. In order to compute the non-terminal edge weights, we need to compute the similarities in between a pixel node and the nodes corresponding to its neighborhood pixels, e.g., with the formula shown in the next figures (e.g., how similar neighborhood pixels are in RGB / YIQ space).

  3. Now that the underlying graph is defined, the segmentation of the foreground from the background image boils down to computing the min-cut in the graph or equivalently computing the max-flow (the dual problem) from the source to sink.

  4. The intuition is that the min-cut solution will keep the pixels with high probabilities to belong to the side of the source (foreground) node and likewise the background pixels on the other side of the cut near the sink (background) node, since it’s going to respect the (relatively) high-weight edges (by not going through the highly-similar pixels).

  5. There are several standard algorithms, e.g., Ford-Fulkerson (by finding an augmenting path with O(E max| f |) time complexity) or Edmonds-Karp (by using bfs to find the shortest augmenting path, with O(VE2) time complexity) to solve the max-flow problem, typical implementations of these algorithms run pretty fast, in polynomial time in V, E.  Here we are going to use a different implementation (with pymaxflow) based on Vladimir Kolmogorov, which is shown to run faster on many images empirically in this paper.







The following figures / animations show the interactive-segmentation results (computed probability densities, subset of the flow-graph & min-cut, final segmented image) on a few images,  some of them taken from the above-mentioned courses / videos, some of them taken from Berkeley Vision dataset.

Input Imagedeer

Input Image with Scribblesdeer_scribed

Fitted Densities from Color Scribblesgaussian_deer.png


A Tiny Sub-graph with Min-Cut


Input Image

Input Image with Scribbleseagle_scribed

Fitted Densities from Color Scribblesgaussian_eagle



A Tiny Sub-graph with Min-Cut flow_eagle

Input Image (liver)                                             

Input Image with Scribblesliver_scribed

Fitted Densities from Color Scribbles


Input Imagefishes2

Input Image with Scribbles

Fitted Densities from Color Scribblesgaussian_fishes2


A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribblespanda_scribed


Input Imagebunny

Input Image with Scribblesbunny_scribed

Fitted Densities from Color Scribbles



A Tiny Sub-graph of the flow-graph with Min-Cut

Input Image

Input Image with Scribbles

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image                                                         Input Image with Scribbles
me3    me3_scribed


Input Imageinput1

Input Image with Scribblesinput1_scribed


A Tiny Sub-graph of the flow-graph with Min-Cut


Input Imagerose

Input Image with Scribbles

Fitted Densities from Color Scribbles


Input Image

Input Image with Scribbleswhale_scribed

Fitted Densities from Color Scribblesgaussian_whale

A Tiny Sub-graph of the flow-graph with Min-Cut


Input Image (UMBC Campus Map)umbc_campus

Input Image with Scribbles


Input Imagehorses

Input Image with Scribbles


A Tiny Sub-graph of the flow-graph with Min-Cut (with blue foreground nodes)


Changing the background of an image (obtained using graph-cut segmentation) with another image’s background with cut & paste

The following figures / animation show how the background of a given image can be replaced by a new image using cut & paste (by replacing the corresponding pixels in the new image corresponding to foreground), once the foreground in the original image gets identified after segmentation.

Original Input Imagehorses

New Backgroundsky.png





Image Colorization Using Optimization in Python


This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.


The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.


The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote \left(x,y,t \right) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different. 


When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles 

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving the system of linear equations of the form WU = b that has a closed form least-square solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. Same thing is to be repeated for the V channel too.
  5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.rgbyiq.png

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j in color scribble
            (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
            Z = 0.
            id1 = ij2id(i,j,w,h) # linearized id
            for (i1, j1) in N: #s = (i1, j1)
                id2 = ij2id(i1,j1,w,h)
                W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                Z += W[id1,id2]
            if Z > 0:
               for (i1, j1) in N: #s = (i1, j1)
                   id2 = ij2id(i1,j1,w,h)
                   W[id1,id2] /= -Z
     for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1.

    rows, cols = zip(*(W.keys())) #keys
    data = W.values() #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 


As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output


Original Gray-scale Image Input 


Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked
Color image Output

Original Gray-scale Image Input


Gray-scale image Input Marked


Color image Output



Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126 . The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive patterns of your own design.


Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.


Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

class Sierpinski:
#Height of an equilateral triangle whose sides are of the specified length.
def height (self, length):
return sqrt(3) * length / 2.
#Draws a filled equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def filledTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
#Draws an empty equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def emptyTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
# Draws a Sierpinski triangle of order n, such that the largest filled
# triangle has bottom vertex (x, y) and sides of the specified length.
def sierpinski(self, n, x, y, length):
self.filledTriangle(x, y, length)
if n > 1:
self.sierpinski(n-1, x-length/2., y, length/2.)
self.sierpinski(n-1, x+length/2., y, length/2.)
self.sierpinski(n-1, x, y+self.height(length), length/2.)

The following animation shows how such a triangle of order 5 is drawn recursively.

The following animation shows how such a triangle of order 6 is drawn recursively.


A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.


Part 2.

Drawing a tree recursively, as described here:


The following shows how a tree of order 10 is drawn:



The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolation, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.


In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:


Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:


The following figure shows a few such transformation matrices:


To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.


To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping


2. Scaling by a factor of 0.5

3. Rotation by 45 degrees around the center of the image


The next animations show rotation and sheer transformations with the Lena image:



Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.


non-linear transformations

The next figure shows the transform functions from here:


The next figures and animations show the application of the above non-linear transforms on the Lena image.









Some more non-linear transforms:




There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if let’s try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image


Down-sampled Image with Bilinear Interpolation

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:


Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal

def create_image_stack(img, max_level):
K = np.ones((3,3)) / 9.
image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
image_stack[:,:,0] = img
for l in range(1, max_level):
image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
boundary=’symm’, mode=’same’)
return image_stack

The next animation shows the image stack created from the bricks image.



Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row,col,k) of the image stack, where all three coordinates are fractional. We therefore something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

def trilerp (img_stack, x, y, k):

if k < 1: k = 1
if  k == int(k):
return bilerp(img_stack[:,:,k-1], x, y)
f_k, c_k = int(floor(k)), int(ceil(k))
v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolationout_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image

Down-sampled Image with Bilinear Interpolation

Down-sampled Image with Anti-aliasing

The following animation shows the branding artifacts created when using bilinear interpolation for  different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation


Down-sampled Images with Anti-aliasing


Hand-Gesture Classification using Deep Convolution and Residual Neural Network (ResNet-50) with Tensorflow / Keras in Python

In this article, first an application of convolution net to classify a set of hand-sign images is going to be discussed.  Later the accuracy of this classifier will be improved using a deep res-net. These problems appeared as assignments in the Coursera course Convolution Neural Networks (a part of deep-learning specialization) by the Stanford Prof. Andrew Ng. ( The problem descriptions are taken straightaway from the course itself.

1. Hand-gesture Classification with Convolution Neural Network

In this assignment, the following tasks are going to be accomplished:

  • Implement a fully functioning ConvNet using TensorFlow.
  • Build and train a ConvNet in TensorFlow for a classification problem

This assignment is going to be done using tensorflow.

First the necessary packages are loaded:

import math
import numpy as np
import h5py
import matplotlib.pyplot as plt
import scipy
from PIL import Image
from scipy import ndimage
import tensorflow as tf
from tensorflow.python.framework import ops
from cnn_utils import *

%matplotlib inline

Next the “SIGNS” dataset is loaded that we are going to use. The SIGNS dataset is a collection of 6 signs representing numbers from 0 to 5, as shown in the next figure. The output classes are shown with one hot encoding.

# Loading the data (signs)
X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()
signs_data_kiank.pngThe next figures show a few randomly sampled images for each class label from the training dataset. There are 180 images for each class and a total of 108 images in the training dataset.

number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

The following steps are to be executed to train a conv-net model with tensorflow using the trainign dataset and then classify the images from the test dataset using the model.

Create placeholders

TensorFlow requires that we create placeholders for the input data that will be fed into the model when running the session.

Let’s implement the function below to create placeholders for the input image X and the output Y. We should not define the number of training examples for the moment. To do so, we could use “None” as the batch size, it will give us the flexibility to choose it later. Hence X should be of dimension [None, n_H0, n_W0, n_C0] and Y should be of dimension  [None, n_y].

def create_placeholders(n_H0, n_W0, n_C0, n_y):
    Creates the placeholders for the tensorflow session.
    n_H0 -- scalar, height of an input image
    n_W0 -- scalar, width of an input image
    n_C0 -- scalar, number of channels of the input
    n_y -- scalar, number of classes
    X -- placeholder for the data input, of shape [None, n_H0, n_W0, n_C0] and dtype "float"
    Y -- placeholder for the input labels, of shape [None, n_y] and dtype "float"

    X = tf.placeholder(tf.float32, shape=(None, n_H0, n_W0, n_C0))
    Y = tf.placeholder(tf.float32, shape=(None, n_y))
    return X, Y


Initialize parameters

Let’s initialize weights/filters W1 and Wusing xavier_initializer.
We don’t need to worry about bias variables as you will soon see that TensorFlow functions take care of the bias. Note also that you will only initialize the weights/filters for the conv2d functions. TensorFlow initializes the layers for the fully connected part automatically.

def initialize_parameters():
    Initializes weight parameters to build a neural network with tensorflow. The shapes are:
                        W1 : [4, 4, 3, 8]
                        W2 : [2, 2, 8, 16]
    parameters -- a dictionary of tensors containing W1, W2
    tf.set_random_seed(1)                              # so that our "random" numbers match ours
    W1 = tf.get_variable("W1", (4, 4, 3, 8), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 
    W2 = tf.get_variable("W2", (2, 2, 8, 16), initializer = tf.contrib.layers.xavier_initializer(seed = 0)) 

    parameters = {"W1": W1,
                  "W2": W2}
    return parameters

Forward propagation

Next we need to implement the forward_propagation function below to build the following model:

We need to use the following built-in tensorflow functions:

  • tf.nn.conv2d(X,W1, strides = [1,s,s,1], padding = ‘SAME’): given an input XX and a group of filters W1W1, this function convolves W1W1‘s filters on X. The third input ([1,f,f,1]) represents the strides for each dimension of the input (m, n_H_prev, n_W_prev, n_C_prev). You can read the full documentation here
  • tf.nn.max_pool(A, ksize = [1,f,f,1], strides = [1,s,s,1], padding = ‘SAME’): given an input A, this function uses a window of size (f, f) and strides of size (s, s) to carry out max pooling over each window. You can read the full documentation here
  • tf.nn.relu(Z1): computes the elementwise ReLU of Z1 (which can be any shape). You can read the full documentation here.
  • tf.contrib.layers.flatten(P): given an input P, this function flattens each example into a 1D vector it while maintaining the batch-size. It returns a flattened tensor with shape [batch_size, k]. You can read the full documentation here.
  • tf.contrib.layers.fully_connected(F, num_outputs): given a the flattened input F, it returns the output computed using a fully connected layer. You can read the full documentation here.

In detail, we will use the following parameters for all the steps:

 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use an 8 by 8 filter size and an 8 by 8 stride, padding is "SAME"
 - Conv2D: stride 1, padding is "SAME"
 - ReLU
 - Max pool: Use a 4 by 4 filter size and a 4 by 4 stride, padding is "SAME"
 - Flatten the previous output.
 - FULLYCONNECTED (FC) layer: Apply a fully connected layer without an non-linear activation function. Do not call the softmax here. This will result in 6 neurons in the output layer, which then get passed later to a softmax. In TensorFlow, the softmax and cost function are lumped together into a single function, which you'll call in a different function when computing the cost. 
def forward_propagation(X, parameters):
    Implements the forward propagation for the model:
    X -- input dataset placeholder, of shape (input size, number of examples)
    parameters -- python dictionary containing your parameters "W1", "W2"
                  the shapes are given in initialize_parameters

    Z3 -- the output of the last LINEAR unit

Compute cost

Next step is to implement the compute cost function using the following tensorflow functions:

  • tf.nn.softmax_cross_entropy_with_logits(logits = Z3, labels = Y): computes the softmax entropy loss. This function both computes the softmax activation function as well as the resulting loss. You can check the full documentation here.
  • tf.reduce_mean: computes the mean of elements across dimensions of a tensor. Use this to sum the losses over all the examples to get the overall cost. You can check the full documentation here.
def compute_cost(Z3, Y):
    Computes the cost
    Z3 -- output of forward propagation (output of the last LINEAR unit), of shape (6, number of examples)
    Y -- "true" labels vector placeholder, same shape as Z3
    cost - Tensor of the cost function


Finally we need to merge the helper functions we implemented above to build a model and train it on the SIGNS dataset.

The model should:

  • create placeholders
  • initialize parameters
  • forward propagate
  • compute the cost
  • create an optimizer

Finally we need to create a session and run a for loop for num_epochs, get the mini-batches, and then for each mini-batch you will optimize the function.

def model(X_train, Y_train, X_test, Y_test, learning_rate = 0.009,
          num_epochs = 100, minibatch_size = 64, print_cost = True):
    Implements a three-layer ConvNet in Tensorflow:
    X_train -- training set, of shape (None, 64, 64, 3)
    Y_train -- test set, of shape (None, n_y = 6)
    X_test -- training set, of shape (None, 64, 64, 3)
    Y_test -- test set, of shape (None, n_y = 6)
    learning_rate -- learning rate of the optimization
    num_epochs -- number of epochs of the optimization loop
    minibatch_size -- size of a minibatch
    print_cost -- True to print the cost every 100 epochs
    train_accuracy -- real number, accuracy on the train set (X_train)
    test_accuracy -- real number, testing accuracy on the test set (X_test)
    parameters -- parameters learnt by the model. They can then be used to predict.

Then let’s train the model for 100 epochs.

_, _, parameters = model(X_train, Y_train, X_test, Y_test)
 with the following output:
Cost after epoch 0: 1.918487
Cost after epoch 5: 1.875008
Cost after epoch 10: 1.813409
Cost after epoch 15: 1.667654
Cost after epoch 20: 1.444399
Cost after epoch 25: 1.203926
Cost after epoch 30: 1.028009
Cost after epoch 35: 0.887578
Cost after epoch 40: 0.791803
Cost after epoch 45: 0.712319
Cost after epoch 50: 0.655244
Cost after epoch 55: 0.597494
Cost after epoch 60: 0.556236
Cost after epoch 65: 0.525260
Cost after epoch 70: 0.484548
Cost after epoch 75: 0.477365
Cost after epoch 80: 0.451908
Cost after epoch 85: 0.415393
Cost after epoch 90: 0.386501
Cost after epoch 95: 0.373167


Tensor(“Mean_1:0”, shape=(), dtype=float32)
Train Accuracy: 0.894444
Test Accuracy: 0.841667


2. Improving the Accuracy of the Hand-Gesture Classifier with Residual Networks

Now we shall learn how to build very deep convolutional networks, using Residual Networks (ResNets). In theory, very deep networks can represent very complex functions; but in practice, they are hard to train. Residual Networks, introduced by He et al., allow to train much deeper networks than were previously practically feasible.

In this assignment, the following tasks we are going to accomplish:

  • Implement the basic building blocks of ResNets.
  • Put together these building blocks to implement and train a state-of-the-art neural network for image classification.

This assignment will be done in Keras.

Let’s first load the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Add, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D, AveragePooling2D, MaxPooling2D, GlobalMaxPooling2D
from keras.models import Model, load_model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
import pydot_ng as pydot
from IPython.display import SVG
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
from resnets_utils import *
from keras.initializers import glorot_uniform
import scipy.misc
from matplotlib.pyplot import imshow
%matplotlib inline
import keras.backend as K

The problem of very deep neural networks

In recent years, neural networks have become deeper, with state-of-the-art networks going from just a few layers (e.g., AlexNet) to over a hundred layers.

The main benefit of a very deep network is that it can represent very complex functions. It can also learn features at many different levels of abstraction, from edges (at the lower layers) to very complex features (at the deeper layers). However, using a deeper network doesn’t always help. A huge barrier to training them is vanishing gradients: very deep networks often have a gradient signal that goes to zero quickly, thus making gradient descent unbearably slow.

During training, we might therefore see the magnitude (or norm) of the gradient for the earlier layers descrease to zero very rapidly as training proceeds:


We are now going to solve this problem by building a Residual Network!

Building a Residual Network

In ResNets, a “shortcut” or a “skip connection” allows the gradient to be directly back-propagated to earlier layers:


The image on the left shows the “main path” through the network. The image on the right adds a shortcut to the main path. By stacking these ResNet blocks on top of each other, we can form a very deep network.

Two main types of blocks are used in a ResNet, depending mainly on whether the input/output dimensions are same or different. We are going to implement both of them.

1 – The identity block

The identity block is the standard block used in ResNets, and corresponds to the case where the input activation (say a[l]) has the same dimension as the output activation (say a[l+2]). To flesh out the different steps of what happens in a ResNet’s identity block, here is an alternative diagram showing the individual steps:

idblock2_kiank.pngThe upper path is the “shortcut path.” The lower path is the “main path.” In this diagram, we have also made explicit the CONV2D and ReLU steps in each layer. To speed up training we have also added a BatchNorm step. 

In this exercise, we’ll actually implement a slightly more powerful version of this identity block, in which the skip connection “skips over” 3 hidden layers rather than 2 layers. It looks like this:


Here’re the individual steps.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2a'. Use 0 as the seed for the random initialization.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of shape (f,fand a stride of (1,1). Its padding is “same” and its name should be conv_name_base + '2b'. Use 0 as the seed for the random initialization.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of shape (1,1) and a stride of (1,1). Its padding is “valid” and its name should be conv_name_base + '2c'. Use 0 as the seed for the random initialization.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Final step:

  • The shortcut and the input are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Now let’s implement the ResNet identity block.

  • To implement the Conv2D step: See reference
  • To implement BatchNorm: See reference (axis: Integer, the axis that should be normalized (typically the channels axis))
  • For the activation, use: Activation('relu')(X)
  • To add the value passed forward by the shortcut: See reference
defidentity_block(X, f, filters, stage, block):
    Implementation of the identity block as defined in Figure 3
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    X -- output of the identity block, tensor of shape (n_H, n_W, n_C)
    ### The first Component ###
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # Retrieve Filters
    F1, F2, F3 = filters
    # Save the input value. You'll need this later to add back to the main path. 
    X_shortcut = X
    # First component of main path
    X = Conv2D(filters = F1, kernel_size = (1, 1), strides = (1,1), padding = 'valid', name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)

    ### The second Component ###

    # ...

    ### The third Component ###

    # ...

    return X

2 – The convolutional block

Next, the ResNet “convolutional block” is the other type of block. We can use this type of block when the input and output dimensions don’t match up. The difference with the identity block is that there is a CONV2D layer in the shortcut path:


The CONV2D layer in the shortcut path is used to resize the input x to a different dimension, so that the dimensions match up in the final addition needed to add the shortcut value back to the main path. For example, to reduce the activation dimensions’s height and width by a factor of 2, we can use a 1×1 convolution with a stride of 2. The CONV2D layer on the shortcut path does not use any non-linear activation function. Its main role is to just apply a (learned) linear function that reduces the dimension of the input, so that the dimensions match up for the later addition step.

The details of the convolutional block are as follows.

First component of main path:

  • The first CONV2D has F1 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '2a'.
  • The first BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2a'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Second component of main path:

  • The second CONV2D has F2 filters of (f,f) and a stride of (1,1). Its padding is “same” and it’s name should be conv_name_base + '2b'.
  • The second BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2b'.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Third component of main path:

  • The third CONV2D has F3 filters of (1,1) and a stride of (1,1). Its padding is “valid” and it’s name should be conv_name_base + '2c'.
  • The third BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '2c'. Note that there is no ReLU activation function in this component.

Shortcut path:

  • The CONV2D has F3 filters of shape (1,1) and a stride of (s,s). Its padding is “valid” and its name should be conv_name_base + '1'.
  • The BatchNorm is normalizing the channels axis. Its name should be bn_name_base + '1'.

Final step:

  • The shortcut and the main path values are added together.
  • Then apply the ReLU activation function. This has no name and no hyperparameters.

Let’s now implement the convolutional block.

defconvolutional_block(X, f, filters, stage, block, s = 2):
    Implementation of the convolutional block as defined in Figure 4
    X -- input tensor of shape (m, n_H_prev, n_W_prev, n_C_prev)
    f -- integer, specifying the shape of the middle CONV's window for the main path
    filters -- python list of integers, defining the number of filters in the CONV layers of the main path
    stage -- integer, used to name the layers, depending on their position in the network
    block -- string/character, used to name the layers, depending on their position in the network
    s -- Integer, specifying the stride to be used
    X -- output of the convolutional block, tensor of shape (n_H, n_W, n_C)
    # defining name basis
    conv_name_base = 'res' + str(stage) + block + '_branch'
    bn_name_base = 'bn' + str(stage) + block + '_branch'
    # Retrieve Filters
    F1, F2, F3 = filters
    # Save the input value
    X_shortcut = X

    ##### MAIN PATH #####
    # First component of main path 
    X = Conv2D(F1, (1, 1), strides = (s,s), name = conv_name_base + '2a', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = bn_name_base + '2a')(X)
    X = Activation('relu')(X)
    # Second component of main path
    # ...
    # Third component of main path
    # ...
    ##### SHORTCUT PATH ####
    # ...
    # Final step: Add shortcut value to main path, and pass it through a RELU activation
    # ...
    return X


3 – Building our first ResNet model (50 layers)

We now have the necessary blocks to build a very deep ResNet. The following figure describes in detail the architecture of this neural network. “ID BLOCK” in the diagram stands for “Identity block,” and “ID BLOCK x3” means we should stack 3 identity blocks together.


The details of this ResNet-50 model are:

  • Zero-padding pads the input with a pad of (3,3)
  • Stage 1:
    • The 2D Convolution has 64 filters of shape (7,7) and uses a stride of (2,2). Its name is “conv1”.
    • BatchNorm is applied to the channels axis of the input.
    • MaxPooling uses a (3,3) window and a (2,2) stride.
  • Stage 2:
    • The convolutional block uses three set of filters of size [64,64,256], “f” is 3, “s” is 1 and the block is “a”.
    • The 2 identity blocks use three set of filters of size [64,64,256], “f” is 3 and the blocks are “b” and “c”.
  • Stage 3:
    • The convolutional block uses three set of filters of size [128,128,512], “f” is 3, “s” is 2 and the block is “a”.
    • The 3 identity blocks use three set of filters of size [128,128,512], “f” is 3 and the blocks are “b”, “c” and “d”.
  • Stage 4:
    • The convolutional block uses three set of filters of size [256, 256, 1024], “f” is 3, “s” is 2 and the block is “a”.
    • The 5 identity blocks use three set of filters of size [256, 256, 1024], “f” is 3 and the blocks are “b”, “c”, “d”, “e” and “f”.
  • Stage 5:
    • The convolutional block uses three set of filters of size [512, 512, 2048], “f” is 3, “s” is 2 and the block is “a”.
    • The 2 identity blocks use three set of filters of size [512, 512, 2048], “f” is 3 and the blocks are “b” and “c”.
  • The 2D Average Pooling uses a window of shape (2,2) and its name is “avg_pool”.
  • The flatten doesn’t have any hyperparameters or name.
  • The Fully Connected (Dense) layer reduces its input to the number of classes using a softmax activation. Its name should be 'fc' + str(classes).

Let’s implement the ResNet with 50 layers described in the figure above.

We’ll need to use this function:

Here’re some other functions we used in the code below:

def ResNet50(input_shape = (64, 64, 3), classes = 6):
    Implementation of the popular ResNet50 the following architecture:

    input_shape -- shape of the images of the dataset
    classes -- integer, number of classes

    model -- a Model() instance in Keras
    # Define the input as a tensor with shape input_shape
    X_input = Input(input_shape)

    # Zero-Padding
    X = ZeroPadding2D((3, 3))(X_input)
    # Stage 1
    X = Conv2D(64, (7, 7), strides = (2, 2), name = 'conv1', kernel_initializer = glorot_uniform(seed=0))(X)
    X = BatchNormalization(axis = 3, name = 'bn_conv1')(X)
    X = Activation('relu')(X)
    X = MaxPooling2D((3, 3), strides=(2, 2))(X)

    # Stage 2
    X = convolutional_block(X, f = 3, filters = [64, 64, 256], stage = 2, block='a', s = 1)
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='b')
    X = identity_block(X, 3, [64, 64, 256], stage=2, block='c')

    # ...

    # ...

    # output layer
    X = Flatten()(X)
    X = Dense(classes, activation='softmax', name='fc' + str(classes), kernel_initializer = glorot_uniform(seed=0))(X)
    # Create model
    model = Model(inputs = X_input, outputs = X, name='ResNet50')

    return model

Next, let’s build the model’s graph. We have 6 output classes for the hand-signs dataset.

model = ResNet50(input_shape = (64, 64, 3), classes = 6)

We need to configure the learning process by compiling the model.

model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])

The model is now ready to be trained. The only thing we need is to pass the same hand-signs dataset that we used earlier. We need to load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Convert training and test labels to one hot matrices
Y_train = convert_to_one_hot(Y_train_orig, 6).T
Y_test = convert_to_one_hot(Y_test_orig, 6).T

print ("number of training examples = " + str(X_train.shape[0]))
print ("number of test examples = " + str(X_test.shape[0]))
print ("X_train shape: " + str(X_train.shape))
print ("Y_train shape: " + str(Y_train.shape))
print ("X_test shape: " + str(X_test.shape))
print ("Y_test shape: " + str(Y_test.shape))
number of training examples = 1080
number of test examples = 120
X_train shape: (1080, 64, 64, 3)
Y_train shape: (1080, 6)
X_test shape: (120, 64, 64, 3)
Y_test shape: (120, 6)

Now let’s train our  resnet model on 20 epochs with a batch size of 32., Y_train, epochs = 20, batch_size = 32)
Epoch 1/20
1080/1080 [==============================] - 173s - loss: 2.0610 - acc: 0.3435   
Epoch 2/20
1080/1080 [==============================] - 149s - loss: 1.8561 - acc: 0.4259   
Epoch 3/20
1080/1080 [==============================] - 147s - loss: 2.0284 - acc: 0.4343   
Epoch 4/20
1080/1080 [==============================] - 151s - loss: 1.7140 - acc: 0.4500   
Epoch 5/20
1080/1080 [==============================] - 134s - loss: 1.4401 - acc: 0.5676   
Epoch 6/20
1080/1080 [==============================] - 128s - loss: 1.1950 - acc: 0.6481   
Epoch 7/20
1080/1080 [==============================] - 129s - loss: 0.9886 - acc: 0.7426   
Epoch 8/20
1080/1080 [==============================] - 133s - loss: 1.2155 - acc: 0.6843   
Epoch 9/20
1080/1080 [==============================] - 131s - loss: 0.8536 - acc: 0.8185   
Epoch 10/20
1080/1080 [==============================] - 132s - loss: 0.9502 - acc: 0.7565   
Epoch 11/20
1080/1080 [==============================] - 129s - loss: 0.8180 - acc: 0.8111   
Epoch 12/20
1080/1080 [==============================] - 130s - loss: 0.7060 - acc: 0.8343   
Epoch 13/20
1080/1080 [==============================] - 130s - loss: 0.8687 - acc: 0.8148   
Epoch 14/20
1080/1080 [==============================] - 130s - loss: 0.8282 - acc: 0.8509   
Epoch 15/20
1080/1080 [==============================] - 130s - loss: 0.9303 - acc: 0.7972   
Epoch 16/20
1080/1080 [==============================] - 146s - loss: 1.1211 - acc: 0.7870   
Epoch 17/20
1080/1080 [==============================] - 143s - loss: 0.9337 - acc: 0.7824   
Epoch 18/20
1080/1080 [==============================] - 150s - loss: 0.3976 - acc: 0.8870   
Epoch 19/20
1080/1080 [==============================] - 143s - loss: 0.2532 - acc: 0.9407   
Epoch 20/20
1080/1080 [==============================] - 133s - loss: 0.2528 - acc: 0.9556

Let’s see how this model performs on the test set.

preds = model.evaluate(X_test, Y_test)
print ("Loss = " + str(preds[0]))
print ("Test Accuracy = " + str(preds[1]))
Loss = 0.36906948487
Test Accuracy = 0.891666662693

We can also print a summary of your model by running the following code.

Layer (type)                     Output Shape          Param #     Connected to                     
input_1 (InputLayer)             (None, 64, 64, 3)     0                                            
zero_padding2d_1 (ZeroPadding2D) (None, 70, 70, 3)     0                                            
conv1 (Conv2D)                   (None, 32, 32, 64)    9472                                         
bn_conv1 (BatchNormalization)    (None, 32, 32, 64)    256                                          
activation_4 (Activation)        (None, 32, 32, 64)    0                                            
max_pooling2d_1 (MaxPooling2D)   (None, 15, 15, 64)    0                                            
res2a_branch2a (Conv2D)          (None, 15, 15, 64)    4160                                         
bn2a_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_5 (Activation)        (None, 15, 15, 64)    0                                            
res2a_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2a_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_6 (Activation)        (None, 15, 15, 64)    0                                            
res2a_branch1 (Conv2D)           (None, 15, 15, 256)   16640                                        
res2a_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2a_branch1 (BatchNormalization (None, 15, 15, 256)   1024                                         
bn2a_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_2 (Add)                      (None, 15, 15, 256)   0                                            
activation_7 (Activation)        (None, 15, 15, 256)   0                                            
res2b_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
bn2b_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_8 (Activation)        (None, 15, 15, 64)    0                                            
res2b_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2b_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_9 (Activation)        (None, 15, 15, 64)    0                                            
res2b_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2b_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_3 (Add)                      (None, 15, 15, 256)   0                                            
activation_10 (Activation)       (None, 15, 15, 256)   0                                            
res2c_branch2a (Conv2D)          (None, 15, 15, 64)    16448                                        
bn2c_branch2a (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_11 (Activation)       (None, 15, 15, 64)    0                                            
res2c_branch2b (Conv2D)          (None, 15, 15, 64)    36928                                        
bn2c_branch2b (BatchNormalizatio (None, 15, 15, 64)    256                                          
activation_12 (Activation)       (None, 15, 15, 64)    0                                            
res2c_branch2c (Conv2D)          (None, 15, 15, 256)   16640                                        
bn2c_branch2c (BatchNormalizatio (None, 15, 15, 256)   1024                                         
add_4 (Add)                      (None, 15, 15, 256)   0                                            
activation_13 (Activation)       (None, 15, 15, 256)   0                                            
res3a_branch2a (Conv2D)          (None, 8, 8, 128)     32896                                        
bn3a_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_14 (Activation)       (None, 8, 8, 128)     0                                            
res3a_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3a_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_15 (Activation)       (None, 8, 8, 128)     0                                            
res3a_branch1 (Conv2D)           (None, 8, 8, 512)     131584                                       
res3a_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3a_branch1 (BatchNormalization (None, 8, 8, 512)     2048                                         
bn3a_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_5 (Add)                      (None, 8, 8, 512)     0                                            
activation_16 (Activation)       (None, 8, 8, 512)     0                                            
res3b_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3b_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_17 (Activation)       (None, 8, 8, 128)     0                                            
res3b_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3b_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_18 (Activation)       (None, 8, 8, 128)     0                                            
res3b_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3b_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_6 (Add)                      (None, 8, 8, 512)     0                                            
activation_19 (Activation)       (None, 8, 8, 512)     0                                            
res3c_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3c_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_20 (Activation)       (None, 8, 8, 128)     0                                            
res3c_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3c_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_21 (Activation)       (None, 8, 8, 128)     0                                            
res3c_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3c_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_7 (Add)                      (None, 8, 8, 512)     0                                            
activation_22 (Activation)       (None, 8, 8, 512)     0                                            
res3d_branch2a (Conv2D)          (None, 8, 8, 128)     65664                                        
bn3d_branch2a (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_23 (Activation)       (None, 8, 8, 128)     0                                            
res3d_branch2b (Conv2D)          (None, 8, 8, 128)     147584                                       
bn3d_branch2b (BatchNormalizatio (None, 8, 8, 128)     512                                          
activation_24 (Activation)       (None, 8, 8, 128)     0                                            
res3d_branch2c (Conv2D)          (None, 8, 8, 512)     66048                                        
bn3d_branch2c (BatchNormalizatio (None, 8, 8, 512)     2048                                         
add_8 (Add)                      (None, 8, 8, 512)     0                                            
activation_25 (Activation)       (None, 8, 8, 512)     0                                            
res4a_branch2a (Conv2D)          (None, 4, 4, 256)     131328                                       
bn4a_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_26 (Activation)       (None, 4, 4, 256)     0                                            
res4a_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4a_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_27 (Activation)       (None, 4, 4, 256)     0                                            
res4a_branch1 (Conv2D)           (None, 4, 4, 1024)    525312                                       
res4a_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4a_branch1 (BatchNormalization (None, 4, 4, 1024)    4096                                         
bn4a_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_9 (Add)                      (None, 4, 4, 1024)    0                                            
activation_28 (Activation)       (None, 4, 4, 1024)    0                                            
res4b_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4b_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_29 (Activation)       (None, 4, 4, 256)     0                                            
res4b_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4b_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_30 (Activation)       (None, 4, 4, 256)     0                                            
res4b_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4b_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_10 (Add)                     (None, 4, 4, 1024)    0                                            
activation_31 (Activation)       (None, 4, 4, 1024)    0                                            
res4c_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4c_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_32 (Activation)       (None, 4, 4, 256)     0                                            
res4c_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4c_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_33 (Activation)       (None, 4, 4, 256)     0                                            
res4c_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4c_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_11 (Add)                     (None, 4, 4, 1024)    0                                            
activation_34 (Activation)       (None, 4, 4, 1024)    0                                            
res4d_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4d_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_35 (Activation)       (None, 4, 4, 256)     0                                            
res4d_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4d_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_36 (Activation)       (None, 4, 4, 256)     0                                            
res4d_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4d_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_12 (Add)                     (None, 4, 4, 1024)    0                                            
activation_37 (Activation)       (None, 4, 4, 1024)    0                                            
res4e_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4e_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_38 (Activation)       (None, 4, 4, 256)     0                                            
res4e_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4e_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_39 (Activation)       (None, 4, 4, 256)     0                                            
res4e_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4e_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_13 (Add)                     (None, 4, 4, 1024)    0                                            
activation_40 (Activation)       (None, 4, 4, 1024)    0                                            
res4f_branch2a (Conv2D)          (None, 4, 4, 256)     262400                                       
bn4f_branch2a (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_41 (Activation)       (None, 4, 4, 256)     0                                            
res4f_branch2b (Conv2D)          (None, 4, 4, 256)     590080                                       
bn4f_branch2b (BatchNormalizatio (None, 4, 4, 256)     1024                                         
activation_42 (Activation)       (None, 4, 4, 256)     0                                            
res4f_branch2c (Conv2D)          (None, 4, 4, 1024)    263168                                       
bn4f_branch2c (BatchNormalizatio (None, 4, 4, 1024)    4096                                         
add_14 (Add)                     (None, 4, 4, 1024)    0                                            
activation_43 (Activation)       (None, 4, 4, 1024)    0                                            
res5a_branch2a (Conv2D)          (None, 2, 2, 512)     524800                                       
bn5a_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_44 (Activation)       (None, 2, 2, 512)     0                                            
res5a_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5a_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_45 (Activation)       (None, 2, 2, 512)     0                                            
res5a_branch1 (Conv2D)           (None, 2, 2, 2048)    2099200                                      
res5a_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5a_branch1 (BatchNormalization (None, 2, 2, 2048)    8192                                         
bn5a_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_15 (Add)                     (None, 2, 2, 2048)    0                                            
activation_46 (Activation)       (None, 2, 2, 2048)    0                                            
res5b_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
bn5b_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_47 (Activation)       (None, 2, 2, 512)     0                                            
res5b_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5b_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_48 (Activation)       (None, 2, 2, 512)     0                                            
res5b_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5b_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_16 (Add)                     (None, 2, 2, 2048)    0                                            
activation_49 (Activation)       (None, 2, 2, 2048)    0                                            
res5c_branch2a (Conv2D)          (None, 2, 2, 512)     1049088                                      
bn5c_branch2a (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_50 (Activation)       (None, 2, 2, 512)     0                                            
res5c_branch2b (Conv2D)          (None, 2, 2, 512)     2359808                                      
bn5c_branch2b (BatchNormalizatio (None, 2, 2, 512)     2048                                         
activation_51 (Activation)       (None, 2, 2, 512)     0                                            
res5c_branch2c (Conv2D)          (None, 2, 2, 2048)    1050624                                      
bn5c_branch2c (BatchNormalizatio (None, 2, 2, 2048)    8192                                         
add_17 (Add)                     (None, 2, 2, 2048)    0                                            
activation_52 (Activation)       (None, 2, 2, 2048)    0                                            
avg_pool (AveragePooling2D)      (None, 1, 1, 2048)    0                                            
flatten_1 (Flatten)              (None, 2048)          0                                            
fc6 (Dense)                      (None, 6)             12294                                        
Total params: 23,600,006.0
Trainable params: 23,546,886.0
Non-trainable params: 53,120.0

Finally, the next figure shows the visualization of our ResNet50.


Key points

  • Very deep “plain” networks don’t work in practice because they are hard to train due to vanishing gradients.
  • The skip-connections help to address the Vanishing Gradient problem. They also make it easy for a ResNet block to learn an identity function.
  • There are two main type of blocks: The identity block and the convolutional block.
  • Very deep Residual Networks are built by stacking these blocks together.


This article presents the ResNet algorithm due to He et al. (2015). The implementation here also took significant inspiration and follows the structure given in the github repository of Francois Chollet:

Some Applications of Markov Chain in Python

In this article a few simple applications of Markov chain are going to be discussed as a solution to a few text processing problems. These problems appeared as assignments in a few courses, the descriptions are taken straightaway from the courses themselves.

1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course cos126 . This assignment was developed by Prof. Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with statistical properties of English text because there a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed with 'e' than with 'u', while 'q' is more likely to be followed with 'u' than with 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”

Our task is to write a python program to automate this laborious task in a more efficient way — Shannon’s brute-force approach is prohibitively slow when the size of the input text is large.

Markov model data type.
 Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:


  • Constructor. To implement the data type, create a symbol table, whose keys will be Stringk-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).
  • Order. Return the order k of the Markov Model.
  • Frequency. There are two frequency methods.
    • freq(kgram) returns the number of times the k-gram was found in the original text.
    • freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
  • Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
  • Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N2 time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
                next char       next char is 
kgram   freq    a   c   g        a    c    g
 aa      2      1   0   1       1/2   0   1/2  
 ag      5      3   0   2       3/5   0   2/5  
 cg      1      1   0   0        1    0    0
 ga      5      1   0   4       1/5   0   4/5  
 gc      1      0   0   1        0    0    1  
 gg      3      1   1   1       1/3  1/3  1/3  
        17      7   1   9

Taken from an example from the same assignment.

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.

Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/51/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2

Taken from an example from the same assignment.

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

A Python Implementation

import numpy as np
from collections import defaultdict

class MarkovModel:

    def __init__(self, text, k): 
        create a Markov model of order k from given text
        Assume that text has length at least k.
        self.k = k               
        self.tran = defaultdict(float)
        self.alph = list(set(list(text)))
        self.kgrams = defaultdict(int)
        n = len(text)
        text += text[:k]
        for i in range(n):
            self.tran[text[i:i+k],text[i+k]] += 1.
            self.kgrams[text[i:i+k]] += 1

    def order(self):                   # order k of Markov model
        return self.k

    def freq(self, kgram):             # number of occurrences of kgram in text
        assert len(kgram) == self.k    # (check if kgram is of length k)
        return self.kgrams[kgram]

    def freq2(self, kgram, c):   	# number of times that character c follows kgram
        assert len(kgram) == self.k     # (check if kgram is of length k)  
        return self.tran[kgram,c]  

    def rand(self, kgram):             # random character following given kgram
        assert len(kgram) == self.k    # (check if kgram is of length k.
        Z = sum([self.tran[kgram, alph] for alph in self.alph])
        return np.random.choice(self.alph, 1, p=np.array([self.tran[kgram, alph] for alph in self.alph])/Z)

    def gen(self, kgram, T):           # generate a String of length T characters
        assert len(kgram) == self.k    # by simulating a trajectory through the corresponding
        str = ''                       # Markov chain. The first k characters of the newly
        for _ in range(T):             # generated String should be the argument kgram.
             #print kgram, c           # check if kgram is of length k.
             c =  self.rand(kgram)[0]  # Assume that T is at least k.
             kgram = kgram[1:] + c     
             str += c			 
        return str

Some Results

m = MarkovModel('gagggagaggcgagaaa', 2)

generates the following MarkovChain where each state represents a 2-gram.



Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt


Generated output: random news item, using input as an order 7 model


Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(corrupted)  # replace unknown characters with most probable characters

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.  This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1 windows to maximize the log-likelihood:


Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.
Noisy :      it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.


2. Detecting authorship

This problem appeared as an assignment in the Cornell course cs1114 .

The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when
given a large sample size of text to train on, can then guess the author of an unknown

  • The algorithm to be implemented works based on the following idea: An author’s
    writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow – that is, which words he tends to use after other words.
  • To make things significantly simpler, we’re going to assume that the author always
    follows a given word with the same distribution of words. Of course, this isn’t true,
    since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.
  • In order to implement this model of an author’s writing style, we will use a Markov
    chain. A Markov chain is a set of states with the Markov property – that is, the
    probabilities of each state are independent from the probabilities of every other state. This behavior correctly models our assumption of word independence.
  • A Markov chain can be represented as a directed graph. Each node is a state (words,
    in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.
  • The edges, in this case, represent the probability that word j follows word i from the
    given author. This means, of course, that the sum of the weights of all edges leaving
    from each word must add up to 1.
  • We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.
  • We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some challenges for you to
    figure out.


Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We shall want to create a sparse array using the scipy csr sparse matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains 2 word frequencies (normalized by the total number of words in the document (including repeated words)).

Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous possibly small probabilities together, our
calculated likelihood will likely be extremely small. Thus, you should compare log(likelihood) instead. Keep in mind the possibility that the author may have used a word he has never used before. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.

Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. You will write this in a function classify text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best index.  The following figure shows how to detect an author k (A_k) of the test text t_1..t_n  using the transition matrix P_k with MLE :


Python Implementation

from np import log
def log0(x):
return 0 if x <= 0 else log(x)

def compute_text_likelihood(filename, T, dict_rev, histogram, index):

Compute the (log) likelihood L of a given string (in ‘filename’) given
a word transition probability T, dictionary ‘dict’, and histogram

text = word_tokenize(open(filename).read().replace(‘\n’, ‘ ‘).lower())
num_words = len(text)
text = [word for word in text if word in histogram] # keep only the words that are found in the training dataset
ll = log0(histogram[text[0]]) – log0(sum(histogram.values()))
for i in range(1, len(text)):
ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
return ll + (num_words – num_matches)*penalty

def classify_text(tmatrices, dict_revs, histograms, filename):

Return the index of the most likely author given the transition matrices, dictionaries, and a test file

for i in range(len(tmatrices)):
ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)

print i, ll

Training Dataset

The list of authors whose writings are there in the training dataset:

0. Austen
1. Carroll
2. Hamilton
3. Jay
4. Madison
5. Shakespeare
6. Thoreau
7. Twain


A few lines of excerpts from the training files (the
literature by several authors, taken from Project Gutenberg), the word clouds and a few states from the corresponding Markov Models Constructed for a few authors


Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with




Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep




Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)


So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.

My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.




Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became
head superintendent; had a couple of thousand men under me.

Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.

When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.






Classifying unknown texts from the Test Dataset

Each of the Markov models learnt from the training texts for each author are used to compute the log-likelihood of the unknown test text, the author with the maximum log-likelihood is chosen to be the likely author of the text.
Unknown Text1 

Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.

log-likelihood values computed for the probable authors

Author  LL
0           -3126.5812874
1           -4127.9155186
2           -7364.15782346
3           -9381.06336055
4           -7493.78440066
5           -4837.98005673
6           -3515.44028659
7           -3455.85716104

As can be seen from above the maximum likelihood value corresponds to the author 0, i.e., Austen.  Hence, the most probable author of the unknown text is Austen.



Unknown Text2 

Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
had been separated. But now, this thing had actually and unquestionably
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:

“Brother, go find your brother!”

He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each

log-likelihood values computed for the probable authors

Author  LL
0             -2779.02810424
1             -2738.09304225
2             -5978.83684489
3             -6551.16571407
4             -5780.39620942
5             -4166.34886511
6             -2309.25043697
7             -2033.00112729

As can be seen from above the maximum likelihood value corresponds to the author 7, i.e., Twain.  Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states corresponding to the Markov model for Twain trained from the training dataset.


Classifying a Face as Happy/Unhappy and Face Recognition using a Pre-trained Deep Inception Network with Keras in Python

In this article couple of problems are going to be discussed. Both the problems appeared as assignments in the Coursera course Convolution Neural Networks (a part of deeplearning specialization) by the Stanford Prof. Andrew Ng. ( The problem descriptions are taken from the course itself.

1. Classifying a Face Image as Happy/Unhappy

  • Given:
    • 600 RGB (labeled) training images each of size 64×64, with labels 0 (not happy) and 1 (happy).
    • 150 (unlabeled) test images (also the ground-truths separately).
  • Train a deep convolution neural net model for binary classification.
  • Use the model to predict the labels of the test images and evaluate the model using the ground truth.


Details of the “Happy” dataset:

Images are of shape (64,64,3)
Training: 600 pictures
Test: 150 pictures

It is now time to solve the “Happy” Challenge.

We need to start by loading the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
import keras.backend as K
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

Then let’s normalize and load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

​# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print (“number of training examples = ” + str(X_train.shape[0]))
print (“number of test examples = ” + str(X_test.shape[0]))
print (“X_train shape: ” + str(X_train.shape))
print (“Y_train shape: ” + str(Y_train.shape))
print (“X_test shape: ” + str(X_test.shape))
print (“Y_test shape: ” + str(Y_test.shape))

number of training examples = 600
number of test examples = 150
X_train shape: (600, 64, 64, 3)
Y_train shape: (600, 1)
X_test shape: (150, 64, 64, 3)
Y_test shape: (150, 1)

Now let’s find the number of labeled happy and unhappy faces in the training dataset.

print(X_train[Y_train.ravel()==1].shape, X_train[Y_train.ravel()==0].shape)

(300, 64, 64, 3) (300, 64, 64, 3)

As can be seen, there are equal numbers of positive and negative examples in the training dataset. The following figures show a few samples drawn from each class.



Building a model in Keras

Keras is very good for rapid prototyping. In just a short time we shall be able to build a model that achieves outstanding results.

Let’s Implement a HappyModel() with the following architecture:

def HappyModel(input_shape):
Implementation of the HappyModel.

input_shape — shape of the images of the dataset

model — a Model() instance in Keras

# Define the input placeholder as a tensor with shape input_shape. Think of
# this as our input image!
X_input = Input(input_shape)

# Zero-Padding: pads the border of X_input with zeroes
X = ZeroPadding2D((3, 3))(X_input)

# CONV -> BN -> RELU Block applied to X
X = Conv2D(32, (7, 7), strides = (1, 1), name = ‘conv0’)(X)
X = BatchNormalization(axis = 3, name = ‘bn0’)(X)
X = Activation(‘relu’)(X)

X = MaxPooling2D((2, 2), name=’max_pool’)(X)

# FLATTEN X (means convert it to a vector) + FULLYCONNECTED
X = Flatten()(X)
X = Dense(1, activation=’sigmoid’, name=’fc’)(X)

# Create model. This creates our Keras model instance, you’ll use this instance
# to train/test the model.
model = Model(inputs = X_input, outputs = X, name=’HappyModel’)

return model


Step 1: Let’s first create the model.

happyModel = HappyModel((64,64,3))

Step 2:  Compile the model to configure the learning process, keeping in view that the Happy Challenge is a binary classification problem.

happyModel.compile(optimizer = “Adam”, loss = “binary_crossentropy”, metrics = [“accuracy”])

Step 3: Train the model. Choose the number of epochs and the batch size. = X_train, y = Y_train, epochs = 20, batch_size = 32)

Epoch 1/20
600/600 [==============================] – 6s – loss: 1.0961 – acc: 0.6750
Epoch 2/20
600/600 [==============================] – 7s – loss: 0.4198 – acc: 0.8250
Epoch 3/20
600/600 [==============================] – 8s – loss: 0.1933 – acc: 0.9250
Epoch 4/20
600/600 [==============================] – 7s – loss: 0.1165 – acc: 0.9567
Epoch 5/20
600/600 [==============================] – 6s – loss: 0.1224 – acc: 0.9500
Epoch 6/20
600/600 [==============================] – 6s – loss: 0.0970 – acc: 0.9667
Epoch 7/20
600/600 [==============================] – 7s – loss: 0.0639 – acc: 0.9850
Epoch 8/20
600/600 [==============================] – 7s – loss: 0.0841 – acc: 0.9700
Epoch 9/20
600/600 [==============================] – 8s – loss: 0.0934 – acc: 0.9733
Epoch 10/20
600/600 [==============================] – 7s – loss: 0.0677 – acc: 0.9767
Epoch 11/20
600/600 [==============================] – 6s – loss: 0.0705 – acc: 0.9650
Epoch 12/20
600/600 [==============================] – 7s – loss: 0.0548 – acc: 0.9783
Epoch 13/20
600/600 [==============================] – 7s – loss: 0.0533 – acc: 0.9800
Epoch 14/20
600/600 [==============================] – 7s – loss: 0.0517 – acc: 0.9850
Epoch 15/20
600/600 [==============================] – 7s – loss: 0.0665 – acc: 0.9750
Epoch 16/20
600/600 [==============================] – 7s – loss: 0.0273 – acc: 0.9917
Epoch 17/20
600/600 [==============================] – 7s – loss: 0.0291 – acc: 0.9933
Epoch 18/20
600/600 [==============================] – 6s – loss: 0.0245 – acc: 0.9917
Epoch 19/20
600/600 [==============================] – 7s – loss: 0.0376 – acc: 0.9883
Epoch 20/20
600/600 [==============================] – 7s – loss: 0.0440 – acc: 0.9917

Note that if we run fit() again, the model will continue to train with the parameters it has already learnt instead of re-initializing them.

Step 4: Test/evaluate the model.

preds = happyModel.evaluate(x = X_test, y = Y_test)
print (“Loss = ” + str(preds[0]))
print (“Test Accuracy = ” + str(preds[1]))

150/150 [==============================] – 0s

Loss = 0.167731122573
Test Accuracy = 0.94666667064

As can be seen, our model gets around 95% test accuracy in 20 epochs (and 99% train accuracy).

Test with my own image

Let’s test on my own image to see how well the model generalizes on unseen face images.

img_path = ‘me_happy.png’
img = image.load_img(img_path, target_size=(64, 64))


x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)

[[ 1.]]             # Happy !

Model Summary




2.  Face Recognition with Deep Neural Net

Face recognition problems commonly fall into two categories:

  1. Face Verification – “is this the claimed person?”. For example, at some airports, one can pass through customs by letting a system scan your passport and then verifying that he (the person carrying the passport) is the correct person. A mobile phone that unlocks using our face is also using face verification. This is a 1:1 matching problem.
  2. Face Recognition – “who is this person?”. For example, this video of Baidu employees entering the office without needing to otherwise identify themselves is an example of face recognition. This is a 1:K matching problem.

FaceNet learns a neural network that encodes a face image into a vector of 128 numbers. By comparing two such vectors, we can then determine if two pictures are of the same person.

In this assignment, we shall:

  • Implement the triplet loss function
  • Use a pretrained model to map face images into 128-dimensional encodings
  • Use these encodings to perform face verification and face recognition

In this exercise, we will be using a pre-trained model which represents ConvNet activations using a “channels first” convention, as opposed to the “channels last” convention.

In other words, a batch of images will be of shape (m,n_C,n_H,n_W) instead of (m,n_H,n_W,n_C). Both of these conventions have a reasonable amount of traction among open-source implementations; there isn’t a uniform standard yet within the deep learning community.

Naive Face Verification

In Face Verification, we’re given two images and we have to tell if they are of the same person. The simplest way to do this is to compare the two images pixel-by-pixel. If the distance between the raw images are less than a chosen threshold, it may be the same person!


Of course, this algorithm performs really poorly, since the pixel values change dramatically due to variations in lighting, orientation of the person’s face, even minor changes in head position, and so on.

We’ll see that rather than using the raw image, we can learn an encoding f(img) so that element-wise comparisons of this encoding gives more accurate judgments as to whether two pictures are of the same person.

Encoding face images into a 128-dimensional vector

Using an ConvNet to compute encodings

The FaceNet model takes a lot of data and a long time to train. So following common practice in applied deep learning settings, let’s just load weights that someone else has already trained. The network architecture follows the Inception model from Szegedy et al.. We are going to use an inception network implementation.

This network uses 96×96 dimensional RGB images as its input. Specifically, inputs a face image (or batch of m face images) as a tensor of shape (m,nC,nH,nW)=(m,3,96,96).
It outputs a matrix of shape (m,128) that encodes each input face image into a 128-dimensional vector.

Let’s create the model for face images.

FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print(“Total Params:”, FRmodel.count_params())

Total Params: 3743280

By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. We then use the encodings the compare two face images as follows:


By computing a distance between two encodings and thresholding, we can determine if the two pictures represent the same person.

So, an encoding is a good one if:

  • The encodings of two images of the same person are quite similar to each other
  • The encodings of two images of different persons are very different

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.


In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N).



Loading the trained model

FaceNet is trained by minimizing the triplet loss. But since training requires a lot of data and a lot of computation, we won’t train it from scratch here. Instead, we load a previously trained model. Let’s Load a model using the following code; this might take a couple of minutes to run.

FRmodel.compile(optimizer = ‘adam’, loss = triplet_loss, metrics = [‘accuracy’])

Here is the summary of the very deep inception network:


The next figure shows how the pre-trained deep inception network looks like:


Here’re some examples of distances between the encodings between three individuals:

Let’s now use this model to perform face verification and face recognition!

Applying the model

Face Verification

Let’s build a database containing one encoding vector for each person. To generate the encoding we use img_to_encoding(image_path, model) which basically runs the forward propagation of the model on the specified image.

Let’s build a database to map each person’s name to a 128-dimensional encoding of their face.

Now this can be used in an automated employee verification at the gate in an office in the following way: when someone shows up at the front door and swipes their ID card (thus giving us their name), we can look up their encoding in the database, and use it to check if the person standing at the front door matches the name on the ID.

Let’s implement the verify() function which checks if the front-door camera picture (image_path) is actually the person called “identity“. We shall have to go through the following steps:

  • Compute the encoding of the image from image_path
  • Compute the distance in between this encoding and the encoding of the identity image stored in the database
  • Open the door if the distance is less than the threshold  0.7, else do not open.

As presented above, we are going to use the L2 distance (np.linalg.norm).

def verify(image_path, identity, database, model):

Function that verifies if the person on the “image_path” image is “identity”.


image_path — path to an image

identity — string, name of the person you’d like to verify the identity. Has to be a resident of the Happy house.

database — python dictionary mapping names of allowed people’s names (strings) to their encodings (vectors).

model — your Inception model instance in Keras


dist — distance between the image_path and the image of “identity” in the database.
door_open — True, if the door should open. False otherwise.


### CODE HERE ###

return dist, door_open

Younes is trying to enter the  and the camera takes a picture of him (“camera_0.jpg”). Let’s run the above verification algorithm on this picture and compare with the one stored in the system (image_path):


verify(“camera_0.jpg”, “younes”, database, FRmodel)

# output
It’s younes, welcome home!
(0.67291224, True)

Benoit, has been banned from the office and removed from the database. He stole Kian’s ID card and came back to the house to try to present himself as Kian. The front-door camera took a picture of Benoit (“camera_2.jpg). Let’s run the verification algorithm to check if benoit can enter.


verify(“camera_2.jpg”, “kian”, database, FRmodel)

# output
It’s not kian, please go away
(0.86543155, False)

Face Recognition

In this case, we need to implement a face recognition system that takes as input an image, and figures out if it is one of the authorized persons (and if so, who). Unlike the previous face verification system, we will no longer get a person’s name as another input.

Implement who_is_it(). We shall have to go through the following steps:

  • Compute the target encoding of the image from image_path
  • Find the encoding from the database that has smallest distance with the target encoding.
  • Initialize the min_dist variable to a large enough number (100). It will help to keep track of what is the closest encoding to the input’s encoding.
  • Loop over the database dictionary’s names and encodings. To loop use for (name, db_enc) in database.items().
  • Compute L2 distance between the target “encoding” and the current “encoding” from the database.
  • If this distance is less than the min_dist, then set min_dist to dist, and identity to name.


def who_is_it(image_path, database, model):
Implements face recognition for the happy house by finding who is the person on the image_path image.

image_path — path to an image
database — database containing image encodings along with the name of the person on the image
model — your Inception model instance in Keras

min_dist — the minimum distance between image_path encoding and the encodings from the database
identity — string, the name prediction for the person on image_path

###  CODE HERE ###

return min_dist, identity

Younes is at the front-door and the camera takes a picture of him (“camera_0.jpg”). Let’s see if our who_it_is() algorithm identifies Younes.

who_is_it(“camera_0.jpg”, database, FRmodel)

# output
it’s younes, the distance is 0.672912
(0.67291224, ‘younes’)

We can change “camera_0.jpg” (picture of younes) to “camera_1.jpg” (picture of bertrand) and see the result.

who_is_it(“camera_1.jpg”, database, FRmodel)

# output
it’s bertrand, the distance is 0.474829
(0.47482917, ‘bertrand’)

Here is the takeaway:

  • Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
  • The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
  • The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.



  • Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering
  • Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf (2014). DeepFace: Closing the gap to human-level performance in face verification
  • The pretrained model we use is inspired by Victor Sy Wang’s implementation and was loaded using his code:
  • Implementation by Ng. et al. also took a lot of inspiration from the official FaceNet github repository:


EigenFaces and A Simple Face Detector with PCA/SVD in Python

In this article, a few problems will be discussed that are related to face reconstruction and rudimentary face detection using eigenfaces (we are not going to discuss about more sophisticated face detection algorithms such as Voila-Jones or DeepFace).

1. Eigenfaces

This problem appeared as an assignment in the edX course Analytics for Computing (by Georgia Tech)The following description of the problem is taken straightaway from the assignment.

This is an application of principal components analysis (PCA) and k-means to the analysis of “familiar faces.”  We’ll also create a simple interactive visualization for exploring this dataset (using bokeh).


Solving the PCA problem

The following figure shows the basic algorithm to compute a PCA, the interactive visual demo of which appears here.


The dataset: Some familiar faces

The dataset consists of a bunch of images of people’s faces taken from MIT Faces Recognition Project database.  There are 3991 images in total. A randomly selected  subset of 500 images are loaded, all the following experiments will be done on these images. They are shown below.


Preprocessing the images

To apply PCA, we’ll want to preprocess the images in various ways.

To begin with, it’s possible that that the images come in all shapes and sizes. The following code will figure out what is the largest height and width that are within the bounds of all the images.

import sys
import numpy as np
min_rows, min_cols = sys.maxsize, sys.maxsize
max_rows, max_cols = 0, 0
for (i, image) in enumerate(original_images):
    r, c = image.shape[0], image.shape[1]    
    min_rows = min(min_rows, r)
    max_rows = max(max_rows, r)
    min_cols = min(min_cols, c)
    max_cols = max(max_cols, c)
print("\n==> Least common image size:", min_rows, "x", min_cols, "pixels")
 ==> Least common image size: 128 x 128 pixels
  • Suppose the least common image size is r0×c0 pixels is the smallest dimension. Let’s crop each r×c image so that it is r0×c0 in size. If r>r0, then crop out any extra rows on the bottom of the image; and if c>c0, then center the columns of the image. Let’s store the output images in a 3-DNumpy array called images[:, :, :], where images[k, :, :] is the k-th image, the following code exactly does that.
defrecenter(image, min_rows, min_cols):
    r, c = image.shape
    top, bot, left, right = 0, r, 0, c
    if r > min_rows:
        top = r - min_rows  
    if c > min_cols:
        right = min_cols     
    return image[top:bot, left:right]

image0_recentered = recenter(image0, min_rows, min_cols)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
imshow_gray(image0, ax=axs[0])
imshow_gray(image0_recentered, ax=axs[1])

Let’s compute an “average” image, taken as the elementwise (pixelwise) mean over the images and display this “average face”.


Recall that PCA requires centered points. Let’s do that by subtracting the mean image from every image. The next figure shows couple of images and the ones obtained after mean subtraction.


From image set to a data matrix and back again

For PCA, we need a data matrix. Here is some code to convert our 3-D array of images into a 2-D data matrix, where we “flatten” each image into a 1-D vector by a simple reshape() operation.

# Create m x d data matrix
m = len(images)
d = min_rows * min_cols
X = np.reshape(images, (m, d))
# To get back to an image, just reshape it again
imshow_gray(np.reshape(X[int(len(X)/2), :], (min_rows, min_cols)))


Applying PCA

Let’s Compute the SVD of X. Store the result in three arrays, USigma, and VT, where U holds USigma holds just the diagonal entries of Σ, and VT holds V’.

U, Sigma, VT = np.linalg.svd(X, full_matrices=False)
# Sanity check on dimensions
print("X:", X.shape)
print("U:", U.shape)
print("Sigma:", Sigma.shape)
print("V^T:", VT.shape)
X: (500, 16384) 
U: (500, 500)
Sigma: (500,)
V^T: (500, 16384)

The following table and the plot inspect the singular values, i.e., the entries of Σ stored in Sigma. The plot will show the singular values as dots, plotted at each position x=for the i-th singular values. To give a rough idea of how quickly the singular values decay, the plot includes a solid line showing the curve, σ0/√(i+1).


Does the spectrum of these data decay quickly or slowly? How should that affect our choice of k, if we consider a k-truncated SVD?

The question is ill-defined and the answer is relative. In this case, a reasonable argument is that the spectrum actually decays somewhat slowly. Why? If we try fitting the singular values {σ_i} to functions of i, we’ll find that 1/√(i+1) is actually a pretty good fit. That is considered fairly slow decay; there would be significant compressibility if the curve dropped off exponentially (or at least superlinearly) instead.

Next, let’s plot the first few principal components. From what we computed above, each right singular vector has the same number of entries as there are pixels in an image. So, we could plot them as images by reshaping them. What do they appear to capture?


For example, the first (also third) principal component captures a male face, whereas the second (also fourth) one seems to capture a female face, the fifth one captures a face with long hairs.

Now let’s compute a new matrix Y, which is the original data matrix projected onto the first num_components principal components.
num_components = 5 # Number of principal components
Y = np.matmul(X, VT[:num_components,:].T)

The next plot visualizes the face datapoints projected on the first two principal components with bokeh with the thumbnails as the face images.


Next, let’s run  Run k-means on the projected data, Y[:m, :num_components], to try to identify up to num_clusters clusters.

The next animation shows the interactive plot of the face data points projected on the first two principal components with bokeh with the thumbnails as the face images.

For the 500 randomly selected faces, the green cluster seems to contain mostly the whiter / brighter faces whereas the red cluster mostly contains the darker faces, whereas the blue cluster seems to contain younger populations mostly.


2. Face Reconstruction and A  Simple Face Detector

This problem appeared as projects in this CMU machine learning for signal processing course and also in this UW Computer Vision course. The description of the problem is taken straightaway from the course projects.

We are given a corpus of facial images [here] from the LFWCrop database. Each image in this corpus is 64 x 64 and grayscale. We need to learn a typical (i.e. Eigen) face from them. There are 1071 (.pgm) images in this dataset. A few of them are shown below.


Computing the EigenFaces

As described in the original paper from MIT and this case study from JHU, the following figure shows the theory / algorithm for how to efficiently compute the eigenfaces from the (training) images and then choose top k eigenfaces (eigenvectors).


The next 4 figures show the meanface, the percentage variance captured by the eigenfaces and the top 100 and then top 400 eigenfaces computed computed from the images, respectively.


Projection on the EigenFaces and Reconstruction

Now, let’s project a face onto the face space to generate a vector of k coefficients, one for each of the k eigenfaces (for different values of k). Then let’s reconstruct the same face from the vector of coefficients computed.

The following figures and animations show how a given image can be reconstructed by projecting on the first few dominant eigenfaces and also how the reconstruction error  (residual) decreases as more and more eigenfaces are used for projection. Also, the reconstruction error goes down quite fast, with the selection of first few eigenfaces.

  Original Face Image
  Reconstructed Face Image by projecting on the truncated eigenfaces space

The Reconstruction Error when first k eigenfaces were used


The same is shown for yet another image.


The next figure shows the images projected on and reconstructed from the first two eigenfaces. The similar images stay in close proximity whereas the dissimilar ones (w.r.t. projection on the first 2 dominant eigenvectors) are far apart in the 2D space.

2deig.pngProjection of a Human vs. a Non-Human-Face (e.g. Cat) on the EigenFaces Space

Now let’s reconstruct a non-human object (a cat) to the same space spanned by the top-k eigenfaces for different values of k.  As we can see from the below figures, the reconstruction error is much higher for the cat, as expected.




Face Detection

The following figure taken from the same JHU vision lab case study shows the threshold-based face detection algorithm with eigenfaces.


Now, we are also given four group photographs with multiple faces [here]. We need to use the Eigen face to detect the faces in these photos. One such group photo is shown below: the group photo of the rockstars from The Beatles.


The faces in the group photographs may have different sizes. We also need to account for these variations.

To detect faces in the image, we need to scan the group photo and identify all regions in it that “match” the patterns in Eigen facemost. To “Scan” the image to find matches against an N×MN×M Eigen face, you must match every N×MN×M region of the photo against the Eigen face.

The “match” between any N×M region of an image and an Eigen face is given by the normalized dot product between the Eigen face and the region of the image being evaluated. The normalized dot product between an N×M Eigen face and a corresponding N×M segment of the image is given by EP/|P|, where E is the vector (unrolled) representation of the Eigen face, and P is the unrolled vector form of the N×M patch.

Scaling and Rotation

The Eigen face is fixed in size and can only be used to detect faces of approximately the same size as the Eigen face itself. On the other hand faces in the group photos are of different sizes — they get smaller as the subject gets farther away from the camera.

The solution to this is to make many copies of the eigen face and match them all.

In order to make your detection system robust, resize the Eigen faces from 64 pixels to 32×32, 48×48, 96×96, and 128×128 pixels in size.  Once we’ve scaled your eigen face, we will have a total of five “typical” faces, one at each level of scaling. We need to scan the group pictures with all of the five eigen faces. Each of them will give us a “match” score for each position on the image. If we simply locate the peaks in each of them, we may find all the faces. Sometimes multiple peaks will occur at the same position, or within a few pixels of one another. In these cases, we need to merge all of these (non-maximum suppression), they probably all represent the same face.

Additional heuristics may also be required (appropriate setting of thresholds, comparison of peak values from different scaling factors, addiitonal scaling, filtering by human body color thresholds etc.). These are for us to investigate.

The  faces detected using the window scanning and the threshold on the distance from the eigenfaces plane are shown below for the beatles group image.


Face Morphing

Implement morphing. Given two images, compute an animation that shows one being transformed continuously into the other, using the eigenfaces representation. Also, try transforming an image “away” from another one, to exaggerate differences between two faces. Make a video from your morphed faces.

Let’s try to morph the first face to the second one.


Here is the algorithm that we are going to use:

  1. For the first face first create a reconstruction using only a few (k) dominant eigenfaces.
  2. Iteratively reconstruct the first face using lesser and lesser eigenfaces and animate.
  3. Stop when we reached the reconstruction of the first face with only k eigenfaces.
  4. Now compute the coefficients of the second face using only those k eigenfaces,
  5. Next start using second face’s coefficients instead of the first face’s coefficients.
  6. Now iteratively increase the number of eigenfaces to be used to reconstruct the second face, till the point when all the eigenfaces are used.

The following animation shows such morphing:


Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow (using Transfer Learning with a Pre-trained VGG-19 Network) in Python


This problem appeared as an assignment in the online coursera course Convolution Neural Networks by Prof Andrew Ng, (  The description of the problem is taken straightway from the assignment.

Neural Style Transfer algorithm was created by Gatys et al. (2015) , the paper can be found here .

In this assignment, we shall:

  • Implement the neural style transfer algorithm
  • Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we  shall optimize a cost function to get pixel values!

Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

  1. a “content” image (C) and
  2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).


Let’s see how we can do this.

Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low level features (at the earlier layers) and high level features (at the deeper layers). The following figure (taken from the google image search results) shows how a VGG-19 convolution neural net looks like, without the last fully-connected (FC) layers.


We run the following code to load parameters from the pre-trained VGG-19 model serialized in a matlab file. This takes a few seconds.

model = load_vgg_model(“imagenet-vgg-verydeep-19.mat”)
import pprint

{‘avgpool1’: <tf.Tensor ‘AvgPool_5:0’ shape=(1, 150, 200, 64) dtype=float32>,
‘avgpool2’: <tf.Tensor ‘AvgPool_6:0’ shape=(1, 75, 100, 128) dtype=float32>,
‘avgpool3’: <tf.Tensor ‘AvgPool_7:0’ shape=(1, 38, 50, 256) dtype=float32>,
‘avgpool4’: <tf.Tensor ‘AvgPool_8:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘avgpool5’: <tf.Tensor ‘AvgPool_9:0’ shape=(1, 10, 13, 512) dtype=float32>,
‘conv1_1’: <tf.Tensor ‘Relu_16:0’ shape=(1, 300, 400, 64) dtype=float32>,
‘conv1_2’: <tf.Tensor ‘Relu_17:0’ shape=(1, 300, 400, 64) dtype=float32>,
‘conv2_1’: <tf.Tensor ‘Relu_18:0’ shape=(1, 150, 200, 128) dtype=float32>,
‘conv2_2’: <tf.Tensor ‘Relu_19:0’ shape=(1, 150, 200, 128) dtype=float32>,
‘conv3_1’: <tf.Tensor ‘Relu_20:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_2’: <tf.Tensor ‘Relu_21:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_3’: <tf.Tensor ‘Relu_22:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv3_4’: <tf.Tensor ‘Relu_23:0’ shape=(1, 75, 100, 256) dtype=float32>,
‘conv4_1’: <tf.Tensor ‘Relu_24:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_2’: <tf.Tensor ‘Relu_25:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_3’: <tf.Tensor ‘Relu_26:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv4_4’: <tf.Tensor ‘Relu_27:0’ shape=(1, 38, 50, 512) dtype=float32>,
‘conv5_1’: <tf.Tensor ‘Relu_28:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_2’: <tf.Tensor ‘Relu_29:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_3’: <tf.Tensor ‘Relu_30:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘conv5_4’: <tf.Tensor ‘Relu_31:0’ shape=(1, 19, 25, 512) dtype=float32>,
‘input’: <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.


For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.









How to ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.


First we need to compute the “content cost” using TensorFlow.

  • The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
  • When we minimize the content cost later, this will help make sure G
    has similar content as C.

def compute_content_cost(a_C, a_G):
Computes the content cost

a_C — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
a_G — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

J_content — scalar that we need to compute using equation 1 above.

# Retrieve dimensions from a_G
m, n_H, n_W, n_C = a_G.get_shape().as_list()

# Reshape a_C and a_G
a_C_unrolled = tf.reshape(tf.transpose(a_C), (m, n_H * n_W, n_C))
a_G_unrolled = tf.reshape(tf.transpose(a_G), (m, n_H * n_W, n_C))

# compute the cost with tensorflow
J_content = tf.reduce_sum((a_C_unrolled – a_G_unrolled)**2 / (4.* n_H * n_W *  \

return J_content


Computing the style cost

For our running example, we will use the following style image (S). This painting was painted in the style of impressionism, by  Claude Monet .



def gram_matrix(A):
A — matrix of shape (n_C, n_H*n_W)

GA — Gram matrix of A, of shape (n_C, n_C)

GA = tf.matmul(A, tf.transpose(A))
return GA


def compute_layer_style_cost(a_S, a_G):
a_S — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
a_G — tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

J_style_layer — tensor representing a scalar value, style cost defined above by equation (2)

# Retrieve dimensions from a_G
m, n_H, n_W, n_C = a_G.get_shape().as_list()

# Reshape the images to have them of shape (n_C, n_H*n_W)
a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))

# Computing gram_matrices for both images S and G (≈2 lines)
GS = gram_matrix(a_S)
GG = gram_matrix(a_G)

# Computing the loss
J_style_layer = tf.reduce_sum((GS – GG)**2 / (4.* (n_H * n_W * n_C)**2))

return J_style_layer


  • The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
  • Minimizing the style cost will cause the image G to follow the style of the image S.


Defining the total cost to optimize

Finally, let’s create and implement a cost function that minimizes both the style and the content cost. The formula is:



def total_cost(J_content, J_style, alpha = 10, beta = 40):
Computes the total cost function

J_content — content cost coded above
J_style — style cost coded above
alpha — hyperparameter weighting the importance of the content cost
beta — hyperparameter weighting the importance of the style cost

J — total cost as defined by the formula above.

J = alpha * J_content + beta * J_style
return J

  • The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
  • α and β are hyperparameters that control the relative weighting between content and style.


Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

  • Create an Interactive Session
  • Load the content image
  • Load the style image
  • Randomly initialize the image to be generated
  • Load the VGG19 model
  • Build the TensorFlow graph:
    • Run the content image through the VGG19 model and compute the content cost.
    • Run the style image through the VGG19 model and compute the style cost
      Compute the total cost.
    • Define the optimizer and the learning rate.
  • Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).

Now, we initialize the “generated” image as a noisy image created from the content_image. By initializing the pixels of the generated image to be mostly noise but still slightly correlated with the content image, this will help the content of the “generated” image more rapidly match the content of the “content” image. The following figure shows the noisy image:


Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

  • Assign the content image to be the input to the VGG model.
  • Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
  • Set a_G to be the tensor giving the hidden layer activation for the same layer.
  • Compute the content cost using a_C and a_G.

Next, we need to compute the style cost and compute the total cost J by taking a linear combination of the two. Use alpha = 10 and beta = 40.

Then we are going to  set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the tensorflow graph, assign the input image (initial generated image) as the input of the VGG19 model and runs the model to minimize the total cost J for a large number of iterations.


The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.



Style (Claud Monet’s The Poppy Field near Argenteuil)


















Style (Van Gogh’s The Starry Night)










Content (Victoria Memorial Hall)


Style (Van Gogh’s The Starry Night)




Content (Taj Mahal)


Style (Van Gogh’s Starry Night Over the Rhone)






Style (Claud Monet’s Sunset in Venice)




Content (Visva Bharati)biswa.jpg

Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )




Content (Howrah Bridge)


Style (Van Gogh’s The Starry Night)





Content (Leonardo Da Vinci’s Mona Lisa)


Style (Van Gogh’s The Starry Night)


Content (My sketch: Rabindranath Tagore)


Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )


Content (me)


Style (Van Gogh’s Irises)













Style (Publo Picaso’s Factory at Horto de Ebro)




The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.



Style (Van Gogh’s The Starry Night)



convolution layer 3_2 used


convolution layer 4_2 used


convolution layer 5_2 used



Some Optimization: Implementing the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) Algorithms with Octave / Matlab


The following problems appeared in a project in the edX course 236862.1x: Introduction to the Fundamentals of Sparse Representations by Prof. Michael Elad from The Technion – Israel Institute of Technology.  The description of the problems are taken straightaway from the project.


Pursuit Algorithms

In this article we demonstrate the Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) algorithms by running them on a set of test signals and checking whether they provide the desired outcome for the P0 problem.

Some Theory 

Our goal is to solve the following problem:


which is also known as the P0 problem (since the objective function to be minimized is the L0 norm of a vector x):  a problem that seeks the sparsest solution to a linear system.  The concept of seeking the sparsest representation for data are of central importance to data processing in a universal way (e.g., in machine learning, image processing, computer vision, remote sensing), although the problem is NP-hard in general.

In order to solve it in practice, some approximation algorithm is needed to be used, either using greedy or relaxation based approaches, as shown in the next figure.


In this article the implementation  of the above two algorithms and some experimental results on approximately solving the P0 problem with some synthetic signal dataset will be described. The following figure describes how the experiments are to be done:



Data Construction

In this part we create a dictionary and synthetic signals that will serve for us in the experiments.  Here are the ingredients involved:

  • The matrix A is the “dictionary” (of size 50-by-100) under which the ideal signal is known to be sparse. We need to first construct this dictionary by drawing at random a matrix with i.i.d. Gaussian entries. Normalize each atom to a unit L2 norm.

  • Then we need to generate a sparse vector x0 of cardinality s. Let’s draw at random the  locations of the non-zero entries. Then, draw at random the value of each entry from a uniform distribution in the range [1, 3] and multiply each by a random sign.

The following figure shows the heatmap of the dictionary A randomly generated.


Greedy Pursuit

The greedy algorithm OMP is described in the following figure:



Basis Pursuit 


Even though the P0 problem is relaxed to P1, it’s still not ready to be solved by an LP Solver since, the objective function is still not linear. As explained in this thesis, the P1 problem can be converted to a LP by introducing a bunch of new variables and using the trick shown in the following figure.


The matlab /octave function to implement the BP algorithm using LP is shown below:


Analyzing the Results

In this part we compare the results obtained via the OMP and BP, by executing the following steps.

  • Plot the average relative ℓ2 error, obtained by the OMP and BP versus the cardinality.
  • Plot the average support recovery error, obtained by the OMP and BP versus the cardinality.
  • Discuss the presented graphs and explain the results.





As can be seen from the above results,

• The Linear Programming-based relaxed implementation of Basis Pursuit has higher accuracy (in terms of both L2 and support-probability errors) than the greedy Orthogonal Matching Pursuit, particularly when the cardinality of the true solution increases beyond cardinality 6.

• Both OMP and BP (LP) performs equally well (with almost zero L2 error) upto cardinality 6 and then OMP clearly performs worse than the BP (LP).

• Although the average L2 error for OMP increases upto 0.2 when the cardinality of the true solution increases to 15, whereas that of BP only increases very slightly.

• Similar pattern can be seen for the probability of error in support.

Mutual Coherence of A and theoretical guarantees for OMP and BP


where the mutual coherence of matrix A is defined as follows:


For our experiment, the mutual coherence μ(A)  = 0.45697. Hence, as per the theoretical guarantee, the OMP and BP are guaranteed to find the solution of P0 for s < 1.594164. Although in practice we observe that till s = 6 the solution found is quite accurate, since the theoretical bound shown above is for the worst-case only.


Data Science with Python: Exploratory Analysis with Movie-Ratings and Fraud Detection with Credit-Card Transactions

The following problems are taken from the projects / assignments in the edX course Python for Data Science (UCSanDiagoX) and the coursera course Applied Machine Learning in Python (UMich).


1. Exploratory Analysis to Find Trends in Average Movie Ratings for different Genres


● The IMDB Movie Dataset (MovieLens 20M) is used for the analysis.
● The dataset is downloaded from here .
● This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users and was released in 4/2015.
● The csv files movies.csv and ratings.csv are used for the analysis.



● Understand the trend in average ratings for different movie genres over years (from 1995 to 2015) and Correlation between the trends for different genres (8 different genres are considered: Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical).

● This will give us an insight about how the people’s liking for the different movie genres change over time and about the strength of association between trends in between different movie genres, insights possibly useful for the critics.

Research Questions

The answer to the following research questions will be searched for, using exploratory analysis and visualization:

1) The trend in average ratings for different movie genres: How the average ratings for a few different movie genres (namely, Animation, Comedy, Romance, Thriller, Horror,
Sci-Fi and Musical) change over time (different years, from 1995 to 2015)? How can the average ratings for different genres be compared among themselves?

2) Correlation between the trends for different genres: How are the trends for the genres correlated? For example, are the average ratings for Comedy and Sci-Fi movies positively associated with each other? What is the strength of association?


Pre-processing the dataset

The next table shows the first few rows of the movies dataset, loaded in a pandas DataFrame.

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
print movies.shape
(27278, 3)
with in explode() function implementation similar to this stackoverflow post , the DataFrame is transformed to the one shown below.
movies.genres = movies.genres.str.split('|')
movies = explode(movies, ['genres'])
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy

The next table shows the first few rows of the ratings dataset, again loaded with pandas.

userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580

The input tables are pre-processed using the following code to get the data in the desired format, ready for the analysis.

import time
ratings['timestamp'] = ratings['timestamp'].apply(lambda x: time.strftime('%Y', time.localtime(x)))
userId movieId rating timestamp
0 1 2 3.5 2005
1 1 29 3.5 2005
2 1 32 3.5 2005
3 1 47 3.5 2005
4 1 50 3.5 2005
movies = movies.drop('title', axis=1)
movieId genres
0 1 Adventure
1 1 Animation
2 1 Children
3 1 Comedy
4 1 Fantasy
ratings = ratings.merge(movies, left_on='movieId', right_on='movieId', how='inner')
userId movieId rating timestamp genres
0 1 2 3.5 2005 Adventure
1 1 2 3.5 2005 Children
2 1 2 3.5 2005 Fantasy
3 5 2 3.0 1996 Adventure
4 5 2 3.0 1996 Children
(53177059, 5)
ratings = ratings.loc[ratings['genres'].isin(['Sci-Fi', 'Animation', 'Comedy', 'Romance', 'Thriller', 'Horror', 'Musical'])]
mean_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
mean_ratings.rename(columns={'timestamp': 'year'}, inplace=True) sd_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.std) sd_ratings.rename(columns={'timestamp': 'year'}, inplace=True) sd_ratings.head()
year genres rating
0 1995 Comedy 0.000000
1 1995 Romance NaN
2 1995 Thriller 1.414214
3 1996 Animation 0.953820
4 1996 Comedy 1.010918



Let’s first visualize (visualization done using seaborn) the distributions of the movie-ratings across different genres, as shown in the following figure.


The next figure shows the trends of the average ratings by users for different genres across different years.  The vertical error bars represent the standard deviations of the average ratings (ratings for different movies averaged over users) for the same genres (s.d. across the movies belonging to the same genre).

ratings2 = ratings.groupby(['movieId', 'timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
movieId year genres rating
0 1 1996 Animation 4.132947
1 1 1996 Comedy 4.132947
2 1 1997 Animation 3.875221
3 1 1997 Comedy 3.875221
4 1 1998 Animation 3.883929


The next figure shows the trends of the ratings (averaged over users and movies for each genre) for different genres across different years.


ratings3 = ratings.groupby(['userId', 'genres'], as_index=False)['rating'].aggregate(np.mean)
userId genres rating
0 1 Animation 3.650000
1 1 Comedy 3.731707
2 1 Horror 3.744444
3 1 Musical 3.666667
4 1 Romance 3.954545
pivot = ratings3.pivot(index='userId', columns='genres', values='rating')
genres Animation Comedy Horror Musical Romance Sci-Fi Thriller
1 3.650000 3.731707 3.744444 3.666667 3.954545 3.712500 3.761905
2 3.000000 3.900000 3.555556 3.000000 3.833333 4.608696 4.263158
3 3.750000 4.057692 3.937500 4.000000 4.062500 4.000000 4.260000
4 4.000000 3.545455 NaN 4.000000 3.500000 3.000000 3.461538
5 4.666667 4.083333 3.000000 4.375000 3.937500 4.600000 4.333333

The next figures show the trends for different genres for different sub-windows of time and with the variations (inter-quartile ranges)  in average ratings for each genre.


The next figures show how correlated are the trends for average ratings for different genres.


Some Findings

● There is a decreasing trend in the average ratings for all 8 genres during 1995-98, then the ratings become stable during 1999-2007, then again increase.

● Horror movies always have the lowest average ratings. Sci-Fi and Comedy movies also get low average ratings.

● Musical, Animation and Romance movies get the highest average ratings.

● Sci-Fi and Animation movies show very similar trends, they again become popular during 2009-2013.

● Trends in the average ratings of Romance and Horror movies show positive association between them.


2. Fraud Detection with Classification


In this project, we aim to build machine learning models to automatically detect frauds in credit card transactions. Several supervised binary classification models will be trained using 75-25 validation on this credit card transaction dataset from Kaggle. Given a transaction instance, a model will predict whether it is fraud or not. Different models will then be evaluated on a held-out subset of this data by measuring how effectively they predict instances of credit card fraud. The model with the best recall value (the one which is able to detect the highest number of true frauds) will be selected for prediction. We found that the k-Nearest Neighbors (k=3) and LogisticRegression models perform the best when only the recall (sensitivity) is concerned, whereas the RandomForest model gives pretty high specificity and sensitivity at the same time.



Using the credit card transaction dataset, we want to train a few machine learning models that can predict whether an unseen transaction in the future is likely to be fraud or not. This automatic prediction / detection of fraud can immediately raise an alarm and the transaction could be stopped before it completes. This can also help by expediting the process of manual inspection / examination / investigation of the predicted fraudulent transactions, thereby ensuring safety for the financial institution / bank / customers.


● This dataset from Kaggle is used for credit card fraud detection. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available here and here.

● The datasets contains transactions made by credit cards in September 2013 by the European cardholders. This dataset presents transactions that occurred in two days, where there were 492 frauds out of 284,807 transactions. The dataset is highly unbalanced, the positive class (frauds) account for 0.172% of all transactions. Each row in the dataset creditcard.csv corresponds to a credit card transaction. The dataset contains 284,807 rows and 30 columns.

● Features (Columns) include an integer variable Time (which represents the seconds elapsed between each transaction and the first transaction in the dataset), 28 confidential variables V1 through V28 (these are numerical input variables which are the result of a PCA transformation – unfortunately, due to confidentiality issues, the original features and more background information about the data could not be provided), as well as the variable Amount which is the amount of the transaction.

● The target is stored in the Class column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

● The next table shows the first few rows.


Data Preparation and Cleaning

● The main problem of this dataset is that it is highly unbalanced, the positive class (frauds) account for only 0.172% of all transactions. There are only 492 frauds out of 284,807 transactions: too many negative instances and too few positive (fraud) instances.

● In order to mitigate this high imbalance ratio, so that while training the models can see enough fraud examples, the following techniques were used.

  1. Oversampling: the Synthetic Minority Oversampling Technique (SMOTE) is used to  generate new fraud (minority class) samples with interpolation and k-nearest neighbors. Using this technique, the number of positive examples were increased to 5000 samples.
  2.  Under-sampling: the NeverMiss-1 Algorithm is used to select samples from the majority (non-fraud) class for which the average distance of the k nearest samples of the minority class is the smallest. Using this technique, the number of positive examples were decreased to 10000 samples.

● The input feature Time is not that relevant to predict fraud, that is why that feature is dropped from the input.


Research Questions

Predict whether a given transaction is fraudulent or not.

● Given a credit card transaction (represented by the values of the 30 input features), the goal is to answer the following question: is the transaction fraud?

● More mathematically, given the labelled data, we want to learn a function f :

from the data. Also we shall be interested to compute the conditional probability Prob(fraud|transaction).

● We want to use the function learnt to predict new transactions (not seen while learning the function ).

● We want to evaluate how correctly we can find frauds from the unseen data and find which model performs the best (model selection).



● The dataset is first split (with sklearn) into random train (75%) and test (25%) subsets. Both subsets roughly maintain the ratio of majority vs. minority class.

Split the data into train and test

from sklearn.model_selection import train_test_split

df = pd.read_csv(path + 'creditcard.csv')
df.drop('Time', 1, inplace=True)

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

● The training dataset is highly imbalanced (only 372 fraud instances out of 213,607 total instances) w.r.t. class-ratio, it’s balanced using SMOTE (oversampling: the number of the fraud instances to increase to 5000) and NeverMiss-1 (under-sampling: decreasing the number of the non-fraud instances to 10000). The test dataset is not touched.

Preprocessing: Resample to balance the dataset


● A few sklearn models (kNN, SVM, LogisticRegression, RandomForest, DecesionTree, AdaBoost, NaiveBayesian) are then trained separately on the training dataset and every time a model is learnt, it is used to predict the class of the hitherto-unseen test dataset.

● Given the class imbalance ratio, one of the recommend measures for model evaluation is the Area Under the Precision-Recall Curve (AUPRC), since Confusion matrix accuracy is not meaningful for unbalanced classification. e.g., Training a dummy classifier that classifies everything as the majority class of the training data, the accuracy of this classifier obtained is 0.996643 but the recall obtained is very poor: 0.008333, as shown below.

Using X_trainX_testy_train, and y_test (as defined above), let’s train a dummy classifier that classifies everything as the majority class of the training data and compute what is the accuracy of this classifier? What is the recall?

from sklearn.metrics import recall_score, precision_score
from sklearn.dummy import DummyClassifier 
clf = DummyClassifier(random_state=0), y_train) 
y_pred = clf.predict(X_test) 
accuracy = float(sum(y_pred==y_test))/len(y_test) 
recall = recall_score(y_test, y_pred) print (accuracy, recall)
# (0.9966433527148114, 0.0083333333333333332)

● Particularly we shall be interested in high Recall, since ideally we want all the fraud instances to be predicted correctly as fraud instances by the model, with zero False Negatives.



● The next figure shows the prediction evaluation results on the test dataset using the python sklearn Support Vector Classifier with RBF kernel. As can be seen, using training with resampling, the recall becomes quite high (~91.7%), although the accuracy and precision drops.

from sklearn.svm import SVC
clf = SVC(random_state=0, C=1e9, gamma=1e-07, probability=True), y_train) 
y_pred = clf.predict(X_test) 
y_prob = clf.predict_proba(X_test) 
precision, recall, _ = precision_recall_curve(y_test, y_pred) 
fpr, tpr, _ = roc_curve(y_test, y_prob[:,1], pos_label=1) 
print(auc(fpr, tpr)) 
cnf_matrix = confusion_matrix(y_test, y_pred) 
plot_confusion_matrix(cnf_matrix, classes=['0','1'], title='Confusion matrix, without normalization')


● The next figure shows the prediction evaluation results on the test dataset using the python sklearn LogisticRegression classifier. As can be seen, using training with resampling, the recall becomes even higher (~94.2%), although the accuracy and precision drops further.


● The next figure again shows the prediction recall values on the test dataset using the sklearn LogisticRegression classifier, but this time using GridSearch with different hyper-parameter values (for C and regularization).

from sklearn.linear_model import LogisticRegression
parameters = {'penalty': ['l1', 'l2'], 'C':[0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters, scoring='recall'), y_train)
print np.reshape(clf.cv_results_['mean_test_score'], (5,2))

● As can be seen, C = 100 with L1 penalty obtains the best recall on the test dataset among the values searched on the grid.

● Also, there are 120 fraud instances in the test dataset, out of which all but 7 are detected correctly with the best Logistic Regression Model. The next table shows a few highlighted in red for which the model failed to predict a fraud instance.

● The next figure shows visualizations of the classification decision functions learnt from the training dataset and the class labels predicted for the test dataset, projected on the variables V1 and V2, for different classifiers.


● With every classifier (specificity, sensitivity) metrics on the test dataset are also computed and shown.

● As can be seen, the RandomForest Classifier does a descent job in having both specificity and sensitivity quite high (89% and 88% resp.)

● The k-NearestNeibors (with k = 3) and LogisticRegression classifiers do the best in terms of recall (95%, 94% respectively) but their specificity is quite poor.



Limitations and Future Work

● The models are learnt from this particular credit card fraud dataset and hence may not generalize to other fraud datasets.

● The dataset being highly imbalanced, other methods (such as different variants of SMOTE and ADASYN oversampling, Tomek’s link / different variants of Edited Nearest Neighbor methods e.g. AllKNN) might be tried as pre-processing step, in order to create a more balanced dataset. Also, the resampling was done to result in a final training dataset with 1:2 ratio of the number of minority vs. majority instances, we could try different ratios (e.g., 1:1 or 2:3).

● 75-25 validation is used to evaluate the classifier models, instead k-fold (e.g., k=10) cross-validation could be used to obtain more generalizable models.

● Fine-tuning with hyper-parameters for all models could be done, it was only done with grid search for LogisticRegression. For probabilistic classifiers (e.g., Logistic Regression), we could try different thresholds for classification instead of default.

● We assumed the transactions to be independent of each other (not time dependent) and ignored the Time feature, one could use Time to build a sequential model for prediction.

● The entire dataset might be used with data augmentation for the minority fraud class and deep learning models (CNN or RNN if the Time feature is used) can be learnt and later used for prediction to improve specificity and sensitivity (recall) simultaneously.

● We could try to find the patterns in the input features for a probable fraud instance.



● As can be seen, with LogisticRegression and k-NearestNeighbors (k=3) classifiers we could obtain the best recall on the held-out dataset (although precision / specificity with these models are really poor). Hence, if recall is the only concerned metric, we can go ahead with either of these models for fraud detection / prediction of a fraud instance.
● If precision / specificity is also a secondary concern (e.g., we may not want too many False Positives / Alarms) along with the primary concern recall, we can go ahead with the RandomForest model.



● Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Credit card transaction dataset from Kaggle.