Solving Some Image Processing and Computer Vision Problems with Python libraries

In this article, a few image processing / computer vision problems and their solutions  with python libraries (scikit-image, cv2) will be discussed. Some of the problems are from the exercises from this book (available on amazon). This blog will be continued here.

Removing Gaussian Noise from images by computing mean and median images with scikit-image

  1. Start with an input image.
  2. Create n (e.g, n=100) noisy images by adding i.i.d. Gaussian noise (with zero mean) to the original image, with scikit-image.
  3. Compute the mean (median) of the noisy images.
  4. Compare PSNR with the original image.
  5. Vary n and compare the results.
from skimage import img_as_float
from skimage.util import random_noise
from skimage.measure import compare_psnr
from import imread
import matplotlib.pylab as plt
import numpy as np

im = img_as_float(imread('../new images/parrot.jpg')) # original image
# generate n noisy images from the original image by adding Gaussian noise
n = 25
images = np.zeros((n, im.shape[0], im.shape[1], im.shape[2]))
sigma = 0.2
for i in range(n):
    images[i,...] = random_noise(im, var=sigma**2)

im_mean = images.mean(axis=0)
im_median = np.median(images, axis=0)
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
plt.subplot(221), plt.imshow(im), plt.axis('off'), plt.title('Original image', size=20)
plt.subplot(222), plt.imshow(images[0]), plt.axis('off'), plt.title('Noisy PSNR: ' + str(compare_psnr(im, images[0])), size=20)
plt.subplot(223), plt.imshow(im_mean), plt.axis('off'), plt.title('Mean PSNR: ' + str(compare_psnr(im, im_mean)), size=20)
plt.subplot(224), plt.imshow(im_median), plt.axis('off'), plt.title('Median PSNR: ' + str(compare_psnr(im, im_median)), size=20)

The next figure shows the original image, a noisy image generated from it by adding Gaussian noise (with 0 mean) to it and the images obtained by taking mean / median over all the n noisy images generated. As can be seen, the Gaussian noise in the images gets cancelled out by taking mean / median.

with n = 25


with n=100

plt.hist(images[:,100,100,0], color='red', alpha=0.2, label='red')
plt.hist(images[:,100,100,1], color='green', alpha=0.2, label='green')
plt.hist(images[:,100,100,2], color='blue', alpha=0.2, label='blue')

The next figure shows how a pixel value (that can be considered a random variable) for a particular location in different noisy images follows approximately a Gaussian distribution.

Distribution of a pixel value at location (100,100) in the noisy images


ns = [25, 50, 100, 200]
# mean_psnrs contain the PSNR values for different n
plt.plot(ns, mean_psnrs, '.--', label='PSNR (mean)')
plt.plot(ns, median_psnrs, '.--', label='PSNR (median)')
plt.xlabel('n'),  plt.ylabel('PSNR')

The following figure shows that the PSNR improves with large n (since by SLLN / WLLN, the sample mean converges to population mean 0 of the Gaussian noise). Also, for median the improvement in the image quality is higher for larger values of n.


Tracking Pedestrians with HOG-SVM with OpenCV / scikit-image

  1. Start with a video with pedestrians.
  2. Capture the video / extract frames from the video.
  3. For each frame
    1. Create HOG scale pyramid of the frame image.
    2. At each scale, use a sliding window to extract the corresponding block from the frame, compute the HOG descriptor features.
    3. Use cv2‘s HOGDescriptor_getDefaultPeopleDetector() – a pre-trained SVM classifier on the HOG descriptor to classify whether the corresponding block contains a pedestrian or not.
    4. Run non-max-suppression to get rid of multiple detection of the same person.
    5. Use cv2‘s  detectMultiScale() function to implement steps 3-4.

The code is adapted from the code here and here.

# HOG descriptor using default people (pedestrian) detector
hog = cv2.HOGDescriptor()

# run detection, using a spatial stride of 4 pixels,
# a scale stride of 1.02, and zero grouping of rectangles
# (to demonstrate that HOG will detect at potentially
# multiple places in the scale pyramid)
(foundBoundingBoxes, weights) = hog.detectMultiScale(frame, winStride=(4, 4), padding=(8, 8), scale=1.02, finalThreshold=0, useMeanshiftGrouping=False)

# convert bounding boxes from format (x1, y1, w, h) to (x1, y1, x2, y2)
rects = np.array([[x, y, x + w, y + h] for (x, y, w, h) in foundBoundingBoxes])

# run non-max suppression on the boxes based on an overlay of 65%
nmsBoundingBoxes = non_max_suppression(rects, probs=None, overlapThresh=0.65)

cv2 functions are used to extract HOG descriptor features and pedestrian detection with SVM,  whereas scikit-image functions are used to visualize the HOG features. The animations below display the original video, what HOG sees and  the detected pedestrians after non-max suppression. Notice there are a few false positive detection.

Original Videoped1

HOG-descriptor features video (what HOG sees)pedh1Original Video with detected Pedestrians ped1o

Face Detection with HaarCascade pre-trained AdaBoost classifiers with OpenCV

  1. Capture video with webcam with cv2.VideoCapture().
  2. For each frame, use the pre-trained Adaboost Cascade classifiers (the haarcascade_frontalface_default classifier for face detection and haarcascade_eye_tree_eyeglasses classifier for better detection of the eyes with glasses, from the corresponding xml files that come with cv2’s installation) using Haar-like features with cv2.CascadeClassifier().
  3. First detect the face(s) with the detectMultiScale() function and draw a bounding box. Then detect the eyes inside a detected face with the same function.
  4. The following python code snippet shows how to detect faces and eyes with cv2. The code is adapted from here.


# read the cascade classifiers from the xml files from the correct path into face_cascade  # and eye_cascade
gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# return bounding box of the face(s) if one is detected
faces = face_cascade.detectMultiScale(gray, 1.03, 5)
for (x,y,w,h) in faces:
   frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2)
   roi_gray = gray[y:y+h, x:x+w]
   roi_color = frame[y:y+h, x:x+w]
   eyes = eye_cascade.detectMultiScale(roi_gray)
   for (ex,ey,ew,eh) in eyes:

The next animation shows the results of face detection when scalefactor 1.03 was used to create the scale pyramid.  As can be seen, the eyes with the glasses on and some small faces from the photos are not detected at this scale.


The next animation shows the results of face detection when scalefactor 1.3 was used to create the scale pyramid.  As can be seen, the eyes with/without the glasses on as well as most of the small faces from the photos are detected at this scale most of the time.



Object Tracking with OpenCV trackers


Object Saliency Detection with OpenCV


Linear / QR Barcode Generation and Detection with OpenCV




Image Segmentation with Random Walk with scikit-image



Image Segmentation with Grab-Cut with OpenCV


Segmentation with SLIC + RAG in scikit-image


Homography with scikit-image / OpenCV











Face Morphing (Beier-Neely morphing) with Pystasm

me14     sheldon



Object Detection with YOLO DarkNet / Keras / OpenCV (Deep Learning model)


Semantic Segmentation with ENet / DeepLab (Deep Learning  model)

Input video and the segmented Output video

Input video and the segmented Output video

Neural Style Transfer with OpenCV / Torch (Deep Learning Model)

Input  & Output with style (Starry night)me_style

Text Detection with EAST (Deep Learning Model)




Image Colorization with Deep Learning (OpenCV / Caffe)



OCR + Text Recognition with EAST + Tesseract



Solving Some Image Processing Problems with Python libraries

In this article a few popular image processing problems along with their solutions are going to be discussed. Python image processing libraries are going to be used to solve these problems. Some of the problems are from the exercises from this book  (available on amazon).


Image Transformations and Warping


0. Display RGB image color channels in 3D

  1. A gray-scale image can be thought of a 2-D function f(x,y) of the pixel locations (x,y), that maps each pixel into its corresponding gray level (an integer in [0,255], e.g.,).
  2. For an RGB image there are 3 such functions, f_R(x,y), f_G(x.y), f_B(x.y).
  3. matplotlib’s 3-D plot functions can be used to plot each function.

The following python code shows how to plot the RGB channels separately in 3D:

import matplotlib.pylab as plt
from mpl_toolkits.mplot3d import Axes3D
def plot_3d(X, Y, Z, title, cmap):
    # implement this function to plot the channel pixel values in 3D

im = imread('../images/parrot.jpg')
Y = np.arange(im.shape[0])
X = np.arange(im.shape[1])
Z1 = im[...,0]
Z2 = im[...,1]
Z3 = im[...,2]
plot_3d(X, Y, Z1, cmap='Reds', title='3D plot for the Red Channel')
plot_3d(X, Y, Z2, cmap='Greens', title='3D plot for the Green Channel')
plot_3d(X, Y, Z3, cmap='Blues', title='3D plot for the Blue Channel')

The RGB image


1. Wave Transform

  1. Use scikit-image’s warp() function to implement the wave transform.
  2. Note that wave transform can be expressed with the following equations:

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:

def wave(xy):
    xy[:, 1] += 20*np.sin(2*np.pi*xy[:, 0]/64)
    return xy

from import imread
from skimage.transform import warp
import matplotlib.pylab as plt
im = imread('images/mandrill.jpg')
im = warp(im, wave)

The next figure shows the original mandrill input image and the output image obtained after applying the wave transform.

mandrill      mandrill_w.png


2. Swirl Transform

  1. Use scikit-image’s warp() function to implement the swirl transform.
  2. Note that swirl transform can be expressed with the following equations

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:

def swirl(xy, x0, y0, R):
    r = np.sqrt((xy[:,1]-x0)**2 + (xy[:,0]-y0)**2)
    a = np.pi * r / R
    xy[:, 1] = (xy[:, 1]-x0)*np.cos(a) + (xy[:, 0]-y0)*np.sin(a) + x0
    xy[:, 0] = -(xy[:, 1]-x0)*np.sin(a) + (xy[:, 0]-y0)*np.cos(a) + y0
    return xy

im = imread('../images/mandrill.jpg')
im = warp(im, swirl, map_args={'x0':112, 'y0':112, 'R':512})

The next figure shows the original mandrill input image and the output image obtained after applying the swirl transform.

mandrill   ms.png


Compare this with the output of the scikit-image swirl() function.

3. Very simple Face morphing with α-blending

  1. Start from one face image (e.g., let image1 be the face of Messi) and end into another image (let image2 be the face of Ronaldo) iteratively, creating some intermediate images in between.
  2. At each iteration create an image by using a linear combination of the two image numpy ndarrays given by
     3. Iteratively increase α from 0 to 1.

The following code block shows how to implement it using matplotlib’s image and pylab modules.

im1 = mpimg.imread("../images/messi.jpg") / 255 # scale RGB values in [0,1]
im2 = mpimg.imread("../images/ronaldo.jpg") / 255
i = 1
for alpha in np.linspace(0,1,20):
 plt.imshow((1-alpha)*im1 + alpha*im2)
 i += 1
plt.subplots_adjust(wspace=0.05, hspace=0.05)

The next animation shows the simple face morphing:
There are more sophisticated techniques to improve the quality of morphing, but this is the simplest one.

4. Creating Instagram-like Gotham Filter

The Gotham filter

The Gotham filter is computed as follows (the steps taken from here), applying the following operations on an image, the corresponding python code, input and output images are shown along with the operations (with the following input image):


  1. A mid-tone red contrast boost
    from PIL import Image
    import numpy as np
    import matplotlib.pylab as plt
    im ='../images/city.jpg') # pixel values in [0,255]
    r, g, b = im.split()
    red_levels = [0., 12.75, 25.5, 51., 76.5, 127.5, 178.5, 204., 229.5, 242.25, 255.]
    r1 = Image.fromarray((np.reshape(np.interp(np.array(r).ravel(), np.linspace(0,255,len(red_levels)), red_levels), (im.height, im.width))).astype(np.uint8), mode='L')
    plt.title('original', size=20)
    im1 = Image.merge('RGB', (r1, g, b))
    plt.title('with red channel interpolation', size=20)
    plt.hist(np.array(r).ravel(), normed=True)
    plt.hist(np.array(r1).ravel(), normed=True)


  2. Make the blacks a little bluer
    plt.title('last image', size=20)
    b1 = Image.fromarray(np.clip(np.array(b) + 7.65, 0, 255).astype(np.uint8))
    im1 = Image.merge('RGB', (r1, g, b1))
    plt.title('with transformation', size=20)


  3. A small sharpening
    from PIL.ImageEnhance import Sharpness
    plt.title('last image', size=20)
    im2 = Sharpness(im1).enhance(3.0)
    plt.title('with transformation', size=20)


  4. A boost in blue channel for lower mid-tones
  5. A decrease in blue channel for upper mid-tones
    blue_levels = [0., 11.985, 30.09, 64.005, 81.09, 99.96, 107.1, 111.945, 121.125, 143.055, 147.9, 159.885, 171.105, 186.915, 215.985, 235.875, 255.]
    b2 = Image.fromarray((np.reshape(np.interp(np.array(b1).ravel(), np.linspace(0,255,len(blue_levels)), blue_levels), (im.height, im.width))).astype(np.uint8), mode='L')
    plt.title('last image', size=20)
    im3 = Image.merge('RGB', (r1, g, b2))
    plt.title('with blue channel interpolation', size=20)
    plt.hist(np.array(b1).ravel(), normed=True)
    plt.hist(np.array(b2).ravel(), normed=True)


The output image obtained after applying the Gotham filter is shown below:


Down-sampling with anti-aliasing using Gaussian Filter

  1. Start with a large gray-scale image and reduce the image size 16 times, by reducing both height and width by 4 times.
  2. Select every 4th pixel in the x and the y direction from the original image to compute the values of the pixels in the smaller image.
  3. Before down-sampling apply a Gaussian filter (to smooth the image) for anti-aliasing.
  4. Compare the quality of the output image obtained by down-sampling without a Gaussian filter (with aliasing).

The next code block performs the above steps. Since the Gaussian blur is a low-pass filter, it removes the high frequencies from the original input image, hence it’s possible to achieve sampling rate above the Nyquist rate (by sampling theorem) to avoid aliasing.

from scipy.ndimage import gaussian_filter
im = rgb2gray(imread('images/umbc.png'))
im_blurred = gaussian_filter(im, sigma=2.5) #(5,5,1)
n = 4 # create and image 16 times smaller in size
w, h = im.shape[0] // n, im.shape[1] // n
im_small = np.zeros((w,h))
for i in range(w):
   for j in range(h):
      im_small[i,j] = im[n*i, n*j]
im_small = np.zeros((w,h))
for i in range(w):
   for j in range(h):
      im_small[i,j] = im_blurred[n*i, n*j]

Original Image

    Image blurred with Gaussian Filter LPF blur_umbc.png

Down-sampled Image from the original image (with aliasing)


Down-sampled Image from the blurred image (with anti-aliasing)anti_alias_umbc

Some Applications of DFT


0. Fourier Transform of a Gaussian Kernel is another Gaussian Kernel

Also, the spread in the frequency domain  inversely proportional to the spread in the spatial domain (known as Heisenberg’s inequality). Here is the proof:


The following animation shows an example visualizing the Gaussian contours in spatial and corresponding frequency domains:



1. Using DFT to up-sample an image

  1. Let’s use the lena gray-scale image.
  2. First double the size of the by padding zero rows/columns at every alternate positions.
  3. Use FFT followed by an LPF.
  4. Finally use IFFT to get the output image.


The following code block shows the python code for implementing the steps listed above:

import numpy as np
import numpy.fft as fp
import matplotlib.pyplot as plt

im = np.mean(imread('images/lena.jpg'), axis=2)
im1 = np.zeros((2*im.shape[0], 2*im.shape[1]))
print(im.shape, im1.shape)
for i in range(im.shape[0]):
    for j in range(im.shape[1]):
        im1[2*i,2*j] = im[i,j]

def padwithzeros(vector, pad_width, iaxis, kwargs):
    vector[:pad_width[0]] = 0
    vector[-pad_width[1]:] = 0
    return vector

# the LPF kernel
kernel = [[0.25, 0.5, 0.25], [0.5, 1, 0.5], [0.25, 0.5, 0.25]]
# enlarge the kernel to the shape of the image
kernel = np.pad(kernel, (((im1.shape[0]-3)//2,(im1.shape[0]-3)//2+1), ((im1.shape[1]-3)//2,(im1.shape[1]-3)//2+1)), padwithzeros) 

plt.gray() # show the filtered result in grayscale

freq = fp.fft2(im1)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
freq_LPF = freq*freq_kernel # by the Convolution theorem
im2 = fp.ifft2(freq_LPF)
freq_im2 = fp.fft2(im2)

plt.title('Original Image', size=20)
plt.title('Padded Image', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq))).astype(int), cmap='jet')
plt.title('Original Image Spectrum', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_kernel))).astype(int), cmap='jet')
plt.title('Image Spectrum of the LPF', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_im2))).astype(int), cmap='jet')
plt.title('Image Spectrum after LPF', size=20)
plt.imshow(im2.astype(np.uint8)) # the imaginary part is an artifact
plt.title('Output Image', size=20)

The next figure shows the output. As can be seen from the next figure, the LPF removed the high frequency components from the Fourier spectrum of the padded image and with a subsequent inverse Fourier transform  we get a decent enlarged image.



2. Frequency Domain Gaussian Filter

  1. Use an input image and use DFT to create the frequency 2D-array.
  2. Create a small Gaussian 2D Kernel (to be used as an LPF) in the spatial domain and pad it to enlarge it to the image dimensions.
  3. Use DFT to obtain the Gaussian Kernel in the frequency domain.
  4. Use the Convolution theorem to convolve the LPF with the input image in the frequency domain.
  5. Use IDFT to obtain the output image.
  6. Plot the frequency spectrum of the image, the gaussian kernel and the image obtained after convolution in the frequency domain, in 3D.

The following code block shows the python code:

import matplotlib.pyplot as plt
from matplotlib import cm
from skimage.color import rgb2gray
from import imread
import scipy.fftpack as fp

im = rgb2gray(imread('images/temple.jpg'))
kernel = np.outer(signal.gaussian(im.shape[0], 10), signal.gaussian(im.shape[1], 10))
freq = fp.fft2(im)
assert(freq.shape == kernel.shape)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved = freq*freq_kernel # by the Convolution theorem
im_blur = fp.ifft2(convolved).real
im_blur = 255 * im_blur / np.max(im_blur)

# center the frequency response
plt.imshow( (20*np.log10( 0.01 + fp.fftshift(freq_kernel))).real.astype(int), cmap='coolwarm')

plt.imshow(im, cmap='gray')

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
# ... code for 3D visualization of the spectrums

The original color temple image (time / spatial domain)


The temple image (frequency domain)


The Gaussian Kernel LPF in 2D (frequency domain)


The Gaussian Kernel LPF (frequency domain)

The smoothed temple image with the LPF (frequency domain)


If we set the standard deviation of the LPF Gaussian kernel to be 10 we get the following output as shown in the next figures. As can be seen, the frequency response value drops much quicker from the center.

The smoothed temple image with the LPF with higher s.d. (frequency domain)


The output image after convolution (spatial / time domain)


3. Using the inverse filter to restore a motion-blurred image

  1. First create a motion blur kernel of a given shape.
  2. Convolve the kernel with an input image in the frequency domain.
  3. Get the motion-blurred image in the spatial domain with IDFT.
  4. Compute the inverse filter kernel and convolve with the blurred image in the frequency domain.
  5. Get the convolved image back in the spatial domain.
  6. Plot all the images and kernels in the frequency domain.

The following code block shows the python code:

im = rgb2gray(imread('../images/madurga.jpg'))
# create the motion blur kernel
size = 21
kernel = np.zeros((size, size))
kernel[int((size-1)/2), :] = np.ones(size)
kernel = kernel / size
kernel = np.pad(kernel, (((im.shape[0]-size)//2,(im.shape[0]-size)//2+1), ((im.shape[1]-size)//2,(im.shape[1]-size)//2+1)), padwithzeros)

freq = fp.fft2(im)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved1 = freq1*freq_kernel1
im_blur = fp.ifft2(convolved1).real
im_blur = im_blur / np.max(im_blur)

epsilon = 10**-6

freq = fp.fft2(im_blur)
freq_kernel = 1 / (epsilon + freq_kernel1)

convolved = freq*freq_kernel
im_restored = fp.ifft2(convolved).real
im_restored = im_restored / np.max(im_restored)

plt.title('Original image', size=20)
plt.title('Blurred image with motion blur kernel', size=20)
plt.title('Restored image with inverse filter', size=20)
plt.imshow(im_restored - im)
plt.title('Diff restored & original image', size=20)

# Plot the surface of the frequency responses here


Frequency response of the input image


(log) Frequency response of the motion blur kernel (LPF)


Input image convolved with the motion blur kernel (frequency domain)


(log) Frequency response of the inverse frequency filter kernel (HPF)


Motion-blurred image convolved with the inverse frequency filter kernel (frequency domain)



4. Impact of noise on the inverse filter

  1. Add some random noise to the Lena image.
  2. Blur the image with a Gaussian kernel.
  3. Restore the image using inverse filter.

With the original image

Let’s first blur and apply the inverse filter on the noiseless blurred image. The following figures show the outputs:


(log) Frequency response of the input imagelena_freq.png

(log) Frequency response of the Gaussian blur kernel (LPF)lena_gaussian_freq

(log) Frequency response of the blurred image


(log) Frequency response of the inverse kernel (HPF)


Frequency response of the output imagelena_inverse_spectrum

Adding noise to the original image

The following python code can be used to add Gaussian noise to an image:

from skimage.util import random_noise
im = random_noise(im, var=0.1)

The next figures show the noisy lena image, the blurred image with a Gaussian Kernel and the restored image with the inverse filter. As can be seen, being a high-pass filter, the inverse filter enhances the noise, typically corresponding to high frequencies.



5. Use a notch filter to remove periodic noise from the following half-toned car image.


  1. Use DFT to obtain the frequency spectrum of the image.
  2. Block the high frequency components that are most likely responsible fro noise.
  3. Use IDFT to come back to the spatial domain.


from scipy import fftpack
im = imread('images/halftone.png')
F1 = fftpack.fft2((im).astype(float))
F2 = fftpack.fftshift(F1)
for i in range(60, w, 135):
   for j in range(100, h, 200):
     if not (i == 330 and j == 500):
        F2[i-10:i+10, j-10:j+10] = 0
for i in range(0, w, 135):
   for j in range(200, h, 200):
     if not (i == 330 and j == 500):
        F2[max(0,i-15):min(w,i+15), max(0,j-15):min(h,j+15)] = 0
plt.imshow( (20*np.log10( 0.1 + F2)).astype(int),
im1 = fp.ifft2(fftpack.ifftshift(F2)).real
plt.imshow(im1, cmap='gray')

Frequency response of the input image

Frequency response of the input image with blocked frequencies with notch


Output image


With a low-pass-filter (LPF):

Frequency response of the input image with blocked frequencies with LPF

Output imagecar_notch

Histogram Matching with color images

As described here, here is the algorithm:

  1. The cumulative histogram is computed for each image dataset, see the figure below.
  2. For any particular value (xi) in the input image data to be adjusted has a cumulative histogram value given by G(xi).
  3. This in turn is the cumulative distribution value in the reference (template) image  dataset, namely H(xj). The input data value xi is replaced by xj.


im = imread('images/lena.jpg')
im_t = imread('images/vstyle.png')
im1 = np.zeros(im.shape).astype(np.uint8)
for i in range(3):
    c = cdf(im[...,i])
    c_t = cdf(im_t[...,i])
    im1[...,i] = hist_matching(c, c_t, im[...,i]) # implement this function with the above algorithm
    c1 = cdf(im1[...,i])
    col = 'r' if i == 0 else ('g' if i == 1 else 'b')
    plt.plot(np.arange(256), c, col + ':', label='input ' + col.upper(), linewidth=5)
    plt.plot(np.arange(256), c_t, col + '--', label='template ' + col.upper(), linewidth=5)
    plt.plot(np.arange(256), c1, col + '-', label='output ' + col.upper(), linewidth=2)
plt.title('CDF', size=20)
plt.legend(prop={'size': 15})


Input image                                                 

Template Imagevstyle

Output Image

The following figure shows how the histogram of the input image is matched with the histogram of the template image.


Another example:

Input image                                                 


Template Image


Output Image



Mathematical Morphology

1. Automatically cropping an image

  1. Let’s use the following image. The image has unnecessary white background  outside the molecule of the organic compound.L_2d
  2. First convert the image to a binary image and compute the convex hull of the molecule object.
  3. Use the convex hull image to find the bounding box for cropping.
  4. Crop the original image with the bounding box.

The next python code shows how to implement the above steps:

from PIL import Image
from import imread
from skimage.morphology import convex_hull_image
im = imread('../images/L_2d.jpg')
plt.title('input image')
im1 = 1 - rgb2gray(im)
threshold = 0.5
im1[im1  threshold] = 1
chull = convex_hull_image(im1)
plt.title('convex hull in the binary image')
imageBox = Image.fromarray((chull*255).astype(np.uint8)).getbbox()
cropped = Image.fromarray(im).crop(imageBox)'L_2d_cropped.jpg')
plt.title('cropped image')


This can also be found here.

2. Opening and Closing are Dual operations in mathematical morphology

  1. Start with a binary image and apply opening operation with some structuring element (e.g., a disk) on it to obtain an output image.
  2. Invert the image (to change the foreground to background and vice versa) and apply closing operation on it with the same structuring element to obtain another output image.
  3. Invert the second output image obtained and observe that it’s same as the first output image.
  4. Thus applying opening operation to the foreground of a binary image is equivalent to applying closing operation to the background of the same image with the same structuring element.

The next python code shows the implementation of the above steps.

from skimage.morphology import binary_opening, binary_closing, disk
from skimage.util import invert
im = rgb2gray(imread('../new images/circles.jpg'))
im[im  0.5] = 1
plt.title('original', size=20)
im1 = binary_opening(im, disk(12))
plt.title('opening with disk size ' + str(12), size=20)
im1 = invert(binary_closing(invert(im), disk(12)))
plt.title('closing with disk size ' + str(12), size=20)

As can be seen the output images obtained are exactly same.


Floyd-Steinberg Dithering (to convert a grayscale to a binary image)

The next figure shows the algorithm for error diffusion dithering.


def find_closest_palette_color(oldpixel):
    return int(round(oldpixel / 255)*255)

im = rgb2gray(imread('../my images/godess.jpg'))*255
pixel = np.copy(im)
w, h = im.shape

for x in range(w):
    for y in range(h):
        oldpixel = pixel[x][y]
        newpixel = find_closest_palette_color(oldpixel)
        pixel[x][y] = newpixel
        quant_error = oldpixel - newpixel
        if x + 1 < w-1:
          pixel[x + 1][y] = pixel[x + 1][y] + quant_error * 7 / 16
        if x > 0 and y < h-1:
           pixel[x - 1][y + 1] = pixel[x - 1][y + 1] + quant_error * 3 / 16
        if y < h-1:
           pixel[x ][y + 1] = pixel[x ][y + 1] + quant_error * 5 / 16
        if x < w-1 and y < h-1:
           pixel[x + 1][y + 1] = pixel[x + 1][y + 1] + quant_error * 1 / 16

plt.imshow(pixel, cmap='gray')

The input image (gray-scale)


The output Image (binary)


The next animation shows how an another input grayscale image gets converted to output binary image using the error diffusion dithering.


Sharpen a color image

  1. First blur the image with an LPF (e.g., Gaussian Filter).
  2. Compute the detail image as the difference between the original and the blurred image.
  3. Now the sharpened image can be computed as a linear combination of the original image and the detail  image. The next figure illustrates the concept.


The next python code shows how this can be implemented in python:

from scipy import misc, ndimage
import matplotlib.pyplot as plt
import numpy as np

def rgb2gray(im):
    return np.clip(0.2989 * im[...,0] + 0.5870 * im[...,1] + 0.1140 * im[...,2], 0, 1)

im = misc.imread('../my images/me.jpg')/255
im_blurred = ndimage.gaussian_filter(im, (5,5,0))
im_detail = np.clip(im - im_blurred, 0, 1)
fig, axes = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(15, 15))
axes = axes.ravel()
axes[0].set_title('Original image', size=15)
axes[1].set_title('Blurred image, sigma=5', size=15)
axes[2].set_title('Detail image', size=15)
alpha = [1, 5, 10]
for i in range(3):
    im_sharp = np.clip(im + alpha[i]*im_detail, 0, 1)
    axes[3+i].set_title('Sharpened image, alpha=' + str(alpha[i]), size=15)
for ax in axes:

The next figure shows the output of the above code block. As cane be seen, the output gets more sharpened as the value of alpha gets increased.


The next animation shows how the image gets more and more sharpened with increasing alpha.


Edge Detection with LOG and Zero-Crossing Algorithm by Marr and Hildreth

The following figure shows LOG filter and its DOG approximation.



In order to detect edges as a binary image, finding the zero-crossings in the LoG-convolved image was proposed by Marr and Hildreth. Identification of the edge pixels can be done by viewing the sign of the LoG-smoothed image by defining it as a binary image, the algorithm is as follows:

Algorithm to compute the zero-crossing 

  1. First convert the LOG-convolved image to a binary image, by replacing the pixel values by 1 for positive values and 0 for negative values.
  2. In order to compute the zero crossing pixels, we need to simply look at the boundaries of the non-zero regions in this binary image.
  3. Boundaries can be found by finding any non-zero pixel that has an immediate neighbor which is is zero.
  4. Hence, for each pixel, if it is non-zero, consider its 8 neighbors, if any of the neighboring pixels is zero, the pixel can be identified as an edge.

The next python code and the output images / animations generated show how to detect the edges from the  zebra image with LOG + zero-crossings:

from scipy import ndimage, misc
import matplotlib.pyplot as plt
from scipy.misc import imread
from skimage.color import rgb2gray
def any_neighbor_zero(img, i, j):
    for k in range(-1,2):
      for l in range(-1,2):
         if img[i+k, j+k] == 0:
            return True
    return False
def zero_crossing(img):
  img[img > 0] = 1
  img[img  0 and any_neighbor_zero(img, i, j):
        out_img[i,j] = 255
  return out_img

img = rgb2gray(imread('../images/zebras.jpg'))

fig = plt.figure(figsize=(25,15))
plt.gray() # show the filtered result in grayscale

for sigma in range(2,10, 2):
result = ndimage.gaussian_laplace(img, sigma=sigma)
plt.title('LoG with zero-crossing, sigma=' + str(sigma), size=30)


Original Input Imagezebras.jpg

Output with edges detected with LOG + zero-crossing at different sigma scales


With another input image

Output with edges detected with LOG + zero-crossing at different sigma scales



Constructing the Gaussian Pyramid with scikit-image transform module’s reduce function and Laplacian Pyramid from the Gaussian Pyramid and the expand function

The Gaussian Pyramid can be computed with the following steps:

  1. Start with the original image.
  2. Iteratively compute the image at each level of the pyramid first by smoothing the image (with gaussian filter) and then downsampling it .
  3. Stop at a level where the image size becomes sufficiently small (e.g., 1×1).

The Laplacian Pyramid can be computed with the following steps:

  1. Start with the Gaussian Pyramid and with the smallest image.
  2. Iteratively compute the difference image in between the image at the current level and the image obtained by first upsampling and then smoothing the image (with gaussian filter) from the previous level of the Gaussian Pyramid.
  3. Stop at a level where the image size becomes equal to the original image size.

The next python code shows how to create a Gaussian Pyramid from an image.

import numpy as np
import matplotlib.pyplot as plt
from import imread
from skimage.color import rgb2gray
from skimage.transform import pyramid_reduce, pyramid_expand, resize

def get_gaussian_pyramid(image):
    rows, cols, dim = image.shape
    gaussian_pyramid = [image]
    while rows> 1 and cols > 1:
        image = pyramid_reduce(image, downscale=2)
        rows //= 2
        cols //= 2
    return gaussian_pyramid

def get_laplacian_pyramid(gaussian_pyramid):
    laplacian_pyramid = [gaussian_pyramid[len(gaussian_pyramid)-1]]
    for i in range(len(gaussian_pyramid)-2, -1, -1):
        image = gaussian_pyramid[i] - resize(pyramid_expand(gaussian_pyramid[i+1]), gaussian_pyramid[i].shape)
    laplacian_pyramid = laplacian_pyramid[::-1]
    return laplacian_pyramid

image = imread('../images/antelops.jpeg')
gaussian_pyramid = get_gaussian_pyramid(image)
laplacian_pyramid = get_laplacian_pyramid(gaussian_pyramid)

w, h = 20, 12
for i in range(3):
    p = gaussian_pyramid[i]
    plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
    w, h = w / 2, h / 2

w, h = 10, 6
for i in range(1,4):
    p = laplacian_pyramid[i]
    plt.imshow(rgb2gray(p), cmap='gray')
    plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
    w, h = w / 2, h / 2


Some images from the Gaussian Pyramid


Some images from the Laplacian Pyramid



Blending images with Gaussian and Laplacian pyramids

Here is the algorithm:


Blending the following input images A, B with mask image M

Input Image A (Goddess Durga)madurga1

Input Image B (Lord Shiva)

Mask Image M


with the following python code creates the output image I shown below

A = imread('../images/madurga1.jpg')/255
B = imread('../images/mahadeb.jpg')/255
M = imread('../images/mask1.jpg')/255
# get the Gaussian and Laplacian pyramids, implement the functions
pyramidA = get_laplacian_pyramid(get_gaussian_pyramid(A))
pyramidB = get_laplacian_pyramid(get_gaussian_pyramid(B))
pyramidM = get_gaussian_pyramid(M)

pyramidC = []
for i in range(len(pyramidM)):
   im = pyramidM[i]*pyramidA[i] + (1-pyramidM[i])*pyramidB[i]
# implement the following function to construct an image from its laplacian pyramids
I = reconstruct_image_from_laplacian_pyramid(pyramidC)

Output Image I (Ardhanarishwar)
ardhanariswara_output.pngThe following animation shows how the output image is formed:




Another blending (horror!) example (from prof. dmartin)



Detection of a Human Object with HOG Descriptor Features using SVM (Primal QuadProg implementation using CVXOPT) in Python

In this article, first how to extract the HOG descriptor from an image will be discuss. Then how a support vector machine binary classifier can be trained on a dataset containing labeled images (using the extracted HOG descriptor features) and later how the SVM model can be used (along with a sliding window) to predict whether or not a human object exists in a test image will be described.  How SVM can be represented as a Primal Quadratic Programming problem and can be solved with CVXOPT that will also be discussed. This problem appeared as an assignment problem in this Computer Vision course from UCF.

Problem 1: Compute HOG features

Let’s first Implement Histogram of Orientated Gradients (HOG). The dataset to be used is the INRIA Person Dataset from here. The dataset consists of positive and negative examples for training as well as testing images. Let us do the following:

i. Take 2003 positive training images of size 96×160
ii. Take 997 negative training images of size 96×160
iii. Compute HOG for positive and negative examples.
iv. Show the visualization of HOG for some positive and negative examples.

The Histograms of Oriented Gradients for Human Detection (HOG) is a very heavily cited paper by N. Dalal and B. Triggs from CVPR 2005. The following figure shows the  algorithm proposed by them can be used to compute the HOG features for a 96×160 image:


The next python code snippet shows some helper functions to compute the hog features:

import numpy as np
from scipy import signal
import scipy.misc

def s_x(img):
    kernel = np.array([[-1, 0, 1]])
    imgx = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgx
def s_y(img):
    kernel = np.array([[-1, 0, 1]]).T
    imgy = signal.convolve2d(img, kernel, boundary='symm', mode='same')
    return imgy

def grad(img):
    imgx = s_x(img)
    imgy = s_y(img)
    s = np.sqrt(imgx**2 + imgy**2)
    theta = np.arctan2(imgx, imgy) #imgy, imgx)
    theta[theta<0] = np.pi + theta[theta<0]
    return (s, theta)


The following figures animations show some positive and negative training examples along with the HOG features computed using the algorithm.


Positive Example 1

The next animation shows how the HOG features are computed using the above algorithm.


Positive Example 2


The next animation shows how the HOG features are computed using the above algorithm.



Positive Example 3


The next animation shows how the HOG features are computed using the above algorithm.

Negative Example 1


The next animation shows how the HOG features are computed using the above algorithm.


Problem 2: Use sklearn’s SVC and 80-20 validation to compute accuracy on the held-out training images dataset using the extracted HOG features.

Before implementing SVC on our own with primal quadratic programming solver, let’s use the scikit-learn SVC implementation (with linear kernel) to train a support vector classifier on the training positive and negative examples using the HOG features extracted  from the training images with 80-20 validation and compute accuracy of classification on the held-out images.

The following python code does exactly that, with the X matrix containing the 1620 HOG features extracted from each image and the corresponding label (pos/neg, depending on whether human is present or not), with 98.5% accuracy on the held-out dataset.

import time
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=123)
timestamp1 = time.time()
clf = SVC(C=1, kernel='linear'), ytrain)
print("%d support vectors out of %d points" % (len(clf.support_vectors_), len(Xtrain)))
timestamp2 = time.time()
print "sklearn LinearSVC took %.2f seconds" % (timestamp2 - timestamp1)
ypred = clf.predict(Xtest)
print('accuracy', accuracy_score(ytest, ypred))

430 support vectors out of 2400 points
sklearn LinearSVC took 3.40 seconds
accuracy 0.985

The next figures show the confusion matrices for the prediction on the held-out dataset with the SVC model learnt.




Problem 3: Implement SVM by solving the Primal form of the problem using Quadratic Programming

Let’s implement Support Vector Machine (SVM) using Quadratic Programming. We shall use python’s CVXOPT package for this purpose. Let’s do the following:

i. Try to understand each input term in cvxopt.solvers.qp.
ii. Formulate soft- margin primal SVM in term of inputs of cvxopt.solvers.qp
iii. Show ‘P’, ‘Q’, ‘G”, ‘h’, ‘A’ and ‘b’ Matrices.
iv. Obtain parameter vector ‘w’ and bias term ‘b’ using cvxopt.solvers.qp


To be done


Problem 4: Detect Human in testing images using trained model (‘w’, ‘b’) from the last problem

Let’s use the coefficients learnt by the SVM model from the training dataset and do the following:

i. Take at least 5 testing images from Test/pos.
ii. Test the trained model over testing images. Testing can be performed using
w*feature vector + b.
iii. Use sliding window approach to obtain detection at each location in the image.
iv. Perform non-maximal suppression and choose the highest scored location.
v. Display the bounding box at the final detection.


To be done




Few Machine Learning Problems (with Python implementations)

In this article a few machine learning problems from a few online courses will be described.


1. Fitting the distribution of heights data

This problem appeared as an assignment problem in the coursera course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself. This video explains this problem and the solution in details.

In this assessment the steepest descent will be used to fit a Gaussian model to the distribution of heights data that was first introduced in Mathematics for Machine Learning: Linear Algebra.

The algorithm is the same as Gradient descent but this time instead of descending a
pre-defined function, we shall descend the χ2 (chi squared) function which is both a function of the parameters that we are to optimize, but also the data that the model is
to fit to.


We are given a dataset with 100 data-points for the heights of people in a population, with x as heights (in cm.) and y as the probability that there is a person with that height, first few datapoints are shown in the following table:

x y
0 51.25 0.0
1 53.75 0.0
2 56.25 0.0
3 58.75 0.0
4 61.25 0.0
5 63.75 0.0
6 66.25 0.0
7 68.75 0.0
8 71.25 0.0
9 73.75 0.0

The dataset can be plotted as a histogram, i.e., a bar chart where each bar has a width representing a range of heights, and an area which is the probability of finding a person with a height in that range, using the following code.

import matplotlib.pylab as plt
plt.figure(figsize=(15,5)), y, width=3, color=greenTrans, edgecolor=green)


We can model that data with a function, such as a Gaussian, which we can specify with two parameters, rather than holding all the data in the histogram.  The Gaussian function is given as,


By definition χ2 as the squared difference of the data and model, i.e.,  χ|− (xμσ)|^2.

x an y are represented as vectors here, as these are lists of all of the data points, the
|abs-squared|^2 encodes squaring and summing of the residuals on each bar.

To improve the fit, we shall want to alter the parameters μ and σ, and ask how that changes the χ2. That is, we will need to calculate the Jacobian,

Let’s look at the first term, (χ2)/μ, using the multi-variate chain rule, this can be written as,


A similar expression for (χ2)/σ can be obtained as follows:



The Jacobians rely on the derivatives f/μ and f/σ. It’s pretty straightforward to implement the python functions dfdmu() and dfdsig() to compute the derivatives.

Next recall that steepest descent shall move around in parameter space proportional to the negative of the Jacobian, i.e.,
with the constant of proportionality being the aggression of the algorithm.

The following function computes the expression for the Jacobian.

def steepest_step (x, y, mu, sig, aggression) :
 J = np.array([
 -2*(y - f(x,mu,sig)) @ dfdmu(x,mu,sig),
 -2*(y - f(x,mu,sig)) @ dfdsig(x,mu,sig)
 step = -J * aggression
 return step

We need to run a few rounds of steepest descent to fit the model. The next piece of code builds the model with steepest descent to fit the heights data:

# Do a few rounds of steepest descent.
for i in range(50) :
 dmu, dsig = steepest_step(x, y, mu, sig, 2000)
 mu += dmu
 sig += dsig
 p = np.append(p, [[mu,sig]], axis=0)

The following animations show the steepest descent path for the parameters and the model fitted to the data, respectively.

The data is shown in orange, the model in magenta, and where they overlap it’s shown in green.

χ2 is represented in the figure as the sum of the squares of the pink and orange bars.

This particular model has not been fit well with the initial guess – since there is not a strong overlap.

But gradually the model fits the data better as more and more iterations of steepest descent are run.


Note that the path taken through parameter space is not necessarily the most direct path, as with steepest descent we always move perpendicular to the contours.hst


2. Back-propagation

This problem also appeared as an assignment problem in the coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.

The below table shows the first few rows of the dataset. Here x is the input variable and y1, y2 are the output variables.


x y1 y2
0 0.00 0.500000 0.625000
1 0.01 0.500099 0.627015
2 0.02 0.500788 0.632968
3 0.03 0.502632 0.642581
4 0.04 0.506152 0.655407
5 0.05 0.511803 0.670852
6 0.06 0.519955 0.688198
7 0.07 0.530875 0.706641
8 0.08 0.544723 0.725326
9 0.09 0.561537 0.743386

The next figures show how the data looks:




To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.


We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.

Feed forward

The following figure shows the feed-forward equations,


In the following python code (taken from the same assignment) defines functions to set up our neural network. Namely an activation function, σ(z), it’s derivative, σ(z), a function to initialize weights and biases, and a function that calculates each activation of the network using feed-forward.

In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or relu.


sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4

# This function initialises the network with it's structure, it also resets any training already done.
def reset_network (n1 = 6, n2 = 7, random=np.random) :
 global W1, W2, W3, b1, b2, b3
 W1 = random.randn(n1, 1) / 2
 W2 = random.randn(n2, n1) / 2
 W3 = random.randn(2, n2) / 2
 b1 = random.randn(n1, 1) / 2
 b2 = random.randn(n2, 1) / 2
 b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0) :
 z1 = W1 @ a0 + b1
 a1 = sigma(z1)
 z2 = W2 @ a1 + b2
 a2 = sigma(z2)
 z3 = W3 @ a2 + b3
 a3 = sigma(z3)
 return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y) :
 return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size




Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.

f7 (1)

The cost function C is the sum (or average) of the squared losses over all training examples:



The following python code shows how the J_W3 function can be implemented.

# Jacobian for the third layer weights.
def J_W3 (x, y) :
 # First get all the activations and weighted sums at each layer of the network.
 a0, z1, a1, z2, a2, z3, a3 = network_function(x)
 # We'll use the variable J to store parts of our result as we go along, updating it in each line.
 # Firstly, we calculate dC/da3, using the expressions above.
 J = 2 * (a3 - y)
 # Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
 J = J * d_sigma(z3)
 # Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
 # i.e. dz3/dW3 = a2
 # and divide by the number of training examples, for the average over all training examples.
 J = J @ a2.T / x.size
 # Finally return the result out of the function.
 return J

The following python code snippet implements the Gradient Descent algorithm (where the parameter aggression represents the learning rate and noise acts as a regularization parameter here):

while iterations < max_iteration:
 j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
 j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
 j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
 j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
 j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
 j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)

 W1 = W1 - j_W1 * aggression
 W2 = W2 - j_W2 * aggression
 W3 = W3 - j_W3 * aggression
 b1 = b1 - j_b1 * aggression
 b2 = b2 - j_b2 * aggression
 b3 = b3 - j_b3 * aggression

The next figures and animations show how the prediction curve (pink / orange) with the neural net approaches the original curve (green) as it’s trained longer.

With learning rate = 7


The next figures and animations visualize all the curves learnt at different iterations.

With learning rate = 1



We can change parameters of the steepest descent algorithm, e.g., how aggressive the learning step is, and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the paramaters.


We can compute the model error (sum of square deviation in between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.



3. The Kernel Perceptron

This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta).

We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.

The next figure describes the theory and the algorithm (in dual form) for Kernel Perceptron. As can be seen, the kernel trick can be used both at the training and the prediction time to avoid basis expansion (by replacing the dot products of the expanded feature vectors with a Mercer Kernel).


The datasets on which we are going to classify with the dual perceptron algorithm are 2-dimensional datasets, each datapoint with a label +1 or -1, the first few datapoints of a dataset is shown below:

X1 X2 y
0 1.0 1.0 1.0
1 2.0 1.0 1.0
2 3.0 1.0 1.0
3 4.0 1.0 1.0
4 5.0 1.0 1.0


The ground-truth of the data-points are represented by their color (red and black) and marker type (circle and triangle respectively).

The data-point that is mis-classified in a particular iteration is shown in blue.

When a mis-classified point is selected, the corresponding alpha value is up-voted, this is indicated by increase in the size of the data-point.

The decision boundary for the two classes are shown with green and magenta colors, respectively.

The following figures and animations show the classification of the datasets using kernel perceptron with RBF and quadratic kernels. The next python code snippet implements the kernel functions.

import numpy as np
def kernel(x, z, type, s):
 if type == 'rbf':
 return np.exp(, x-z)/s**2)
 if type == 'quadratic':
 return (1 +, z))**2
 return, z) 



The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.

Dataset 1

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.


Dataset 2

Results with RBF Kernel

Results with quadratic Kernel

Dataset 3

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

  • C is the setting of the soft-margin parameter C (default: 1.0)
  • s (for the RBF kernel) is the scaling parameter s (default: 1.0)



Dataset 4

Results with RBF Kernel


Results with quadratic Kernel


Dataset 5

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel. kp5out5_002_059

Results with Kernel SVM Classifier (sklearn)

The following code and the figures show the decision boundaries and the support vectors (datapoints with larger size) learnt with sklearn SVC.

 from sklearn.svm import SVC
 x = data[:,0:2]
 y = data[:,2]
 clf = SVC(kernel=kernel, C=C, kernel = 'rbf', gamma=1.0/(s*s)),y)

Dataset 1

With polynomial kernel (degree=2, C=1)


With RBF kernel (C=10, σ = 10)



Dataset 2

With polynomial kernel (degree=2, C=1)


With RBF kernel (C=10, σ = 10)


Dataset 3

With RBF kernel (C=10, σ = 10)



4. Models for handwritten digit classification

This problem is taken from a few assignments from the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta). The problem description is taken from the course itself.

In this assignment we will build a few classifiers that take an image of a handwritten digit and outputs a label 0-9. We will start with a particularly simple strategy for this problem known as the nearest neighbor classifier, then a Gaussian generative model for classification will be built and finally an SVM model will be used for classification.

The MNIST dataset

MNIST is a classic dataset in machine learning, consisting of 28×28 gray-scale images handwritten digits. The original training set contains 60,000 examples and the test set contains 10,000 examples. Here we will be working with a subset of this data: a training set of 7,500 examples and a test set of 1,000 examples.

The following figure shows the first 25 digits from the training dataset along with the labels.


Similarly, the following figure shows the first 25 digits from the test dataset along with the ground truth labels.


Nearest neighbor for handwritten digit recognition


Squared Euclidean distance

To compute nearest neighbors in our data set, we need to first be able to compute distances between data points. A natural distance function is Euclidean distance: for two vectors x∈ ℝ^d, their Euclidean distance is defined as


Often we omit the square root, and simply compute squared Euclidean distance.

For the purposes of nearest neighbor computations, the two are equivalent: for three vectors xy∈ ℝ^d, we have xyxz if and only if xy∥^2xz∥^2.

The following python function squared Euclidean distance.

## Computes squared Euclidean distance between two vectors.
def squared_dist(x,y):
 return np.sum(np.square(x-y)

Computing nearest neighbors

Now that we have a distance function defined, we can now turn to (1-) nearest neighbor classification, with the following naive implementation with 0 training / pre-processing time.

## Takes a vector x and returns the index of its nearest neighbor in train_data
def find_NN(x):
 # Compute distances from x to every row in train_data
 distances = [squared_dist(x,train_data[i,]) for i in range(len(train_labels))]
 # Get the index of the smallest distance
 return np.argmin(distances)

## Takes a vector x and returns the class of its nearest neighbor in train_data
def NN_classifier(x):
 # Get the index of the the nearest neighbor
 index = find_NN(x)
 # Return its class
 return train_labels[index]

The following figure shows a test example correctly classified by finding the nearest training example and another incorrectly classified.


Processing the full test set

Now let’s apply our nearest neighbor classifier over the full data set.

Note that to classify each test point, our code takes a full pass over each of the 7500 training examples. Thus we should not expect testing to be very fast.

## Predict on each test data point (and time it!)
t_before = time.time()
test_predictions = [NN_classifier(test_data[i,]) for i in range(len(test_labels))]
t_after = time.time()

## Compute the error
err_positions = np.not_equal(test_predictions, test_labels)
error = float(np.sum(err_positions))/len(test_labels)

print("Error of nearest neighbor classifier: ", error)
print("Classification time (seconds): ", t_after - t_before)


(‘Error of nearest neighbor classifier: ‘, 0.046)
(‘Classification time (seconds): ‘, 41.04900002479553)

The next figure shows the confusion matrix for classification


Faster nearest neighbor methods

Performing nearest neighbor classification in the way we have presented requires a full pass through the training set in order to classify a single point. If there are N training points in ℝ^d, this takes O(Nd) time.

Fortunately, there are faster methods to perform nearest neighbor look up if we are willing to spend some time pre-processing the training set. scikit-learnhas fast implementations of two useful nearest neighbor data structures: the ball tree and the k-d tree.

from sklearn.neighbors import BallTree

## Build nearest neighbor structure on training data
t_before = time.time()
ball_tree = BallTree(train_data)
t_after = time.time()

## Compute training time
t_training = t_after - t_before
print("Time to build data structure (seconds): ", t_training)

## Get nearest neighbor predictions on testing data
t_before = time.time()
test_neighbors = np.squeeze(ball_tree.query(test_data, k=1, return_distance=False))
ball_tree_predictions = train_labels[test_neighbors]
t_after = time.time()

## Compute testing time
t_testing = t_after - t_before
print("Time to classify test set (seconds): ", t_testing)

(‘Time to build data structure (seconds): ‘, 0.3269999027252197)
(‘Time to classify test set (seconds): ‘, 6.457000017166138)

similarly, with the KdTree data structure we have the following runtime:

(‘Time to build data structure (seconds): ‘, 0.2889997959136963)
(‘Time to classify test set (seconds): ‘, 7.982000112533569)

Next let’s use sklearn’s KNeighborsClassifier to compare with the runtimes.

from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1), train_labels)
predictions = neigh.predict(test_data) 

(‘Training Time (seconds): ‘, 0.2999999523162842)
(‘Time to classify test set (seconds): ‘, 8.273000001907349)

The next figure shows the error rate on the test dataset with k-NearestNeighbor classifier with different values of k.


Training the 1-NN classifier on the entire training dataset with 60k images and testing on the entire testset with 10k images yields the the following results:

(‘Training Time (seconds): ‘, 19.694000005722046)
(‘Time to classify test set (seconds): ‘, 707.7590000629425)

with the following accuracy on the test dataset and the confusion matrix:

accuracy: 0.9691 (error 3.09%)


Gaussian generative models for handwritten digit classification

Recall that the 1-NN classifier yielded a 3.09% test error rate on the MNIST data set of handwritten digits. We will now see that a Gaussian generative model does almost as well, while being significantly faster and more compact.

For this assignment we shall be using the entire MNIST dataset, the training dataset contains 60k images and the test dataset contains 10k images.

Fit a Gaussian generative model to the training data

The following figure taken from the lecture videos from the same course describes the basic theory.




Let’s Define a function, fit_generative_model, that takes as input a training set (data x and labels y) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, we have:

  • pi[j]: the frequency of that label
  • mu[j]: the 784-dimensional mean vector
  • sigma[j]: the 784×784 covariance matrix

This means that pi is 10×1, mu is 10×784, and sigma is 10x784x784.

We need to fit a Gaussian generative model. The parameters pi, mu and sigma are computed with corresponding maximum likelihood estimates (MLE) values: empirical count, mean and covariance matrix for each of the class labels from the data. However, now there is an added ingredient.

The empirical covariances are very likely to be singular (or close to singular), which means that we won’t be able to do calculations with them. Thus it is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. (To put it another way, we compute the empirical covariances and then increase their diagonal entries by some constant c).

This modification is guaranteed to yield covariance matrices that are non-singular, for any c > 0, no matter how small. But this doesn’t mean that we should make c as small as possible. Indeed, c is now a parameter, and by setting it appropriately, we can improve the performance of the model. We will study regularization in greater detail over the coming weeks.

The following python code snippet shows the function:

def fit_generative_model(x,y):
 k = 10 # labels 0,1,...,k-1
 d = (x.shape)[1] # number of features
 mu = np.zeros((k,d))
 sigma = np.zeros((k,d,d))
 pi = np.zeros(k)
 c = 3500 # regularizer
 for label in range(k):
   indices = (y == label)
   pi[label] = ... # empirical count
   mu[label] = ... # empirical mean
   sigma[label] = ... # empirical regularized covariance matrix
 return mu, sigma, pi

Now let”s visualize the means of the Gaussians for the digits.


Time taken to fit the generative model (in seconds) : 2.60100007057

Make predictions on test data

Now let’s see how many errors the generative model makes on the test set.
 The model makes 438 errors out of 10000 (test accuracy: 95.562%)
Time taken to classify the test data (in seconds): 19.5959999561

The following figure shows the confusion matrix.


SVM for handwritten digit classification

The entire training dataset from the MNIST dataset  is used to train the SVM model, the training dataset contains 60k images and the test dataset contains 10k images.

First let’s try linear SVM, the following python code:

from sklearn.svm import LinearSVC
clf = LinearSVC(C=C, loss='hinge'),train_labels)
score = clf.score(test_data,test_labels)

The following figure shows the training and test accuracies of LinearSVC with different values of the hyper-parameter C.


Next let’s try SVM with quadratic kernel, as can be seen it gives 98.06% accuracy on the test dataset with C=1.

from sklearn.svm import SVC
clf = SVC(C=1., kernel='poly', degree=2),train_labels)
print clf.score(train_data,train_labels)
print clf.score(test_data,test_labels)

training accuracy: 1.0
test accuracy: 0.9806 (error: 1.94%)

The following figure shows the confusion matrix:



Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

  • First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
  • Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
  • Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
  • The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:


or as the following if the explicit bias term is discarded:


where the 0-1 loss is approximated by the hinge-loss.


  • Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
  • The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
  • Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
  • The learning rate η is gradually decreased with iteration.


The following figure shows a simplified version of the algorithm:




The following python code implements the algorithm:

class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
       self.n_iter = n_iter
       self.lambda1 = lambda1

    def fit(self, X, Y):
       Y = list(Y)
       # convert all output values to +1 or -1
       Yn = [sign(y, self.positive_class) for y in Y]
       X = X.toarray()
       m, n_features = X.shape[0], X.shape[1]
       self.w = numpy.zeros( n_features )
       for i in range(self.n_iter):
           eta = 1. / (self.lambda1*(i+1))
           j = numpy.random.choice(m, 1)[0]
           x, y = X[j], Yn[j]
           score =
           if y*score < 1:
              self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
              self.w = (1 - eta*self.lambda1)*self.w

Some Notes

  • The optional projection step has been left out (the line in square brackets in the paper).
  •  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
  •  The number η is the step length in gradient descent.
  • The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
  • Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
  • Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
  • Classification accuracy obtained on the test split was 95%.


2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.


Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1 flame

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
  • Classification accuracy obtained on the test split was 95.8%.


Dataset 2

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
  • Classification accuracy obtained on the test split was 100%.


3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
  • Classification accuracy obtained on the test split was 97.5%.


4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.


The next figures show the distribution of the word lengths for the texts for different sentiments.



  • Create sklearn Pipeline to
    • Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
    • Use sklearn SelectKBest to choose top 5000 features
    • Train / test the model
  • Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
  • Also Record the time taken to train the classifier.
  • Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.


The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
    classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
    tokenizer = lambda x: x)),
    ('fs', SelectKBest(k=5000)),
    ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
    t0 = time.time(), Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))


Some Notes

  • The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
  • This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
  • The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
  • For Perceptron the entire training data was passed 10 times to the model while training.
  • For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.


As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.




Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and python respectively) and application on a few datasets are going to be discussed.  The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

SVM in a nutshell

  • Given a (training) dataset consisting of positive and negative class instances.
  • Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
  • If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
  • In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
  • A slack variable per training point is introduced to include the classification errors (for the miss-classified points in the training dataset) in the objective, this can also be thought of adding regularization.
  • The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
  • It is easier to solve the optimization problem in the dual rather than the primal space, since there are less number of variables.
  • Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
  • If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensions only by computing the (kernel) gram matrix /  dot-product of the datapoints (the matrix needs to positive semi-definite as per Mercer’s theorem).
  • Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
  • The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-liear decision boundary, SVs correspond to the non-zero values of the dual variable / the primal Lagrange multipler), that’s why the name SVM.
  • Once the dual optimization problem is solved ,  the values of the primal variables are computed to construct the hyper-plane / decision surface.
  • Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
  • The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.


The following figure describes the soft-margin SVM in a more formal way.


The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).


The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).


The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

linear.kernel <- function(x1, x2) {
 return (x1%*%x2)
} <- function(X, y, FUN=linear.kernel, C=NULL) {
 n.samples <- nrow(X)
 n.features <- ncol(X)
 # Gram matrix
 K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
 for (i in 1:n.samples){
  for (j in 1:n.samples){
   K[i,j] <- FUN(X[i,], X[j,])
 Dmat <- outer(y,y) * K
 Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to nearest pd matrix
 dvec <- rep(1, n.samples)
 if (!is.null(C)) { # soft-margin
  Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
  bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
 } else {           # hard-margin
  Amat <- rbind(y, diag(n.samples))
  bvec <- c(0, rep(0, n.samples))
 res <- solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1)
 a = res$solution # Lagrange multipliers
 # Support vectors have non-zero Lagrange multipliers
 # ...




The following python code snippet adapted from here and from  Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

import numpy as np
import cvxopt
def fit(X, y, kernel, C):
    n_samples, n_features = X.shape
    # Compute the Gram matrix
    K = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
       for j in range(n_samples):
           K[i,j] = kernel(X[i], X[j])
    # construct P, q, A, b, G, h matrices for CVXOPT
    P = cvxopt.matrix(np.outer(y,y) * K)
    q = cvxopt.matrix(np.ones(n_samples) * -1)
    A = cvxopt.matrix(y, (1,n_samples))
    b = cvxopt.matrix(0.0)
    if C is None:      # hard-margin SVM
       G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
       h = cvxopt.matrix(np.zeros(n_samples))
    else:              # soft-margin SVM
       G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1), np.identity(n_samples))))
       h = cvxopt.matrix(np.hstack((np.zeros(n_samples), np.ones(n_samples) * C)))
    # solve QP problem
    solution = cvxopt.solvers.qp(P, q, G, h, A, b)
    # Lagrange multipliers
    a = np.ravel(solution['x'])
    # Support vectors have non zero lagrange multipliers
    sv = a > 1e-5 # some small threshold
    # ...



  1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
  2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
  3. Although python CVXOPT will allow very small numerical errors in P matrix with a warning message, R quardprog will strictly require that the Dmat matrix is strictly positive definite, otherwise it will fail.
  4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
  5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
  6. Cases corresponding to the hard and soft-margin SVMs must be handled separately, otherwise it will lead to inconsistent system of solutions.


Using the SVM implementations for classification on some datasets

The datasets



For each dataset, the 80-20 Validation on the dataset is used to

  • First fit (train) the model on randomly selected 80% samples of the dataset.
  • Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
  • Different values of the hyper-parameter C and different kernels are used.
  • For the polynomial kernel, polynomial of degree 3 is used and the RBF kernel with the standard deviation of 5 is used, although these hyper-parameters can be tuned too.



As can be seen from the results below,

  • The points with blue circles are the support vectors.
  • When the C value is low (close to hard-margin SVM), the model learnt tends to overfit the training data.
  • When the C value is high (close to soft-margin SVM), the model learnt tends to be more generalizable (C acts as a regularizer).
  • There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
  • The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
  • The decision boundaries are also shown.


With the Python (CVXOPT) implementation






With the R (quadprog) implementation












Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

  • The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
  • The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
  • The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
    the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.


The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

import numpy as np
import sys

### Interface
class Environment(object):

def reset(self):
raise NotImplementedError('Inheriting classes must override reset.')

def actions(self):
raise NotImplementedError('Inheriting classes must override actions.')

def step(self):
raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):

def __init__(self, actions):
self.actions = actions
self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"):
super(BanditEnv, self).__init__()

self.action_space = ActionSpace(range(num_actions))
self.distribution = distribution


self.reward_parameters = None
if distribution == "bernoulli":
self.reward_parameters = np.random.rand(num_actions)
elif distribution == "normal":
self.reward_parameters = (np.random.randn(num_actions),
elif distribution == "heavy-tail":
self.reward_parameters = np.random.rand(num_actions)
print("Please use a supported reward distribution") #, flush = True)

if distribution != "normal":
self.optimal_arm = np.argmax(self.reward_parameters)
self.optimal_arm = np.argmax(self.reward_parameters[0])

def reset(self):
self.is_reset = True
return None

def compute_gap(self, action):
if self.distribution != "normal":
gap = np.absolute(self.reward_parameters[self.optimal_arm] -
gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
return gap

def step(self, action):
self.is_reset = False

valid_action = True
if (action is None or action = self.action_space.n):
print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True)
reward = float("-inf")
gap = float("inf")
valid_action = False

if self.distribution == "bernoulli":
if valid_action:
reward = np.random.binomial(1, self.reward_parameters[action])
gap = self.reward_parameters[self.optimal_arm] -
elif self.distribution == "normal":
if valid_action:
reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
elif self.distribution == "heavy-tail":
if valid_action:
reward = self.reward_parameters[action] + np.random.standard_cauchy()
gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap
print("Please use a supported reward distribution")#, flush = True)

return(None, reward, self.is_reset, '')

#Policy interface
class Policy:
#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
def __init__(self, num_actions):
self.num_actions = num_actions

def act(self):

def feedback(self, action, reward):

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.



  • The next figure shows the reward distributions for a 5-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
  2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
  3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
  4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.






  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.



Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
  2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets. 
  3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
  4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).


ε = 0



ε = 0.05



ε = 0.1


ε = 0.15


ε = 1.0



Optimistic Greedy


R = 0


R = 1




R = 3



R = 5



R = 10000



The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.


  • The next figure shows the reward distributions for a 10-armed bandit framework.
  • If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
  • Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
  • We first generate the parameters for the distribution corresponding to each arm randomly.
  • As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
  • Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.


Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

  1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
  2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
  3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
  4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
  5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.




Thompson Beta


Thompson Normal


Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.


Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:


  • The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
  • The 4-grams are chosen so that all the words involved come from a small
    vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
  • Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
ans =
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left

  • The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
  • Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

  • The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
  • Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
%ans =
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
%ans = 91
%ans = we

  • The validation and test data are also similar. They contain 46,568 4-grams each.
  • Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
  • The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
  • First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
  • At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
  • The training method will output a ‘model’ (weight matrices, biases for each layer in the network).


Description of the Network


  • As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
  • The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
  • This layer is connected to the hidden layer, which in turn is connected to the output layer.
  • The output layer is a softmax over the 250 words.
  • The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.


Forward Propagation

  • The forward propagation is pretty straight-forward and can be implemented as shown in the following code:
    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
     fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
     hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    % input_batch: The input data as a matrix of size numwords X batchsize where,
    % numwords is the number of words, batchsize is the number of data points.
    % So, if input_batch(i, j) = k then the ith word in data point j is word
    % index k of the vocabulary.
    % word_embedding_weights: Word embedding as a matrix of size
    % vocab_size X numhid1, where vocab_size is the size of the vocabulary
    % numhid1 is the dimensionality of the embedding space.
    % embed_to_hid_weights: Weights between the word embedding layer and hidden
    % layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
    % number of hidden units.
    % hid_to_output_weights: Weights between the hidden layer and output softmax
    % unit as a matrix of size numhid2 X vocab_size
    % hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    % output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    % Outputs:
    % embedding_layer_state: State of units in the embedding layer as a matrix of
    % size numhid1*numwords X batchsize
    % hidden_layer_state: State of units in the hidden layer as a matrix of size
    % numhid2 X batchsize
    % output_layer_state: State of units in the output layer as a matrix of size
    % vocab_size X batchsize
    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
     word_embedding_weights(reshape(input_batch, 1, []),:)',...
     numhid1 * numwords, []);
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
     repmat(hid_bias, 1, batchsize);
    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);
    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
     - repmat(max(inputs_to_softmax), vocab_size, 1);
    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);
    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
     sum(output_layer_state, 1), vocab_size, 1);




  •  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.



  • As the model trains it prints out some numbers that tell how well the training is going.
  • The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
  • After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
  • The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
  • At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.


Some Applications

1. Predict next word

  • Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
  • The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
  • It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.




2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

  1. Given 3 words to start from, initialize the text with those 3 words.
  2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
  3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
  4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:

function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');


Starting with the words  'i was going‘, here are some texts that were generated using the model:


Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:


3. Find nearest words

  •  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
  • As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
  • For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.




4. Visualization in 2-dimension with t-SNE

  •  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
  • As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
  • Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).



5. Solving Word-Analogy Problem

  •  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
  • The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.


  • The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.
    analogy('year', 'years', 'day', model); % singular-plural relation
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
    analogy('on', 'off', 'new', model) % antonyms relation
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
    analogy('use', 'used', 'do', model) % present-past relation
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
    analogy('he', 'his', 'they', model) % pronoun-relations
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
    analogy('today', 'yesterday', 'now', model)
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


Model Selection

  • Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
  • The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
  • The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.



Autonomous Driving – Car detection with YOLO Model with Keras in Python

In this article, object detection using the very powerful YOLO model will be described, particularly in the context of car detection for autonomous driving. This problem appeared as an assignment in the coursera course Convolution Networks which is a part of the Deep Learning Specialization (taught by Prof. Andrew Ng.,  from Stanford and, the lecture videos corresponding to the YOLO algorithm can be found here).  The problem description is taken straightaway from the assignment.

Given a set of images (a car detection dataset), the goal is to detect objects (cars) in those images using a pre-trained YOLO (You Only Look Once) model, with bounding boxes. Many of the ideas are from the two original YOLO papers: Redmon et al., 2016  and Redmon and Farhadi, 2016 .

Some Theory

Let’s first clear the concepts regarding classification, localization, detection and how the object detection problem can be transformed to supervised machine learning problem and subsequently can be solved using a deep convolution neural network. As can be seen from the next figure,

  • Image classification with localization aims to find the location of an object in an image by not only classifying the image (e.g., a binary classification problem: whether there is a car in an image or not), but also finding a bounding box around the object, if one found.
  • Detection goes a level further by aiming to identify multiple instances of same/ different types of objects, by marking their locations (the localization problem usually tries to find a single object location).
  • The localization problem can be converted to a supervised machine learning multi-class classification problem in the following way: in addition to the class label of the object to be identified, the output vector corresponding to an input training image must also contain the location (bounding box coordinates relative to image size) of the object.
  • A typical output data vector will contain 8 entries for a 4-class classification, as shown in the next figure, the first entry will correspond to whether or not an object of any from the 3 classes of objects. In case one is present in an image, the next 4 entries will define the bounding box containing the object, followed by 3 binary values for the 3 class labels indicating the class of the object. In case none of the objects are present, the first entry will be 0 and the others will be ignored.



  • Now moving from localization to detection, one can proceed in two steps as shown below in the next figure: first use small tightly cropped images to train a convolution neural net for image classification and then use sliding windows of different window sizes (smaller to larger) to classify a test image within that window using the convnet learnt and run the windows sequentially through the entire image, but it’s infeasibly slow computationally.
  • However, as shown in the next figure, the convolutional implementation of the sliding windows by replacing the fully-connected layers by 1×1 filters makes it possible to simultaneously classify the image-subset inside all possible sliding windows parallelly, making it much more efficient computationally.



  • The convolutional sliding windows, although computationally much more efficient, still has the problem of detecting the accurate bounding boxes, since the boxes don’t align with the sliding windows and the object shapes also tend to be different.
  • YOLO algorithm overcomes this limitation by dividing a training image into grids and assigning an object to a grid if and only if the center of the object falls inside the grid, that way each object in a training image can get assigned to exactly one grid and then the corresponding bounding box is represented by the coordinates relative to the grid. The next figure described the details of the algorithm.
  • In the test images, multiple adjacent grids may think that an object actually belongs to them, in order to resolve the iou (intersection of union) measure is used to find the maximum overlap and the non-maximum-suppression algorithm is used to discard all the other bounding boxes with low-confidence of containing an object, keeping the one with the highest confidence among the competing ones and discard the others.
  • Still there is a problem of multiple objects falling in the same grid. Multiple anchor boxes (of different shapes) are used to resolve the problem, each anchor box of a particular shape being likely to eventually detect  an object of a particular shape.



The following figure shows the slides taken from the presentation You Only Look Once: Unified, Real-Time Object Detection in the CVPR 2016 summarizing the algorithm:


Problem Statement

Let’s assume that we are working on a self-driving car. As a critical component of this project, we’d like to first build a car detection system. To collect data, we’ve mounted a camera to the hood (meaning the front) of the car, which takes pictures of the road ahead every few seconds while we drive around.

The above pictures are taken from a car-mounted camera while driving around Silicon Valley.  We would like to especially thank for providing this dataset! is a company building the brains of self-driving vehicles.


We’ve gathered all these images into a folder and have labelled them by drawing bounding boxes around every car we found. Here’s an example of what our bounding boxes look like.

Definition of a box


If we have 80 classes that we want YOLO to recognize, we can represent the class label c either as an integer from 1 to 80, or as an 80-dimensional vector (with 80 numbers) one component of which is 1 and the rest of which are 0. Here we will use both representations, depending on which is more convenient for a particular step.

In this exercise, we shall learn how YOLO works, then apply it to car detection. Because the YOLO model is very computationally expensive to train, we will load pre-trained weights for our use.  The instructions for how to do it can be obtained from here and here.



YOLO (“you only look once“) is a popular algorithm because it achieves high accuracy while also being able to run in real-time. This algorithm “only looks once” at the image in the sense that it requires only one forward propagation pass through the network to make predictions. After non-max suppression, it then outputs recognized objects together with the bounding boxes.

Model details

First things to know:

  • The input is a batch of images of shape (m, 608, 608, 3).
  • The output is a list of bounding boxes along with the recognized classes. Each bounding box is represented by 6 numbers (pc,bx,by,bh,bw,c) as explained above. If we expand c into an 80-dimensional vector, each bounding box is then represented by 85 numbers.

We will use 5 anchor boxes. So we can think of the YOLO architecture as the following: IMAGE (m, 608, 608, 3) -> DEEP CNN -> ENCODING (m, 19, 19, 5, 85).

Let’s look in greater detail at what this encoding represents.

Encoding architecture for YOLO


If the center/midpoint of an object falls into a grid cell, that grid cell is responsible for detecting that object.

Since we are using 5 anchor boxes, each of the 19 x19 cells thus encodes information about 5 boxes. Anchor boxes are defined only by their width and height.

For simplicity, we will flatten the last two last dimensions of the shape (19, 19, 5, 85) encoding. So the output of the Deep CNN is (19, 19, 425).

Flattening the last two last dimensions



Now, for each box (of each cell) we will compute the following element-wise product and extract a probability that the box contains a certain class.

Find the class detected by each box


Here’s one way to visualize what YOLO is predicting on an image:

  • For each of the 19×19 grid cells, find the maximum of the probability scores (taking a max across both the 5 anchor boxes and across different classes).
  • Color that grid cell according to what object that grid cell considers the most likely.

Doing this results in this picture:


Each of the 19×19 grid cells colored according to which class has the largest predicted probability in that cell.

Note that this visualization isn’t a core part of the YOLO algorithm itself for making predictions; it’s just a nice way of visualizing an intermediate result of the algorithm.

Another way to visualize YOLO’s output is to plot the bounding boxes that it outputs. Doing that results in a visualization like this:


Each cell gives us 5 boxes. In total, the model predicts: 19x19x5 = 1805 boxes just by looking once at the image (one forward pass through the network)! Different colors denote different classes.

In the figure above, we plotted only boxes that the model had assigned a high probability to, but this is still too many boxes. You’d like to filter the algorithm’s output down to a much smaller number of detected objects. To do so, we’ll use non-max suppression. Specifically, we’ll carry out these steps:

  • Get rid of boxes with a low score (meaning, the box is not very confident about detecting a class).
  • Select only one box when several boxes overlap with each other and detect the same object.


Filtering with a threshold on class scores

We are going to apply a first filter by thresholding. We would like to get rid of any box for which the class “score” is less than a chosen threshold.

The model gives us a total of 19x19x5x85 numbers, with each box described by 85 numbers. It’ll be convenient to rearrange the (19,19,5,85) (or (19,19,425)) dimensional tensor into the following variables:

  • box_confidence: tensor of shape (19×19,5,1) containing pc (confidence probability that there’s some object) for each of the 5 boxes predicted in each of the 19×19 cells.
  • boxes: tensor of shape (19×19,5,4) containing (bx,by,bh,bw) for each of the 5 boxes per cell.
  • box_class_probs: tensor of shape (19×19,5,80) containing the detection probabilities (c1,c2,…c80) for each of the 80 classes for each of the 5 boxes per cell.

Exercise: Implement yolo_filter_boxes().

  • Compute box scores by doing the element-wise product as described in the above figure.
  • For each box, find:
    • the index of the class with the maximum box score.
    • the corresponding box score.
  • Create a mask by using a threshold.  The mask should be True for the boxes you want to keep.
  • Use TensorFlow to apply the mask to box_class_scores, boxes and box_classes to filter out the boxes we don’t want.
    We should be left with just the subset of boxes we want to keep.

Let’s first load the packages and dependencies that are going to be useful.

import argparse
import os
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import scipy.misc
import numpy as np
import pandas as pd
import PIL
import tensorflow as tf
from keras import backend as K
from keras.layers import Input, Lambda, Conv2D
from keras.models import load_model, Model
from yolo_utils import read_classes, read_anchors, generate_colors, preprocess_image, draw_boxes, scale_boxes
from yad2k.models.keras_yolo import yolo_head, yolo_boxes_to_corners, preprocess_true_boxes, yolo_loss, yolo_body


def yolo_filter_boxes(box_confidence, boxes, box_class_probs, threshold = .6):
 """Filters YOLO boxes by thresholding on object and class confidence.

 box_confidence -- tensor of shape (19, 19, 5, 1)
 boxes -- tensor of shape (19, 19, 5, 4)
 box_class_probs -- tensor of shape (19, 19, 5, 80)
 threshold -- real value, if [ highest class probability score = threshold)

 # Step 4: Apply the mask to scores, boxes and classes

return scores, boxes, classes


Non-max suppression

Even after filtering by thresholding over the classes scores, we still end up a lot of overlapping boxes. A second filter for selecting the right boxes is called non-maximum suppression (NMS).


n this example, the model has predicted 3 cars, but it’s actually 3 predictions of the same car. Running non-max suppression (NMS) will select only the most accurate (highest probability) one of the 3 boxes.

Non-max suppression uses the very important function called “Intersection over Union”, or IoU.

Definition of “Intersection over Union”



Exercise: Implement iou(). Some hints:

  • In this exercise only, we define a box using its two corners (upper left and lower right): (x1, y1, x2, y2) rather than the midpoint and height/width.
  • To calculate the area of a rectangle we need to multiply its height (y2 – y1) by its width (x2 – x1)
  • We’ll also need to find the coordinates (xi1, yi1, xi2, yi2) of the intersection of two boxes. Remember that:
    xi1 = maximum of the x1 coordinates of the two boxes
    yi1 = maximum of the y1 coordinates of the two boxes
    xi2 = minimum of the x2 coordinates of the two boxes
    yi2 = minimum of the y2 coordinates of the two boxes

In this code, we use the convention that (0,0) is the top-left corner of an image, (1,0) is the upper-right corner, and (1,1) the lower-right corner.

def iou(box1, box2):
 """Implement the intersection over union (IoU) between box1 and box2

 box1 -- first box, list object with coordinates (x1, y1, x2, y2)
 box2 -- second box, list object with coordinates (x1, y1, x2, y2)

# Calculate the (y1, x1, y2, x2) coordinates of the intersection of box1 and box2. Calculate its Area.

# Calculate the Union area by using Formula: Union(A,B) = A + B - Inter(A,B)

# compute the IoU

return iou


We are now ready to implement non-max suppression. The key steps are:

  • Select the box that has the highest score.
  • Compute its overlap with all other boxes, and remove boxes that overlap it more than iou_threshold.
  • Go back to step 1 and iterate until there’s no more boxes with a lower score than the current selected box.

This will remove all boxes that have a large overlap with the selected boxes. Only the “best” boxes remain.

Exercise: Implement yolo_non_max_suppression() using TensorFlow. TensorFlow has two built-in functions that are used to implement non-max suppression (so we don’t actually need to use your iou() implementation):

def yolo_non_max_suppression(scores, boxes, classes, max_boxes = 10, iou_threshold = 0.5):
 Applies Non-max suppression (NMS) to set of boxes

 scores -- tensor of shape (None,), output of yolo_filter_boxes()
 boxes -- tensor of shape (None, 4), output of yolo_filter_boxes() that have been scaled to the image size (see later)
 classes -- tensor of shape (None,), output of yolo_filter_boxes()
 max_boxes -- integer, maximum number of predicted boxes you'd like
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (, None), predicted score for each box
 boxes -- tensor of shape (4, None), predicted box coordinates
 classes -- tensor of shape (, None), predicted class for each box

 Note: The "None" dimension of the output tensors has obviously to be less than max_boxes. Note also that this
 function will transpose the shapes of scores, boxes, classes. This is made for convenience.

 max_boxes_tensor = K.variable(max_boxes, dtype='int32') # tensor to be used in tf.image.non_max_suppression()
 K.get_session().run(tf.variables_initializer([max_boxes_tensor])) # initialize variable max_boxes_tensor

 # Use tf.image.non_max_suppression() to get the list of indices corresponding to boxes you keep

 # Use K.gather() to select only nms_indices from scores, boxes and classes

 return scores, boxes, classes


Wrapping up the filtering

It’s time to implement a function taking the output of the deep CNN (the 19x19x5x85 dimensional encoding) and filtering through all the boxes using the functions we’ve just implemented.

Exercise: Implement yolo_eval() which takes the output of the YOLO encoding and filters the boxes using score threshold and NMS. There’s just one last implementational detail we have to know. There’re a few ways of representing boxes, such as via their corners or via their midpoint and height/width. YOLO converts between a few such formats at different times, using the following functions (which are provided):

boxes = yolo_boxes_to_corners(box_xy, box_wh)

which converts the yolo box coordinates (x,y,w,h) to box corners’ coordinates (x1, y1, x2, y2) to fit the input of yolo_filter_boxes

boxes = scale_boxes(boxes, image_shape)

YOLO’s network was trained to run on 608×608 images. If we are testing this data on a different size image – for example, the car detection dataset had 720×1280 images – his step rescales the boxes so that they can be plotted on top of the original 720×1280 image.

def yolo_eval(yolo_outputs, image_shape = (720., 1280.), max_boxes=10, score_threshold=.6, iou_threshold=.5):
 Converts the output of YOLO encoding (a lot of boxes) to your predicted boxes along with their scores, box coordinates and classes.

 yolo_outputs -- output of the encoding model (for image_shape of (608, 608, 3)), contains 4 tensors:
 box_confidence: tensor of shape (None, 19, 19, 5, 1)
 box_xy: tensor of shape (None, 19, 19, 5, 2)
 box_wh: tensor of shape (None, 19, 19, 5, 2)
 box_class_probs: tensor of shape (None, 19, 19, 5, 80)
 image_shape -- tensor of shape (2,) containing the input shape, in this notebook we use (608., 608.) (has to be float32 dtype)
 max_boxes -- integer, maximum number of predicted boxes you'd like
 score_threshold -- real value, if [ highest class probability score < threshold], then get rid of the corresponding box
 iou_threshold -- real value, "intersection over union" threshold used for NMS filtering

 scores -- tensor of shape (None, ), predicted score for each box
 boxes -- tensor of shape (None, 4), predicted box coordinates
 classes -- tensor of shape (None,), predicted class for each box

 # Retrieve outputs of the YOLO model

 # Convert boxes to be ready for filtering functions 

 # Use one of the functions you've implemented to perform Score-filtering with a threshold of score_threshold

 # Scale boxes back to original image shape.

 # Use one of the functions you've implemented to perform Non-max suppression with a threshold of iou_threshold 

 return scores, boxes, classes


Summary for YOLO:

  • Input image (608, 608, 3)
  • The input image goes through a CNN, resulting in a (19,19,5,85) dimensional output.
  • After flattening the last two dimensions, the output is a volume of shape (19, 19, 425):
    • Each cell in a 19×19 grid over the input image gives 425 numbers.
    • 425 = 5 x 85 because each cell contains predictions for 5 boxes, corresponding to 5 anchor boxes, as seen in lecture.
    • 85 = 5 + 80 where 5 is because (pc,bx,by,bh,bw) has 5 numbers, and and 80 is the number of classes we’d like to detect.
  • We then select only few boxes based on:
    • Score-thresholding: throw away boxes that have detected a class with a score less than the threshold.
    • Non-max suppression: Compute the Intersection over Union and avoid selecting overlapping boxes.
  • This gives us YOLO’s final output.


Test YOLO pretrained model on images

In this part, we are going to use a pre-trained model and test it on the car detection dataset. As usual, we start by creating a session to start your graph. Run the following cell.

sess = K.get_session()

Defining classes, anchors and image shape.

Recall that we are trying to detect 80 classes, and are using 5 anchor boxes. We have gathered the information about the 80 classes and 5 boxes in two files “coco_classes.txt” and “yolo_anchors.txt”. Let’s load these quantities into the model by running the next cell.

The car detection dataset has 720×1280 images, which we’ve pre-processed into 608×608 images.

class_names = read_classes(“coco_classes.txt”)
anchors = read_anchors(“yolo_anchors.txt”)
image_shape = (720., 1280.)


Loading a pretrained model

Training a YOLO model takes a very long time and requires a fairly large dataset of labelled bounding boxes for a large range of target classes. We are going to load an existing pretrained Keras YOLO model stored in “yolo.h5”. (These weights come from the official YOLO website, and were converted using a function written by Allan Zelener.  Technically, these are the parameters from the “YOLOv2” model, but we will more simply refer to it as “YOLO” in this notebook.)

yolo_model = load_model(“yolo.h5”)

This loads the weights of a trained YOLO model. Here’s a summary of the layers our model contains.


Layer (type) Output Shape Param # Connected to
input_1 (InputLayer) (None, 608, 608, 3) 0
conv2d_1 (Conv2D) (None, 608, 608, 32) 864 input_1[0][0]
batch_normalization_1 (BatchNor (None, 608, 608, 32) 128 conv2d_1[0][0]
leaky_re_lu_1 (LeakyReLU) (None, 608, 608, 32) 0 batch_normalization_1[0][0]
max_pooling2d_1 (MaxPooling2D) (None, 304, 304, 32) 0 leaky_re_lu_1[0][0]
conv2d_2 (Conv2D) (None, 304, 304, 64) 18432 max_pooling2d_1[0][0]
batch_normalization_2 (BatchNor (None, 304, 304, 64) 256 conv2d_2[0][0]
leaky_re_lu_2 (LeakyReLU) (None, 304, 304, 64) 0 batch_normalization_2[0][0]
max_pooling2d_2 (MaxPooling2D) (None, 152, 152, 64) 0 leaky_re_lu_2[0][0]
conv2d_3 (Conv2D) (None, 152, 152, 128 73728 max_pooling2d_2[0][0]
batch_normalization_3 (BatchNor (None, 152, 152, 128 512 conv2d_3[0][0]
leaky_re_lu_3 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_3[0][0]
conv2d_4 (Conv2D) (None, 152, 152, 64) 8192 leaky_re_lu_3[0][0]
batch_normalization_4 (BatchNor (None, 152, 152, 64) 256 conv2d_4[0][0]
leaky_re_lu_4 (LeakyReLU) (None, 152, 152, 64) 0 batch_normalization_4[0][0]
conv2d_5 (Conv2D) (None, 152, 152, 128 73728 leaky_re_lu_4[0][0]
batch_normalization_5 (BatchNor (None, 152, 152, 128 512 conv2d_5[0][0]
leaky_re_lu_5 (LeakyReLU) (None, 152, 152, 128 0 batch_normalization_5[0][0]
max_pooling2d_3 (MaxPooling2D) (None, 76, 76, 128) 0 leaky_re_lu_5[0][0]
conv2d_6 (Conv2D) (None, 76, 76, 256) 294912 max_pooling2d_3[0][0]
batch_normalization_6 (BatchNor (None, 76, 76, 256) 1024 conv2d_6[0][0]
leaky_re_lu_6 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_6[0][0]
conv2d_7 (Conv2D) (None, 76, 76, 128) 32768 leaky_re_lu_6[0][0]
batch_normalization_7 (BatchNor (None, 76, 76, 128) 512 conv2d_7[0][0]
leaky_re_lu_7 (LeakyReLU) (None, 76, 76, 128) 0 batch_normalization_7[0][0]
conv2d_8 (Conv2D) (None, 76, 76, 256) 294912 leaky_re_lu_7[0][0]
batch_normalization_8 (BatchNor (None, 76, 76, 256) 1024 conv2d_8[0][0]
leaky_re_lu_8 (LeakyReLU) (None, 76, 76, 256) 0 batch_normalization_8[0][0]
max_pooling2d_4 (MaxPooling2D) (None, 38, 38, 256) 0 leaky_re_lu_8[0][0]
conv2d_9 (Conv2D) (None, 38, 38, 512) 1179648 max_pooling2d_4[0][0]
batch_normalization_9 (BatchNor (None, 38, 38, 512) 2048 conv2d_9[0][0]
leaky_re_lu_9 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_9[0][0]
conv2d_10 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_9[0][0]
batch_normalization_10 (BatchNo (None, 38, 38, 256) 1024 conv2d_10[0][0]
leaky_re_lu_10 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_10[0][0]
conv2d_11 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_10[0][0]
batch_normalization_11 (BatchNo (None, 38, 38, 512) 2048 conv2d_11[0][0]
leaky_re_lu_11 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_11[0][0]
conv2d_12 (Conv2D) (None, 38, 38, 256) 131072 leaky_re_lu_11[0][0]
batch_normalization_12 (BatchNo (None, 38, 38, 256) 1024 conv2d_12[0][0]
leaky_re_lu_12 (LeakyReLU) (None, 38, 38, 256) 0 batch_normalization_12[0][0]
conv2d_13 (Conv2D) (None, 38, 38, 512) 1179648 leaky_re_lu_12[0][0]
batch_normalization_13 (BatchNo (None, 38, 38, 512) 2048 conv2d_13[0][0]
leaky_re_lu_13 (LeakyReLU) (None, 38, 38, 512) 0 batch_normalization_13[0][0]
max_pooling2d_5 (MaxPooling2D) (None, 19, 19, 512) 0 leaky_re_lu_13[0][0]
conv2d_14 (Conv2D) (None, 19, 19, 1024) 4718592 max_pooling2d_5[0][0]
batch_normalization_14 (BatchNo (None, 19, 19, 1024) 4096 conv2d_14[0][0]
leaky_re_lu_14 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_14[0][0]
conv2d_15 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_14[0][0]
batch_normalization_15 (BatchNo (None, 19, 19, 512) 2048 conv2d_15[0][0]
leaky_re_lu_15 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_15[0][0]
conv2d_16 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_15[0][0]
batch_normalization_16 (BatchNo (None, 19, 19, 1024) 4096 conv2d_16[0][0]
leaky_re_lu_16 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_16[0][0]
conv2d_17 (Conv2D) (None, 19, 19, 512) 524288 leaky_re_lu_16[0][0]
batch_normalization_17 (BatchNo (None, 19, 19, 512) 2048 conv2d_17[0][0]
leaky_re_lu_17 (LeakyReLU) (None, 19, 19, 512) 0 batch_normalization_17[0][0]
conv2d_18 (Conv2D) (None, 19, 19, 1024) 4718592 leaky_re_lu_17[0][0]
batch_normalization_18 (BatchNo (None, 19, 19, 1024) 4096 conv2d_18[0][0]
leaky_re_lu_18 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_18[0][0]
conv2d_19 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_18[0][0]
batch_normalization_19 (BatchNo (None, 19, 19, 1024) 4096 conv2d_19[0][0]
conv2d_21 (Conv2D) (None, 38, 38, 64) 32768 leaky_re_lu_13[0][0]
leaky_re_lu_19 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_19[0][0]
batch_normalization_21 (BatchNo (None, 38, 38, 64) 256 conv2d_21[0][0]
conv2d_20 (Conv2D) (None, 19, 19, 1024) 9437184 leaky_re_lu_19[0][0]
leaky_re_lu_21 (LeakyReLU) (None, 38, 38, 64) 0 batch_normalization_21[0][0]
batch_normalization_20 (BatchNo (None, 19, 19, 1024) 4096 conv2d_20[0][0]
space_to_depth_x2 (Lambda) (None, 19, 19, 256) 0 leaky_re_lu_21[0][0]
leaky_re_lu_20 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_20[0][0]
concatenate_1 (Concatenate) (None, 19, 19, 1280) 0 space_to_depth_x2[0][0]
conv2d_22 (Conv2D) (None, 19, 19, 1024) 11796480 concatenate_1[0][0]
batch_normalization_22 (BatchNo (None, 19, 19, 1024) 4096 conv2d_22[0][0]
leaky_re_lu_22 (LeakyReLU) (None, 19, 19, 1024) 0 batch_normalization_22[0][0]
conv2d_23 (Conv2D) (None, 19, 19, 425) 435625 leaky_re_lu_22[0][0]
Total params: 50,983,561
Trainable params: 50,962,889
Non-trainable params: 20,672


Reminder: this model converts a pre-processed batch of input images (shape: (m, 608, 608, 3)) into a tensor of shape (m, 19, 19, 5, 85) as explained in the above Figure.

Convert output of the model to usable bounding box tensors

The output of yolo_model is a (m, 19, 19, 5, 85) tensor that needs to pass through non-trivial processing and conversion. The following code does this.

yolo_outputs = yolo_head(yolo_model.output, anchors, len(class_names))

We added yolo_outputs to your graph. This set of 4 tensors is ready to be used as input by our yolo_eval function.

Filtering boxes

yolo_outputs gave us all the predicted boxes of yolo_model in the correct format. We’re now ready to perform filtering and select only the best boxes. Lets now call yolo_eval, which you had previously implemented, to do this.

scores, boxes, classes = yolo_eval(yolo_outputs, image_shape)

Run the graph on an image

Let the fun begin. We have created a (sess) graph that can be summarized as follows:

  1. yolo_model.input is given to yolo_model. The model is used to compute the output yolo_model.output
  2. yolo_model.output is processed by yolo_head. It gives us yolo_outputs
  3. yolo_outputs goes through a filtering function, yolo_eval. It outputs your predictions: scores, boxes, classes

Exercise: Implement predict() which runs the graph to test YOLO on an image. We shall need to run a TensorFlow session, to have it compute scores, boxes, classes.

The code below also uses the following function:

image, image_data = preprocess_image(“images/” + image_file, model_image_size = (608, 608))

which outputs:

  • image: a python (PIL) representation of your image used for drawing boxes. You won’t need to use it.
  • image_data: a numpy-array representing the image. This will be the input to the CNN.

Important note: when a model uses BatchNorm (as is the case in YOLO), we will need to pass an additional placeholder in the feed_dict {K.learning_phase(): 0}.

def predict(sess, image_file):
Runs the graph stored in "sess" to predict boxes for "image_file". Prints and plots the preditions.

sess -- your tensorflow/Keras session containing the YOLO graph
image_file -- name of an image stored in the "images" folder.

out_scores -- tensor of shape (None, ), scores of the predicted boxes
out_boxes -- tensor of shape (None, 4), coordinates of the predicted boxes
out_classes -- tensor of shape (None, ), class index of the predicted boxes

Note: "None" actually represents the number of predicted boxes, it varies between 0 and max_boxes.

 # Preprocess your image

 # Run the session with the correct tensors and choose the correct placeholders in the
 # feed_dict. We'll need to use feed_dict={yolo_model.input: ... , K.learning_phase(): 0})

 # Print predictions info
print('Found {} boxes for {}'.format(len(out_boxes), image_file))
# Generate colors for drawing bounding boxes.
colors = generate_colors(class_names)
# Draw bounding boxes on the image file
draw_boxes(image, out_scores, out_boxes, out_classes, class_names, colors)
# Save the predicted bounding box on the image"out", image_file), quality=90)
# Display the results
output_image = scipy.misc.imread(os.path.join("out", image_file))

 return out_scores, out_boxes, out_classes

Let’s Run the following cell on the following “test.jpg” image to verify that our function is correct.


out_scores, out_boxes, out_classes = predict(sess, “test.jpg”)

The following figure shows the output after car detection. Each of the bounding boxes have the name of the object detected on the top left along with the confidence value.

Output (with detected cars with YOLO)

Found 7 boxes for test.jpg
car 0.60 (925, 285) (1045, 374)
car 0.66 (706, 279) (786, 350)
bus 0.67 (5, 266) (220, 407)
car 0.70 (947, 324) (1280, 705)
car 0.74 (159, 303) (346, 440)
car 0.80 (761, 282) (942, 412)
car 0.89 (367, 300) (745, 648)


The following animation shows the output Images with detected objects (cars) using YOLO for a set of input images.


What we should remember:

  • YOLO is a state-of-the-art object detection model that is fast and accurate.
  • It runs an input image through a CNN which outputs a 19x19x5x85 dimensional volume.
  • The encoding can be seen as a grid where each of the 19×19 cells contains information about 5 boxes.
  • You filter through all the boxes using non-max suppression. Specifically:
    Score thresholding on the probability of detecting a class to keep only accurate (high probability) boxes.
  • Intersection over Union (IoU) thresholding to eliminate overlapping boxes.
  • Because training a YOLO model from randomly initialized weights is non-trivial and requires a large dataset as well as lot of computation, we used previously trained model parameters in this exercise.


References: The ideas presented in this notebook came primarily from the two YOLO papers. The implementation here also took significant inspiration and used many components from Allan Zelener’s github repository. The pretrained weights used in this exercise came from the official YOLO website.

  1. Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi – You Only Look Once: Unified, Real-Time Object Detection (2015)
  2. Joseph Redmon, Ali Farhadi – YOLO9000: Better, Faster, Stronger (2016)
  3. Allan Zelener – YAD2K: Yet Another Darknet 2 Keras
  4. The official YOLO website .

Car detection dataset: Creative Commons License.

The Sample Dataset (provided by is licensed under a Creative Commons Attribution 4.0 International License.

Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.


Problem Statement

Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.


  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
    w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
    I1g = I1g / 255. # normalize pixels
    I2g = I2g / 255. # normalize pixels
    # Implement Lucas Kanade
    # for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            #b = ... # get b here
            #A = ... # get A here
            # if threshold τ is larger than the smallest eigenvalue of A'A:
            nu = ... # get velocity here

    return (u,v)


Some Results

  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
  • Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences



Output Optical Flow with different window sizes

window size = 15


window size = 21



Input Sequences

Output Optical Flow


Input Sequences (hamburg taxi)


Output Optical Flowtaxi_opt


Input Sequences


Output Optical Flow

Input Sequences


Output Optical Flowseq_opt

Input Sequences    fount3.gif


Output Optical Flowfount_opt

Input Sequences

Output Optical Flow

Input Sequencessynth

Output Optical Flowsynth_opt

Input Sequencescars1
Output Optical Flowcars1_optcars1_cmap

Input Sequencescars2

Output Optical Flowcars2_opt

Output Optical Flowcars2_opt2cars2_cmap

Input Sequences



Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25
Output Optical Flow with window size 45cars3_opt2_45