Solving Some Image Processing Problems with Python libraries

In this article a few popular image processing problems along with their solutions are going to be discussed. Python image processing libraries are going to be used to solve these problems. Some of the problems are from the exercises from this book  (available on amazon).


Image Transformations and Warping


0. Display RGB image color channels in 3D

  1. A gray-scale image can be thought of a 2-D function f(x,y) of the pixel locations (x,y), that maps each pixel into its corresponding gray level (an integer in [0,255], e.g.,).
  2. For an RGB image there are 3 such functions, f_R(x,y), f_G(x.y), f_B(x.y).
  3. matplotlib’s 3-D plot functions can be used to plot each function.

The following python code shows how to plot the RGB channels separately in 3D:

import matplotlib.pylab as plt
from mpl_toolkits.mplot3d import Axes3D
def plot_3d(X, Y, Z, title, cmap):
    # implement this function to plot the channel pixel values in 3D

im = imread('../images/parrot.jpg')
Y = np.arange(im.shape[0])
X = np.arange(im.shape[1])
Z1 = im[...,0]
Z2 = im[...,1]
Z3 = im[...,2]
plot_3d(X, Y, Z1, cmap='Reds', title='3D plot for the Red Channel')
plot_3d(X, Y, Z2, cmap='Greens', title='3D plot for the Green Channel')
plot_3d(X, Y, Z3, cmap='Blues', title='3D plot for the Blue Channel')

The RGB image


1. Wave Transform

  1. Use scikit-image’s warp() function to implement the wave transform.
  2. Note that wave transform can be expressed with the following equations:

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:

def wave(xy):
    xy[:, 1] += 20*np.sin(2*np.pi*xy[:, 0]/64)
    return xy

from import imread
from skimage.transform import warp
import matplotlib.pylab as plt
im = imread('images/mandrill.jpg')
im = warp(im, wave)

The next figure shows the original mandrill input image and the output image obtained after applying the wave transform.

mandrill      mandrill_w.png


2. Swirl Transform

  1. Use scikit-image’s warp() function to implement the swirl transform.
  2. Note that swirl transform can be expressed with the following equations

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:

def swirl(xy, x0, y0, R):
    r = np.sqrt((xy[:,1]-x0)**2 + (xy[:,0]-y0)**2)
    a = np.pi * r / R
    xy[:, 1] = (xy[:, 1]-x0)*np.cos(a) + (xy[:, 0]-y0)*np.sin(a) + x0
    xy[:, 0] = -(xy[:, 1]-x0)*np.sin(a) + (xy[:, 0]-y0)*np.cos(a) + y0
    return xy

im = imread('../images/mandrill.jpg')
im = warp(im, swirl, map_args={'x0':112, 'y0':112, 'R':512})

The next figure shows the original mandrill input image and the output image obtained after applying the swirl transform.

mandrill   ms.png


Compare this with the output of the scikit-image swirl() function.

3. Very simple Face morphing with α-blending

  1. Start from one face image (e.g., let image1 be the face of Messi) and end into another image (let image2 be the face of Ronaldo) iteratively, creating some intermediate images in between.
  2. At each iteration create an image by using a linear combination of the two image numpy ndarrays given by
     3. Iteratively increase α from 0 to 1.

The following code block shows how to implement it using matplotlib’s image and pylab modules.

im1 = mpimg.imread("../images/messi.jpg") / 255 # scale RGB values in [0,1]
im2 = mpimg.imread("../images/ronaldo.jpg") / 255
i = 1
for alpha in np.linspace(0,1,20):
 plt.imshow((1-alpha)*im1 + alpha*im2)
 i += 1
plt.subplots_adjust(wspace=0.05, hspace=0.05)

The next animation shows the simple face morphing:
There are more sophisticated techniques to improve the quality of morphing, but this is the simplest one.

4. Creating Instagram-like Gotham Filter

The Gotham filter

The Gotham filter is computed as follows (the steps taken from here), applying the following operations on an image, the corresponding python code, input and output images are shown along with the operations (with the following input image):


  1. A mid-tone red contrast boost
    from PIL import Image
    import numpy as np
    import matplotlib.pylab as plt
    im ='../images/city.jpg') # pixel values in [0,255]
    r, g, b = im.split()
    red_levels = [0., 12.75, 25.5, 51., 76.5, 127.5, 178.5, 204., 229.5, 242.25, 255.]
    r1 = Image.fromarray((np.reshape(np.interp(np.array(r).ravel(), np.linspace(0,255,len(red_levels)), red_levels), (im.height, im.width))).astype(np.uint8), mode='L')
    plt.title('original', size=20)
    im1 = Image.merge('RGB', (r1, g, b))
    plt.title('with red channel interpolation', size=20)
    plt.hist(np.array(r).ravel(), normed=True)
    plt.hist(np.array(r1).ravel(), normed=True)


  2. Make the blacks a little bluer
    plt.title('last image', size=20)
    b1 = Image.fromarray(np.clip(np.array(b) + 7.65, 0, 255).astype(np.uint8))
    im1 = Image.merge('RGB', (r1, g, b1))
    plt.title('with transformation', size=20)


  3. A small sharpening
    from PIL.ImageEnhance import Sharpness
    plt.title('last image', size=20)
    im2 = Sharpness(im1).enhance(3.0)
    plt.title('with transformation', size=20)


  4. A boost in blue channel for lower mid-tones
  5. A decrease in blue channel for upper mid-tones
    blue_levels = [0., 11.985, 30.09, 64.005, 81.09, 99.96, 107.1, 111.945, 121.125, 143.055, 147.9, 159.885, 171.105, 186.915, 215.985, 235.875, 255.]
    b2 = Image.fromarray((np.reshape(np.interp(np.array(b1).ravel(), np.linspace(0,255,len(blue_levels)), blue_levels), (im.height, im.width))).astype(np.uint8), mode='L')
    plt.title('last image', size=20)
    im3 = Image.merge('RGB', (r1, g, b2))
    plt.title('with blue channel interpolation', size=20)
    plt.hist(np.array(b1).ravel(), normed=True)
    plt.hist(np.array(b2).ravel(), normed=True)


The output image obtained after applying the Gotham filter is shown below:


Down-sampling with anti-aliasing using Gaussian Filter

  1. Start with a large gray-scale image and reduce the image size 16 times, by reducing both height and width by 4 times.
  2. Select every 4th pixel in the x and the y direction from the original image to compute the values of the pixels in the smaller image.
  3. Before down-sampling apply a Gaussian filter (to smooth the image) for anti-aliasing.
  4. Compare the quality of the output image obtained by down-sampling without a Gaussian filter (with aliasing).

The next code block performs the above steps. Since the Gaussian blur is a low-pass filter, it removes the high frequencies from the original input image, hence it’s possible to achieve sampling rate above the Nyquist rate (by sampling theorem) to avoid aliasing.

from scipy.ndimage import gaussian_filter
im = rgb2gray(imread('images/umbc.png'))
im_blurred = gaussian_filter(im, sigma=2.5) #(5,5,1)
n = 4 # create and image 16 times smaller in size
w, h = im.shape[0] // n, im.shape[1] // n
im_small = np.zeros((w,h))
for i in range(w):
   for j in range(h):
      im_small[i,j] = im[n*i, n*j]
im_small = np.zeros((w,h))
for i in range(w):
   for j in range(h):
      im_small[i,j] = im_blurred[n*i, n*j]

Original Image

    Image blurred with Gaussian Filter LPF blur_umbc.png

Down-sampled Image from the original image (with aliasing)


Down-sampled Image from the blurred image (with anti-aliasing)anti_alias_umbc

Some Applications of DFT


0. Fourier Transform of a Gaussian Kernel is another Gaussian Kernel

Also, the spread in the frequency domain  inversely proportional to the spread in the spatial domain (known as Heisenberg’s inequality). Here is the proof:


The following animation shows an example visualizing the Gaussian contours in spatial and corresponding frequency domains:



1. Using DFT to up-sample an image

  1. Let’s use the lena gray-scale image.
  2. First double the size of the by padding zero rows/columns at every alternate positions.
  3. Use FFT followed by an LPF.
  4. Finally use IFFT to get the output image.


The following code block shows the python code for implementing the steps listed above:

import numpy as np
import numpy.fft as fp
import matplotlib.pyplot as plt

im = np.mean(imread('images/lena.jpg'), axis=2)
im1 = np.zeros((2*im.shape[0], 2*im.shape[1]))
print(im.shape, im1.shape)
for i in range(im.shape[0]):
    for j in range(im.shape[1]):
        im1[2*i,2*j] = im[i,j]

def padwithzeros(vector, pad_width, iaxis, kwargs):
    vector[:pad_width[0]] = 0
    vector[-pad_width[1]:] = 0
    return vector

# the LPF kernel
kernel = [[0.25, 0.5, 0.25], [0.5, 1, 0.5], [0.25, 0.5, 0.25]]
# enlarge the kernel to the shape of the image
kernel = np.pad(kernel, (((im1.shape[0]-3)//2,(im1.shape[0]-3)//2+1), ((im1.shape[1]-3)//2,(im1.shape[1]-3)//2+1)), padwithzeros) 

plt.gray() # show the filtered result in grayscale

freq = fp.fft2(im1)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
freq_LPF = freq*freq_kernel # by the Convolution theorem
im2 = fp.ifft2(freq_LPF)
freq_im2 = fp.fft2(im2)

plt.title('Original Image', size=20)
plt.title('Padded Image', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq))).astype(int), cmap='jet')
plt.title('Original Image Spectrum', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_kernel))).astype(int), cmap='jet')
plt.title('Image Spectrum of the LPF', size=20)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_im2))).astype(int), cmap='jet')
plt.title('Image Spectrum after LPF', size=20)
plt.imshow(im2.astype(np.uint8)) # the imaginary part is an artifact
plt.title('Output Image', size=20)

The next figure shows the output. As can be seen from the next figure, the LPF removed the high frequency components from the Fourier spectrum of the padded image and with a subsequent inverse Fourier transform  we get a decent enlarged image.



2. Frequency Domain Gaussian Filter

  1. Use an input image and use DFT to create the frequency 2D-array.
  2. Create a small Gaussian 2D Kernel (to be used as an LPF) in the spatial domain and pad it to enlarge it to the image dimensions.
  3. Use DFT to obtain the Gaussian Kernel in the frequency domain.
  4. Use the Convolution theorem to convolve the LPF with the input image in the frequency domain.
  5. Use IDFT to obtain the output image.
  6. Plot the frequency spectrum of the image, the gaussian kernel and the image obtained after convolution in the frequency domain, in 3D.

The following code block shows the python code:

import matplotlib.pyplot as plt
from matplotlib import cm
from skimage.color import rgb2gray
from import imread
import scipy.fftpack as fp

im = rgb2gray(imread('images/temple.jpg'))
kernel = np.outer(signal.gaussian(im.shape[0], 10), signal.gaussian(im.shape[1], 10))
freq = fp.fft2(im)
assert(freq.shape == kernel.shape)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved = freq*freq_kernel # by the Convolution theorem
im_blur = fp.ifft2(convolved).real
im_blur = 255 * im_blur / np.max(im_blur)

# center the frequency response
plt.imshow( (20*np.log10( 0.01 + fp.fftshift(freq_kernel))).real.astype(int), cmap='coolwarm')

plt.imshow(im, cmap='gray')

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
# ... code for 3D visualization of the spectrums

The original color temple image (time / spatial domain)


The temple image (frequency domain)


The Gaussian Kernel LPF in 2D (frequency domain)


The Gaussian Kernel LPF (frequency domain)

The smoothed temple image with the LPF (frequency domain)


If we set the standard deviation of the LPF Gaussian kernel to be 10 we get the following output as shown in the next figures. As can be seen, the frequency response value drops much quicker from the center.

The smoothed temple image with the LPF with higher s.d. (frequency domain)


The output image after convolution (spatial / time domain)


3. Using the inverse filter to restore a motion-blurred image

  1. First create a motion blur kernel of a given shape.
  2. Convolve the kernel with an input image in the frequency domain.
  3. Get the motion-blurred image in the spatial domain with IDFT.
  4. Compute the inverse filter kernel and convolve with the blurred image in the frequency domain.
  5. Get the convolved image back in the spatial domain.
  6. Plot all the images and kernels in the frequency domain.

The following code block shows the python code:

im = rgb2gray(imread('../images/madurga.jpg'))
# create the motion blur kernel
size = 21
kernel = np.zeros((size, size))
kernel[int((size-1)/2), :] = np.ones(size)
kernel = kernel / size
kernel = np.pad(kernel, (((im.shape[0]-size)//2,(im.shape[0]-size)//2+1), ((im.shape[1]-size)//2,(im.shape[1]-size)//2+1)), padwithzeros)

freq = fp.fft2(im)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved1 = freq1*freq_kernel1
im_blur = fp.ifft2(convolved1).real
im_blur = im_blur / np.max(im_blur)

epsilon = 10**-6

freq = fp.fft2(im_blur)
freq_kernel = 1 / (epsilon + freq_kernel1)

convolved = freq*freq_kernel
im_restored = fp.ifft2(convolved).real
im_restored = im_restored / np.max(im_restored)

plt.title('Original image', size=20)
plt.title('Blurred image with motion blur kernel', size=20)
plt.title('Restored image with inverse filter', size=20)
plt.imshow(im_restored - im)
plt.title('Diff restored & original image', size=20)

# Plot the surface of the frequency responses here


Frequency response of the input image


(log) Frequency response of the motion blur kernel (LPF)


Input image convolved with the motion blur kernel (frequency domain)


(log) Frequency response of the inverse frequency filter kernel (HPF)


Motion-blurred image convolved with the inverse frequency filter kernel (frequency domain)



4. Impact of noise on the inverse filter

  1. Add some random noise to the Lena image.
  2. Blur the image with a Gaussian kernel.
  3. Restore the image using inverse filter.

With the original image

Let’s first blur and apply the inverse filter on the noiseless blurred image. The following figures show the outputs:


(log) Frequency response of the input imagelena_freq.png

(log) Frequency response of the Gaussian blur kernel (LPF)lena_gaussian_freq

(log) Frequency response of the blurred image


(log) Frequency response of the inverse kernel (HPF)


Frequency response of the output imagelena_inverse_spectrum

Adding noise to the original image

The following python code can be used to add Gaussian noise to an image:

from skimage.util import random_noise
im = random_noise(im, var=0.1)

The next figures show the noisy lena image, the blurred image with a Gaussian Kernel and the restored image with the inverse filter. As can be seen, being a high-pass filter, the inverse filter enhances the noise, typically corresponding to high frequencies.



5. Use a notch filter to remove periodic noise from the following half-toned car image.


  1. Use DFT to obtain the frequency spectrum of the image.
  2. Block the high frequency components that are most likely responsible fro noise.
  3. Use IDFT to come back to the spatial domain.


from scipy import fftpack
im = imread('images/halftone.png')
F1 = fftpack.fft2((im).astype(float))
F2 = fftpack.fftshift(F1)
for i in range(60, w, 135):
   for j in range(100, h, 200):
     if not (i == 330 and j == 500):
        F2[i-10:i+10, j-10:j+10] = 0
for i in range(0, w, 135):
   for j in range(200, h, 200):
     if not (i == 330 and j == 500):
        F2[max(0,i-15):min(w,i+15), max(0,j-15):min(h,j+15)] = 0
plt.imshow( (20*np.log10( 0.1 + F2)).astype(int),
im1 = fp.ifft2(fftpack.ifftshift(F2)).real
plt.imshow(im1, cmap='gray')

Frequency response of the input image

Frequency response of the input image with blocked frequencies with notch


Output image


With a low-pass-filter (LPF):

Frequency response of the input image with blocked frequencies with LPF

Output imagecar_notch

Histogram Matching with color images

As described here, here is the algorithm:

  1. The cumulative histogram is computed for each image dataset, see the figure below.
  2. For any particular value (xi) in the input image data to be adjusted has a cumulative histogram value given by G(xi).
  3. This in turn is the cumulative distribution value in the reference (template) image  dataset, namely H(xj). The input data value xi is replaced by xj.


im = imread('images/lena.jpg')
im_t = imread('images/vstyle.png')
im1 = np.zeros(im.shape).astype(np.uint8)
for i in range(3):
    c = cdf(im[...,i])
    c_t = cdf(im_t[...,i])
    im1[...,i] = hist_matching(c, c_t, im[...,i]) # implement this function with the above algorithm
    c1 = cdf(im1[...,i])
    col = 'r' if i == 0 else ('g' if i == 1 else 'b')
    plt.plot(np.arange(256), c, col + ':', label='input ' + col.upper(), linewidth=5)
    plt.plot(np.arange(256), c_t, col + '--', label='template ' + col.upper(), linewidth=5)
    plt.plot(np.arange(256), c1, col + '-', label='output ' + col.upper(), linewidth=2)
plt.title('CDF', size=20)
plt.legend(prop={'size': 15})


Input image                                                 

Template Imagevstyle

Output Image

The following figure shows how the histogram of the input image is matched with the histogram of the template image.


Another example:

Input image                                                 


Template Image


Output Image



Mathematical Morphology

1. Automatically cropping an image

  1. Let’s use the following image. The image has unnecessary white background  outside the molecule of the organic compound.L_2d
  2. First convert the image to a binary image and compute the convex hull of the molecule object.
  3. Use the convex hull image to find the bounding box for cropping.
  4. Crop the original image with the bounding box.

The next python code shows how to implement the above steps:

from PIL import Image
from import imread
from skimage.morphology import convex_hull_image
im = imread('../images/L_2d.jpg')
plt.title('input image')
im1 = 1 - rgb2gray(im)
threshold = 0.5
im1[im1  threshold] = 1
chull = convex_hull_image(im1)
plt.title('convex hull in the binary image')
imageBox = Image.fromarray((chull*255).astype(np.uint8)).getbbox()
cropped = Image.fromarray(im).crop(imageBox)'L_2d_cropped.jpg')
plt.title('cropped image')


This can also be found here.

2. Opening and Closing are Dual operations in mathematical morphology

  1. Start with a binary image and apply opening operation with some structuring element (e.g., a disk) on it to obtain an output image.
  2. Invert the image (to change the foreground to background and vice versa) and apply closing operation on it with the same structuring element to obtain another output image.
  3. Invert the second output image obtained and observe that it’s same as the first output image.
  4. Thus applying opening operation to the foreground of a binary image is equivalent to applying closing operation to the background of the same image with the same structuring element.

The next python code shows the implementation of the above steps.

from skimage.morphology import binary_opening, binary_closing, disk
from skimage.util import invert
im = rgb2gray(imread('../new images/circles.jpg'))
im[im  0.5] = 1
plt.title('original', size=20)
im1 = binary_opening(im, disk(12))
plt.title('opening with disk size ' + str(12), size=20)
im1 = invert(binary_closing(invert(im), disk(12)))
plt.title('closing with disk size ' + str(12), size=20)

As can be seen the output images obtained are exactly same.


Floyd-Steinberg Dithering (to convert a grayscale to a binary image)

The next figure shows the algorithm for error diffusion dithering.


def find_closest_palette_color(oldpixel):
    return int(round(oldpixel / 255)*255)

im = rgb2gray(imread('../my images/godess.jpg'))*255
pixel = np.copy(im)
w, h = im.shape

for x in range(w):
    for y in range(h):
        oldpixel = pixel[x][y]
        newpixel = find_closest_palette_color(oldpixel)
        pixel[x][y] = newpixel
        quant_error = oldpixel - newpixel
        if x + 1 < w-1:
          pixel[x + 1][y] = pixel[x + 1][y] + quant_error * 7 / 16
        if x > 0 and y < h-1:
           pixel[x - 1][y + 1] = pixel[x - 1][y + 1] + quant_error * 3 / 16
        if y < h-1:
           pixel[x ][y + 1] = pixel[x ][y + 1] + quant_error * 5 / 16
        if x < w-1 and y < h-1:
           pixel[x + 1][y + 1] = pixel[x + 1][y + 1] + quant_error * 1 / 16

plt.imshow(pixel, cmap='gray')

The input image (gray-scale)


The output Image (binary)


The next animation shows how an another input grayscale image gets converted to output binary image using the error diffusion dithering.


Sharpen a color image

  1. First blur the image with an LPF (e.g., Gaussian Filter).
  2. Compute the detail image as the difference between the original and the blurred image.
  3. Now the sharpened image can be computed as a linear combination of the original image and the detail  image. The next figure illustrates the concept.


The next python code shows how this can be implemented in python:

from scipy import misc, ndimage
import matplotlib.pyplot as plt
import numpy as np

def rgb2gray(im):
    return np.clip(0.2989 * im[...,0] + 0.5870 * im[...,1] + 0.1140 * im[...,2], 0, 1)

im = misc.imread('../my images/me.jpg')/255
im_blurred = ndimage.gaussian_filter(im, (5,5,0))
im_detail = np.clip(im - im_blurred, 0, 1)
fig, axes = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(15, 15))
axes = axes.ravel()
axes[0].set_title('Original image', size=15)
axes[1].set_title('Blurred image, sigma=5', size=15)
axes[2].set_title('Detail image', size=15)
alpha = [1, 5, 10]
for i in range(3):
    im_sharp = np.clip(im + alpha[i]*im_detail, 0, 1)
    axes[3+i].set_title('Sharpened image, alpha=' + str(alpha[i]), size=15)
for ax in axes:

The next figure shows the output of the above code block. As cane be seen, the output gets more sharpened as the value of alpha gets increased.


The next animation shows how the image gets more and more sharpened with increasing alpha.


Edge Detection with LOG and Zero-Crossing Algorithm by Marr and Hildreth

The following figure shows LOG filter and its DOG approximation.



In order to detect edges as a binary image, finding the zero-crossings in the LoG-convolved image was proposed by Marr and Hildreth. Identification of the edge pixels can be done by viewing the sign of the LoG-smoothed image by defining it as a binary image, the algorithm is as follows:

Algorithm to compute the zero-crossing 

  1. First convert the LOG-convolved image to a binary image, by replacing the pixel values by 1 for positive values and 0 for negative values.
  2. In order to compute the zero crossing pixels, we need to simply look at the boundaries of the non-zero regions in this binary image.
  3. Boundaries can be found by finding any non-zero pixel that has an immediate neighbor which is is zero.
  4. Hence, for each pixel, if it is non-zero, consider its 8 neighbors, if any of the neighboring pixels is zero, the pixel can be identified as an edge.

The next python code and the output images / animations generated show how to detect the edges from the  zebra image with LOG + zero-crossings:

from scipy import ndimage, misc
import matplotlib.pyplot as plt
from scipy.misc import imread
from skimage.color import rgb2gray
def any_neighbor_zero(img, i, j):
    for k in range(-1,2):
      for l in range(-1,2):
         if img[i+k, j+k] == 0:
            return True
    return False
def zero_crossing(img):
  img[img > 0] = 1
  img[img  0 and any_neighbor_zero(img, i, j):
        out_img[i,j] = 255
  return out_img

img = rgb2gray(imread('../images/zebras.jpg'))

fig = plt.figure(figsize=(25,15))
plt.gray() # show the filtered result in grayscale

for sigma in range(2,10, 2):
result = ndimage.gaussian_laplace(img, sigma=sigma)
plt.title('LoG with zero-crossing, sigma=' + str(sigma), size=30)


Original Input Imagezebras.jpg

Output with edges detected with LOG + zero-crossing at different sigma scales


With another input image

Output with edges detected with LOG + zero-crossing at different sigma scales



Constructing the Gaussian Pyramid with scikit-image transform module’s reduce function and Laplacian Pyramid from the Gaussian Pyramid and the expand function

The Gaussian Pyramid can be computed with the following steps:

  1. Start with the original image.
  2. Iteratively compute the image at each level of the pyramid first by smoothing the image (with gaussian filter) and then downsampling it .
  3. Stop at a level where the image size becomes sufficiently small (e.g., 1×1).

The Laplacian Pyramid can be computed with the following steps:

  1. Start with the Gaussian Pyramid and with the smallest image.
  2. Iteratively compute the difference image in between the image at the current level and the image obtained by first upsampling and then smoothing the image (with gaussian filter) from the previous level of the Gaussian Pyramid.
  3. Stop at a level where the image size becomes equal to the original image size.

The next python code shows how to create a Gaussian Pyramid from an image.

import numpy as np
import matplotlib.pyplot as plt
from import imread
from skimage.color import rgb2gray
from skimage.transform import pyramid_reduce, pyramid_expand, resize

def get_gaussian_pyramid(image):
    rows, cols, dim = image.shape
    gaussian_pyramid = [image]
    while rows> 1 and cols > 1:
        image = pyramid_reduce(image, downscale=2)
        rows //= 2
        cols //= 2
    return gaussian_pyramid

def get_laplacian_pyramid(gaussian_pyramid):
    laplacian_pyramid = [gaussian_pyramid[len(gaussian_pyramid)-1]]
    for i in range(len(gaussian_pyramid)-2, -1, -1):
        image = gaussian_pyramid[i] - resize(pyramid_expand(gaussian_pyramid[i+1]), gaussian_pyramid[i].shape)
    laplacian_pyramid = laplacian_pyramid[::-1]
    return laplacian_pyramid

image = imread('../images/antelops.jpeg')
gaussian_pyramid = get_gaussian_pyramid(image)
laplacian_pyramid = get_laplacian_pyramid(gaussian_pyramid)

w, h = 20, 12
for i in range(3):
    p = gaussian_pyramid[i]
    plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
    w, h = w / 2, h / 2

w, h = 10, 6
for i in range(1,4):
    p = laplacian_pyramid[i]
    plt.imshow(rgb2gray(p), cmap='gray')
    plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
    w, h = w / 2, h / 2


Some images from the Gaussian Pyramid


Some images from the Laplacian Pyramid



Blending images with Gaussian and Laplacian pyramids

Here is the algorithm:


Blending the following input images A, B with mask image M

Input Image A (Goddess Durga)madurga1

Input Image B (Lord Shiva)

Mask Image M


with the following python code creates the output image I shown below

A = imread('../images/madurga1.jpg')/255
B = imread('../images/mahadeb.jpg')/255
M = imread('../images/mask1.jpg')/255
# get the Gaussian and Laplacian pyramids, implement the functions
pyramidA = get_laplacian_pyramid(get_gaussian_pyramid(A))
pyramidB = get_laplacian_pyramid(get_gaussian_pyramid(B))
pyramidM = get_gaussian_pyramid(M)

pyramidC = []
for i in range(len(pyramidM)):
   im = pyramidM[i]*pyramidA[i] + (1-pyramidM[i])*pyramidB[i]
# implement the following function to construct an image from its laplacian pyramids
I = reconstruct_image_from_laplacian_pyramid(pyramidC)

Output Image I (Ardhanarishwar)
ardhanariswara_output.pngThe following animation shows how the output image is formed:




Another blending (horror!) example (from prof. dmartin)



Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

  • First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
  • Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
  • Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
  • The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:


or as the following if the explicit bias term is discarded:


where the 0-1 loss is approximated by the hinge-loss.


  • Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
  • The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
  • Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
  • The learning rate η is gradually decreased with iteration.


The following figure shows a simplified version of the algorithm:




The following python code implements the algorithm:

class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
       self.n_iter = n_iter
       self.lambda1 = lambda1

    def fit(self, X, Y):
       Y = list(Y)
       # convert all output values to +1 or -1
       Yn = [sign(y, self.positive_class) for y in Y]
       X = X.toarray()
       m, n_features = X.shape[0], X.shape[1]
       self.w = numpy.zeros( n_features )
       for i in range(self.n_iter):
           eta = 1. / (self.lambda1*(i+1))
           j = numpy.random.choice(m, 1)[0]
           x, y = X[j], Yn[j]
           score =
           if y*score < 1:
              self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
              self.w = (1 - eta*self.lambda1)*self.w

Some Notes

  • The optional projection step has been left out (the line in square brackets in the paper).
  •  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
  •  The number η is the step length in gradient descent.
  • The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
  • Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
  • Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
  • Classification accuracy obtained on the test split was 95%.


2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.


Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1 flame

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
  • Classification accuracy obtained on the test split was 95.8%.


Dataset 2

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
  • Classification accuracy obtained on the test split was 100%.


3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

  • The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
  • Classification accuracy obtained on the test split was 97.5%.


4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.


The next figures show the distribution of the word lengths for the texts for different sentiments.



  • Create sklearn Pipeline to
    • Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
    • Use sklearn SelectKBest to choose top 5000 features
    • Train / test the model
  • Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
  • Also Record the time taken to train the classifier.
  • Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.


The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
    classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
    tokenizer = lambda x: x)),
    ('fs', SelectKBest(k=5000)),
    ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
    t0 = time.time(), Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))


Some Notes

  • The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
  • This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
  • The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
  • For Perceptron the entire training data was passed 10 times to the model while training.
  • For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.


As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.




Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.


Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:


  • The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
  • The 4-grams are chosen so that all the words involved come from a small
    vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
  • Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

load data.mat
ans =
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left

  • The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
  • Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

  • The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
  • Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


load data.mat
[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
%ans =
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
%ans = 91
%ans = we

  • The validation and test data are also similar. They contain 46,568 4-grams each.
  • Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
  • The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
  • First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
  • At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
  • The training method will output a ‘model’ (weight matrices, biases for each layer in the network).


Description of the Network


  • As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
  • The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
  • This layer is connected to the hidden layer, which in turn is connected to the output layer.
  • The output layer is a softmax over the 250 words.
  • The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.


Forward Propagation

  • The forward propagation is pretty straight-forward and can be implemented as shown in the following code:
    function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
     fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
     hid_to_output_weights, hid_bias, output_bias)
    % This method forward propagates through a neural network.
    % Inputs:
    % input_batch: The input data as a matrix of size numwords X batchsize where,
    % numwords is the number of words, batchsize is the number of data points.
    % So, if input_batch(i, j) = k then the ith word in data point j is word
    % index k of the vocabulary.
    % word_embedding_weights: Word embedding as a matrix of size
    % vocab_size X numhid1, where vocab_size is the size of the vocabulary
    % numhid1 is the dimensionality of the embedding space.
    % embed_to_hid_weights: Weights between the word embedding layer and hidden
    % layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
    % number of hidden units.
    % hid_to_output_weights: Weights between the hidden layer and output softmax
    % unit as a matrix of size numhid2 X vocab_size
    % hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
    % output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
    % Outputs:
    % embedding_layer_state: State of units in the embedding layer as a matrix of
    % size numhid1*numwords X batchsize
    % hidden_layer_state: State of units in the hidden layer as a matrix of size
    % numhid2 X batchsize
    % output_layer_state: State of units in the output layer as a matrix of size
    % vocab_size X batchsize
    [numwords, batchsize] = size(input_batch);
    [vocab_size, numhid1] = size(word_embedding_weights);
    numhid2 = size(embed_to_hid_weights, 2);
    % Look up the inputs word indices in the word_embedding_weights matrix.
    embedding_layer_state = reshape(...
     word_embedding_weights(reshape(input_batch, 1, []),:)',...
     numhid1 * numwords, []);
    % Compute inputs to hidden units.
    inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
     repmat(hid_bias, 1, batchsize);
    % Apply logistic activation function.
    hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);
    % Compute inputs to softmax.
    inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);
    % Subtract maximum.
    % Remember that adding or subtracting the same constant from each input to a
    % softmax unit does not affect the outputs. Here we are subtracting maximum to
    % make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
    % exponents.
    inputs_to_softmax = inputs_to_softmax...
     - repmat(max(inputs_to_softmax), vocab_size, 1);
    % Compute exp.
    output_layer_state = exp(inputs_to_softmax);
    % Normalize to get probability distribution.
    output_layer_state = output_layer_state ./ repmat(...
     sum(output_layer_state, 1), vocab_size, 1);




  •  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.



  • As the model trains it prints out some numbers that tell how well the training is going.
  • The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
  • After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
  • The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
  • At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.


Some Applications

1. Predict next word

  • Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
  • The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
  • It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.




2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

  1. Given 3 words to start from, initialize the text with those 3 words.
  2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
  3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
  4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:

function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');


Starting with the words  'i was going‘, here are some texts that were generated using the model:


Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:


3. Find nearest words

  •  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
  • As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
  • For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.




4. Visualization in 2-dimension with t-SNE

  •  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
  • As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
  • Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).



5. Solving Word-Analogy Problem

  •  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
  • The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.


  • The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.
    analogy('year', 'years', 'day', model); % singular-plural relation
    %dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186
    analogy('on', 'off', 'new', model) % antonyms relation
    %dist_E('on','off')=2.013958, dist_E('new','old')=2.265665
    analogy('use', 'used', 'do', model) % present-past relation
    %dist_E('use','used')=2.556175, dist_E('do','did')=2.456098
    analogy('he', 'his', 'they', model) % pronoun-relations
    %dist_E('he','his')=3.824808, dist_E('they','their')=3.825453
    analogy('today', 'yesterday', 'now', model)
    %dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


Model Selection

  • Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
  • The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
  • The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.



Implementing Lucas-Kanade Optical Flow algorithm in Python

In this article an implementation of the Lucas-Kanade optical flow algorithm is going to be described. This problem appeared as an assignment in this computer vision course from UCSD. The inputs will be sequences of images (subsequent frames from a video) and the algorithm will output an optical flow field (u, v) and trace the motion of the moving objects. The problem description is taken from the assignment itself.


Problem Statement

Single-Scale Optical Flow

  • Let’s implement the single-scale Lucas-Kanade optical flow algorithm. This involves finding the motion (u, v) that minimizes the sum-squared error of the brightness constancy equations for each pixel in a window.  The algorithm will be implemented as a function with the following inputs:

     def optical_flow(I1, I2, window_size, tau) # returns (u, v)

  • Here, u and v are the x and y components of the optical flow, I1 and I2 are two images taken at times t = 1 and t = 2 respectively, and window_size is a 1 × 2 vector storing the width and height of the window used during flow computation.
  • In addition to these inputs, a theshold τ should be added, such that if τ is larger than the smallest eigenvalue of A’A, then the the optical flow at that position should not be computed. Recall that the optical flow is only valid in regions where

has rank 2, which is what the threshold is checking. A typical value for τ is 0.01.

  • We should try experimenting with different window sizes and find out the tradeoffs associated with using a small vs. a large window size.
  • The following figure describes the algorithm, which considers a nxn (n>=3) window around each pixel and solves a least-square problem to find the best flow vectors for the pixel.


  • The following code-snippet shows how the algorithm is implemented in python for a gray-level image.
import numpy as np
from scipy import signal
def optical_flow(I1g, I2g, window_size, tau=1e-2):

    kernel_x = np.array([[-1., 1.], [-1., 1.]])
    kernel_y = np.array([[-1., -1.], [1., 1.]])
    kernel_t = np.array([[1., 1.], [1., 1.]])#*.25
    w = window_size/2 # window_size is odd, all the pixels with offset in between [-w, w] are inside the window
    I1g = I1g / 255. # normalize pixels
    I2g = I2g / 255. # normalize pixels
    # Implement Lucas Kanade
    # for each point, calculate I_x, I_y, I_t
    mode = 'same'
    fx = signal.convolve2d(I1g, kernel_x, boundary='symm', mode=mode)
    fy = signal.convolve2d(I1g, kernel_y, boundary='symm', mode=mode)
    ft = signal.convolve2d(I2g, kernel_t, boundary='symm', mode=mode) +
         signal.convolve2d(I1g, -kernel_t, boundary='symm', mode=mode)
    u = np.zeros(I1g.shape)
    v = np.zeros(I1g.shape)
    # within window window_size * window_size
    for i in range(w, I1g.shape[0]-w):
        for j in range(w, I1g.shape[1]-w):
            Ix = fx[i-w:i+w+1, j-w:j+w+1].flatten()
            Iy = fy[i-w:i+w+1, j-w:j+w+1].flatten()
            It = ft[i-w:i+w+1, j-w:j+w+1].flatten()
            #b = ... # get b here
            #A = ... # get A here
            # if threshold τ is larger than the smallest eigenvalue of A'A:
            nu = ... # get velocity here

    return (u,v)


Some Results

  • The following figures and animations show the results of the algorithm on a few image sequences. Some of these input image sequences / videos are from the course and some are collected from the internet.
  • As can be seen, the algorithm performs best if the motion of the moving object(s) in between consecutive frames is slow. To the contrary, if the motion is large, the algorithm fails and we should implement / use multiple-scale version Lucas-Kanade with image pyramids.
  • Finally,  with small window size,  the algorithm captures subtle motions but not large motions. With large size it happens the other way.

Input Sequences



Output Optical Flow with different window sizes

window size = 15


window size = 21



Input Sequences

Output Optical Flow


Input Sequences (hamburg taxi)


Output Optical Flowtaxi_opt


Input Sequences


Output Optical Flow

Input Sequences


Output Optical Flowseq_opt

Input Sequences    fount3.gif


Output Optical Flowfount_opt

Input Sequences

Output Optical Flow

Input Sequencessynth

Output Optical Flowsynth_opt

Input Sequencescars1
Output Optical Flowcars1_optcars1_cmap

Input Sequencescars2

Output Optical Flowcars2_opt

Output Optical Flowcars2_opt2cars2_cmap

Input Sequences



Output Optical Flow with window size 45

Output Optical Flow with window size 10

Output Optical Flow with window size 25
Output Optical Flow with window size 45cars3_opt2_45


Efficient Graph-Based Image Segmentation in Python

In this article, an implementation of an efficient graph-based image segmentation technique will be described, this algorithm was proposed by Felzenszwalb et. al. from MIT in this paper.  The slides on this paper can be found from this link from the Stanford Vision Lab too. The algorithm is closely related to Kruskal’s algorithm for constructing a minimum spanning tree of a graph, as stated by the author and hence can  be implemented to run in O(m log m) time, where m is the number of edges in the graph.

Problem Definition and the basic idea (from the paper)

  • Let G = (V, E) be an undirected graph with vertices vi ∈ V, the set of elements to be segmented, and edges (vi, vj ) ∈ E corresponding to pairs of neighboring vertices. Each edge (vi, vj ) ∈ E has a corresponding weight w((vi, vj)), which is a non-negative measure of the dissimilarity between neighboring elements vi and vj.

  • In the case of image segmentation, the elements in V are pixels and the weight of an edge is some measure of the dissimilarity between the two pixels connected by that edge (e.g., the difference in intensity, color, motion, location or some other local attribute).

  • Particularly for the implementation described here, an edge weight function based on the absolute intensity difference (in the yiq space) between the pixels connected by an edge, w((vi, vj )) = |I(pi) − I(pj )|.

  • In the graph-based approach, a segmentation S is a partition of V into components
    such that each component (or region) C ∈ S corresponds to a connected component
    in a graph G0 = (V, E0), where E0 ⊆ E.

  • In other words, any segmentation is induced by a subset of the edges in E. There are different ways to measure the quality of a segmentation but in general we want the elements in a component to be similar, and elements in different components to be dissimilar.

  • This means that edges between two vertices in the same component should have relatively low weights, and edges between vertices in different components should have higher weights.

  • The next figure shows the steps in the algorithm. The algorithm is very similar to Kruskal’s algorithm for computing the MST for an undirected graph.


  • The threshold function τ controls the degree to which the difference between two
    components must be greater than their internal differences in order for there to be
    evidence of a boundary between them.

  • For small components, Int(C) is not a good estimate of the local characteristics of the data. In the extreme case, when |C| = 1, Int(C) = 0. Therefore, a threshold function based on the size of the component, τ (C) = k/|C| is needed to be usedwhere |C| denotes the size of C, and k is some constant parameter.

  • That is, for small components we require stronger evidence for a boundary. In practice k sets a scale of observation, in that a larger k causes a preference for larger components.

  • In general, a Gaussian filter is used to smooth the image slightly before computing the edge weights, in order to compensate for digitization artifacts. We always use a Gaussian with σ = 0.8, which does not produce any visible change to the image but helps remove artifacts.

  • The following python code shows how to create the graph.

import numpy as np
from scipy import signal
import matplotlib.image as mpimg

def gaussian_kernel(k, s = 0.5):
    # generate a (2k+1)x(2k+1) gaussian kernel with mean=0 and sigma = s
    probs = [exp(-z*z/(2*s*s))/sqrt(2*pi*s*s) for z in range(-k,k+1)]
return np.outer(probs, probs)

def create_graph(imfile, k=1., sigma=0.8, sz=1):
    # create the pixel graph with edge weights as dissimilarities
     rgb = mpimg.imread(imfile)[:,:,:3]
     gauss_kernel = gaussian_kernel(sz, sigma)
     for i in range(3):
         rgb[:,:,i] = signal.convolve2d(rgb[:,:,i], gauss_kernel, boundary='symm', mode='same')
     yuv = rgb2yiq(rgb)
     (w, h) = yuv.shape[:2]
     edges = {}
     for i in range(yuv.shape[0]):
         for j in range(yuv.shape[1]):
             #compute edge weight for nbd pixel nodes for the node i,j
             for i1 in range(i-1, i+2):
                 for j1 in range(j-1, j+2):
                     if i1 == i and j1 == j: continue

                     if i1 >= 0 and i1 = 0 and j1 < h:
                        wt = np.abs(yuv[i,j,0]-yuv[i1,j1,0])
                        n1, n2 = ij2id(i,j,w,h), ij2id(i1,j1,w,h)
                        edges[n1, n2] = edges[n2, n1] = wt
     return edges


Some Results

  • The images are taken from the paper itself or from the internet. The following figures and animations show the result of segmentation as a result of iterative merging of the components (by choosing least weight edges), depending on the internal difference of the components.

  • Although in the paper the author described the best value of the parameter k to be around 300, but  since in this implementation the pixel RGB values are normalized (to have values in between 0 – 1) and then converted to YIQ values and the YIQ intensities are used for computing the weights (which are typically very small), the value of k that works best in this scenario is 0.001-0.01.

  • As we can see from the below results, higher the value of the parameter k, larger the size of the final component and lesser the number of components in the result.

  • The minimum spanning tree creation is also shown, the red edges shown in the figures are the edges chosen by the algorithm to merge the components.

Input Image


Output Images for two different values of the parameter k



out_ 0.001player.png

out_ 0.010player.png

The forest created after a few iterations


Input Image


Output Images for two different values of the parameter k


out_ 0.010hill.png


out_ 0.001hill.png

The forest created after a few iterations


Input Image


Output Segmented Images

parrot_k_0.001out_ 0.001parrot_Dark2.pngout_ 0.001parrot_hot.png

Input Image


Segmented Output Images

out_ 0.001road2_Set1.png

Input Image


Output Images for two different values of the parameter k


out_ 0.001road.png
road_k_0.01out_ 0.010road.png

The forest created after a few iterations



  Input Image (UMBC)


Segmented Output Images

umbc_k_0.001out_ 0.001umbc.png

    Input Image


Segmented Output Image with k=0.001

out_ 0.001nj.png

    Input Image


Segmented Output Image with k=0.001


    Input Image


Segmented Output Image with k=0.001

out_ 0.001flowers2.png

    Input Image (liver)


Segmented Output Images 

out_ 0.001liver.pngout_ 0.010liver.png

    Input Image


Segmented Output Images with different values of k


out_ 0.001building.pngout_ 0.010building.png

    Input Image


Segmented Output Image

out_ 0.001frame056.png

    Input Image


Segmented Output Images for different k



Image Colorization Using Optimization in Python


This article is inspired by this SIGGRAPH paper by Levin et. al, for which they took this patent , the paper was referred to in the course CS1114 from Cornell.  This method is also discussed in the coursera online image processing course by NorthWestern University. Some part of the problem description is taken from the paper itself. Also, one can refer to the implementation provided by the authors in matlab, the following link and the following python implementation in github.


The Problem

Colorization is a computer-assisted process of adding color to a monochrome image or movie. In the paper the authors presented an optimization-based colorization method that is based on a simple premise: neighboring pixels in space-time that have similar intensities should have similar colors.

This premise is formulated using a quadratic cost function  and as an optimization problem. In this approach an artist only needs to annotate the image with a few color scribbles, and the indicated colors are automatically propagated in both space and time to produce a fully colorized image or sequence.

In this article the optimization problem formulation and the way to solve it to obtain the automatically colored image will be described for the images only.


The Algorithm

YUV/YIQ color space is chosen to work in, this is commonly used in video, where Y is the monochromatic luminance channel, which we will refer to simply as intensity, while U and V are the chrominance channels, encoding the color.

The algorithm is given as input an intensity volume Y(x,y,t) and outputs two color volumes U(x,y,t) and V(x,y,t). To simplify notation the boldface letters are used (e.g. r,s) to denote \left(x,y,t \right) triplets. Thus, Y(r) is the intensity of a particular pixel.

Now, One needs to impose the constraint that two neighboring pixels r,s should have similar colors if their intensities are similar. Thus, the problem is formulated as the following optimization problem that aims to minimize the difference between the  colorU(r) at pixel r and the weighted average of the colors at neighboring pixels, where w(r,s) is a weighting function that sums to one, large when Y(r) is similar to Y(s), and small when the two intensities are different. 


When the intensity is constant the color should be constant, and when the intensity is an edge the color should also be an edge. Since the cost functions are quadratic and
the constraints are linear, this optimization problem yields a large, sparse system of linear equations, which may be solved using a number of standard methods.

As discussed in the paper, this algorithm is closely related to algorithms proposed for other tasks in image processing. In image segmentation algorithms based on normalized cuts [Shi and Malik 1997], one attempts to find the second smallest eigenvector of the matrix D − W where W is a npixels×npixels matrix whose elements are the pairwise affinities between pixels (i.e., the r,s entry of the matrix is w(r,s)) and D is a diagonal matrix whose diagonal elements are the sum of the affinities (in this case it is always 1). The second smallest eigenvector of any symmetric matrix A is a unit norm vector x that minimizes x^{T}Ax and is orthogonal to the first eigenvector. By direct  inspection, the quadratic form minimized by normalized cuts is exactly the cost function J, that is x^{T}(D-W)x =J(x). Thus, this algorithm minimizes the same cost function but under different constraints. In image denoising algorithms based on anisotropic diffusion [Perona and Malik 1989; Tang et al. 2001] one often minimizes a  function
similar to equation 1, but the function is applied to the image intensity as well.

The following figures show an original gray-scale image and the marked image with color-scribbles that are going to be used to compute the output colored image.

Original Gray-scale Image Input

Gray-scale image Input Marked with Color-Scribbles 

Implementation of the Algorithm

Here are the the steps for the algorithm:

  1. Convert both the original gray-scale image and the marked image (marked with color scribbles for a few pixels by the artist) to from RGB to YUV / YIQ color space.
  2. Compute the difference image from the marked and the gray-scale image. The pixels that differ  are going to be pinned and they will appear in the output, they are directly going to be used as stand-alone constraints in the minimization problem.
  3. We need to compute the weight matrix W that depends on the similarities in the neighbor intensities for a pixel from the original gray-scale image.
  4. The optimization problem finally boils down to solving the system of linear equations of the form WU = b that has a closed form least-square solution
    U = W^{+}b = {(W^{T}W)}^{-1}W^{T}b. Same thing is to be repeated for the V channel too.
  5. However the W matrix is going to be very huge and sparse, hence sparse-matrix based implementations must be used to obtain an efficient solution. However, in this python implementation in github, the scipy sparse lil_matrix was used when constructing the sparse matrices, which is quite slow, we can construct more efficient scipy csc matrix rightaway, by using a dictionary to store the weights initially. It is much faster. The python code in the next figure shows my implementation for computing the weight matrix W.
  6. Once W is computed it’s just a matter of obtaining the least-square solution, by computing the pseudo-inverse, which can be more efficiently computed with LU factorization and a sparse LU solver, as in this  python implementation in github.
  7. Once the solution of the optimization problem is obtained in YUV / YIQ space, it needs to be converted back to RGB. The following formula is used for conversion.rgbyiq.png

import scipy.sparse as sp
from collections import defaultdict

def compute_W(Y, colored):

    (w, h) = Y.shape
    W = defaultdict()
    for i in range(w):
        for j in range(h):
            if not (i, j) in colored: # pixel i,j in color scribble
            (N, sigma) = get_nbrs(Y, i, j, w, h) #r = (i, j)
            Z = 0.
            id1 = ij2id(i,j,w,h) # linearized id
            for (i1, j1) in N: #s = (i1, j1)
                id2 = ij2id(i1,j1,w,h)
                W[id1,id2] = np.exp(-(Y[i,j]-Y[i1,j1])**2/(sigma**2)) if sigma > 0 else 0.
                Z += W[id1,id2]
            if Z > 0:
               for (i1, j1) in N: #s = (i1, j1)
                   id2 = ij2id(i1,j1,w,h)
                   W[id1,id2] /= -Z
     for i in range(w):
        for j in range(h):
            id = ij2id(i,j,w,h)
            W[id,id] = 1.

    rows, cols = zip(*(W.keys())) #keys
    data = W.values() #[W[k] for k in keys]
    return sp.csc_matrix((data, (rows, cols)), shape=(w*h, w*h)) #W



The following images and animations show the results obtained with the optimization algorithm. Most of the following images are taken from the paper itself.

Original Gray-scale Image Input                 Gray-scale image Input Markedbaby  baby_marked

Color image Output

The next animations show how the incremental scribbling results in better and better color images.

Original Gray-scale Image Input 


As can be seen from the following animation, the different parts of the building get colored as more and more color-hints are scribbled / annotated.

Gray-scale image Input Markedhouse_marked

Color image Output


Original Gray-scale Image Input 


Gray-scale image Input Marked

Color image Output

Original Gray-scale Image Input (me)me3_gray

Gray-scale image Input Marked incrementally

Color image Output

Original Gray-scale Image Input

Gray-scale image Input Marked
Color image Output

Original Gray-scale Image Input


Gray-scale image Input Marked


Color image Output



Recursive Graphics, Bi/Tri-linear Interpolation, Anti-aliasing and Image Transformation in Python

The following problem appeared in an assignment in the Princeton course COS 126 . The problem description is taken from the course itself.

Recursive Graphics

Write a program that plots a Sierpinski triangle, as illustrated below. Then develop a program that plots a recursive patterns of your own design.


Part 1. 

The Sierpinski triangle is an example of a fractal pattern. The pattern was described by Polish mathematician Waclaw Sierpinski in 1915, but has appeared in Italian art since the 13th century. Though the Sierpinski triangle looks complex, it can be generated with a short recursive program.

Examples. Below are the target Sierpinski triangles for different values of order N.


Our task is to implement a recursive function sierpinski(). We need to think recursively: our function should draw one black triangle (pointed downwards) and then call itself recursively 3 times (with an appropriate stopping condition). When writing our program, we should exercise modular design.

The following code shows an implementation:

class Sierpinski:
#Height of an equilateral triangle whose sides are of the specified length.
def height (self, length):
return sqrt(3) * length / 2.
#Draws a filled equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def filledTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length/2., x-length/2.]), np.array([y, y+h, y+h]), alpha=1)
#Draws an empty equilateral triangle whose bottom vertex is (x, y)
#of the specified side length.
def emptyTriangle(self, x, y, length):
h = self.height(length)
draw(np.array([x, x+length, x-length]), np.array([y+2*h, y, y]), alpha=0)
# Draws a Sierpinski triangle of order n, such that the largest filled
# triangle has bottom vertex (x, y) and sides of the specified length.
def sierpinski(self, n, x, y, length):
self.filledTriangle(x, y, length)
if n > 1:
self.sierpinski(n-1, x-length/2., y, length/2.)
self.sierpinski(n-1, x+length/2., y, length/2.)
self.sierpinski(n-1, x, y+self.height(length), length/2.)

The following animation shows how such a triangle of order 5 is drawn recursively.

The following animation shows how such a triangle of order 6 is drawn recursively.


A diversion: fractal dimension.

Formally, we can define the Hausdorff dimension or similarity dimension of a self-similar figure by partitioning the figure into a number of self-similar pieces of smaller size. We define the dimension to be the log (# self similar pieces) / log (scaling factor in each spatial direction). For example, we can decompose the unit square into 4 smaller squares, each of side length 1/2; or we can decompose it into 25 squares, each of side length 1/5. Here, the number of self-similar pieces is 4 (or 25) and the scaling factor is 2 (or 5). Thus, the dimension of a square is 2 since log (4) / log(2) = log (25) / log (5) = 2. We can decompose the unit cube into 8 cubes, each of side length 1/2; or we can decompose it into 125 cubes, each of side length 1/5. Therefore, the dimension of a cube is log(8) / log (2) = log(125) / log(5) = 3.

We can also apply this definition directly to the (set of white points in) Sierpinski triangle. We can decompose the unit Sierpinski triangle into 3 Sierpinski triangles, each of side length 1/2. Thus, the dimension of a Sierpinski triangle is log (3) / log (2) ≈ 1.585. Its dimension is fractional—more than a line segment, but less than a square! With Euclidean geometry, the dimension is always an integer; with fractal geometry, it can be something in between. Fractals are similar to many physical objects; for example, the coastline of Britain resembles a fractal, and its fractal dimension has been measured to be approximately 1.25.


Part 2.

Drawing a tree recursively, as described here:


The following shows how a tree of order 10 is drawn:



The next problem appeared in an assignment in the Cornell course CS1114 . The problem description is taken from the course itself.

Bilinear Interpolation

Let’s consider a 2D matrix of values at integer grid locations (e.g., a grayscale image). To interpolate values on a 2D grid, we can use the 2D analogue of linear interpolation: bilinear interpolation. In this case, there are four neighbors for each possible point we’d like to interpolation, and the intensity values of these four neighbors are all combined to compute the interpolated intensity, as shown in the next figure.


In the figure, the Q values represent intensities. To combine these intensities, we perform linear interpolation in multiple directions: we first interpolate in the x direction (to get the value at the blue points), then in the y direction (to get the value at the green points).

Image transformations

Next, we’ll use the interpolation function to help us implement image transformations.

A 2D affine transformation can be represented with a 3 ×3 matrix T:


Recall that the reason why this matrix is 3×3, rather than 2 ×2, is that we operate in homogeneous coordinates; that is, we add an extra 1 on the end of our 2D coordinates (i.e., (x,y) becomes (x,y,1)), in order to represent translations with a matrix multiplication. To apply a transformation T to a pixel, we multiply T by the pixel’s location:


The following figure shows a few such transformation matrices:


To apply a transformation T to an entire image I, we could apply the transformation to each of I’s pixels to map them to the output image. However, this forward warping procedure has several problems. Instead, we’ll use inverse mapping to warp the pixels of the output image back to the input image. Because this won’t necessarily hit an integer-valued location, we’ll need to use the (bi-linear) interpolation to determine the intensity of the input image at the desired location, as shown in the next figure.


To demo the transformation function, let’s implement the following on a gray scale bird image:

1. Horizontal flipping


2. Scaling by a factor of 0.5

3. Rotation by 45 degrees around the center of the image


The next animations show rotation and sheer transformations with the Lena image:



Next, let’s implement a function to transform RGB images. To do this, we need to simply call transform image three times, once for each channel, then put the results together into a single image.  Next figures and animations show some results on an RGB image.


non-linear transformations

The next figure shows the transform functions from here:


The next figures and animations show the application of the above non-linear transforms on the Lena image.









Some more non-linear transforms:




There is a problem with our interpolation method above: it is not very good at shrinking images, due to aliasing. For instance, if let’s try to down-sample the following bricks image by a factor of 0.4, we get the image shown in the following figure: notice the strange banding effects in the output image.

Original Image


Down-sampled Image with Bilinear Interpolation

The problem is that a single pixel in the output image corresponds to about 2.8 pixels in the input image, but we are sampling the value of a single pixel—we should really be averaging over a small area.

To overcome this problem, we will create a data structure that will let us (approximately) average over any possible square regions of pixels in the input image: an image stack. An image stack is a 3D matrix that we can think of as, not surprisingly, a stack of images, one on top of the other. The top image in the cube will be the original input image. Images further down the stack will be the input image with progressively larger amounts of blur. The size of the matrix will be rows × cols × num levels, where the original (grayscale) image has size rows×cols and there are num levels images in the stack.

Before we use the stack, we must write a function to create it, which takes as input a (grayscale) image and a number of levels in the stack, and returns a 3D matrix stack corresponding to the stack. Again, the first image on the stack will be the original image. Every other image in the stack will be a blurred version of the previous image. A good blur kernel to use is:


Now, for image k in the stack, we know that every pixel is a (weighted) average of some number of pixels (a k × k patch, roughly speaking) in the input image. Thus, if we down-sample the image by a factor of k, we want to sample pixels from level k of the stack.

⇒  Let’s write the following function to create image stack that takes a grayscale image and a number max levels, and returns an image stack.

from scipy import signal

def create_image_stack(img, max_level):
K = np.ones((3,3)) / 9.
image_stack = np.zeros((img.shape[0], img.shape[1], max_level))
image_stack[:,:,0] = img
for l in range(1, max_level):
image_stack[:,:,l] = signal.convolve2d(image_stack[:,:,l-1], K,
boundary=’symm’, mode=’same’)
return image_stack

The next animation shows the image stack created from the bricks image.



Trilinear Interpolation

Now, what happens if we down-sample the image by a fractional factor, such as 3.6? Unfortunately, there is no level 3.6 of the stack. Fortunately, we have a tool to solve this problem: interpolation. We now potentially need to sample a value at position (row,col,k) of the image stack, where all three coordinates are fractional. We therefore something more powerful than bilinear interpolation: trilinear interpolation! Each position we want to sample now has eight neighbors, and we’ll combine all of their values together in a weighted sum.

This sounds complicated, but we can write this in terms of our existing functions. In particular, we now interpolate separately along different dimensions: trilinear interpolation can be implemented with two calls to bilinear interpolation and one call to linear interpolation.

Let’s implement a function trilerp like the following that takes an image stack, and a row, column, and stack level k, and returns the interpolated value.

def trilerp (img_stack, x, y, k):

if k < 1: k = 1
if  k == int(k):
return bilerp(img_stack[:,:,k-1], x, y)
f_k, c_k = int(floor(k)), int(ceil(k))
v_f_k = bilerp(img_stack[:,:,f_k-1], x, y)
v_c_k = bilerp(img_stack[:,:,c_k-1], x, y)
return linterp(k, f_k, c_k, v_f_k, v_c_k)

Now we can finally implement a transformation function that does proper anti-aliasing. In order to do this, let’s implement a function that will

  • First compute the image stack.
  • Then compute, for the transformation T, how much T is scaling down the image. If T is defined by the six values a,b,c,d,e,f above, then, to a first approximation, the downscale factor is:
    However, if k < 1 (corresponding to scaling up the image), we still want to sample from level 1. This situation reverts to normal bilinear interpolation.
  • Next call the trilerp function on the image stack, instead of bilerp on the input image.

The next figure shows the output image obtained image transformation with proper anti-aliasing:

Down-sampled Image with Anti-aliasing using Trilinear Interpolationout_scale_t.png

As we can see from the above output, the aliasing artifact has disappeared.

The same results are obtained on the color image, as shown below, by applying the trilerp function on the color channels separately and creating separate image stacks for different color channels.

Original Image

Down-sampled Image with Bilinear Interpolation

Down-sampled Image with Anti-aliasing

The following animation shows the branding artifacts created when using bilinear interpolation for  different scale factors and how they are removed with anti-aliasing.

Down-sampled Images with Bilinear Interpolation


Down-sampled Images with Anti-aliasing


Some Applications of Markov Chain in Python

In this article a few simple applications of Markov chain are going to be discussed as a solution to a few text processing problems. These problems appeared as assignments in a few courses, the descriptions are taken straightaway from the courses themselves.

1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course cos126 . This assignment was developed by Prof. Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with statistical properties of English text because there a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed with 'e' than with 'u', while 'q' is more likely to be followed with 'u' than with 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”

Our task is to write a python program to automate this laborious task in a more efficient way — Shannon’s brute-force approach is prohibitively slow when the size of the input text is large.

Markov model data type.
 Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:


  • Constructor. To implement the data type, create a symbol table, whose keys will be Stringk-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).
  • Order. Return the order k of the Markov Model.
  • Frequency. There are two frequency methods.
    • freq(kgram) returns the number of times the k-gram was found in the original text.
    • freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
  • Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
  • Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N2 time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
                next char       next char is 
kgram   freq    a   c   g        a    c    g
 aa      2      1   0   1       1/2   0   1/2  
 ag      5      3   0   2       3/5   0   2/5  
 cg      1      1   0   0        1    0    0
 ga      5      1   0   4       1/5   0   4/5  
 gc      1      0   0   1        0    0    1  
 gg      3      1   1   1       1/3  1/3  1/3  
        17      7   1   9

Taken from an example from the same assignment.

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.

Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/51/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2

Taken from an example from the same assignment.

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

A Python Implementation

import numpy as np
from collections import defaultdict

class MarkovModel:

    def __init__(self, text, k): 
        create a Markov model of order k from given text
        Assume that text has length at least k.
        self.k = k               
        self.tran = defaultdict(float)
        self.alph = list(set(list(text)))
        self.kgrams = defaultdict(int)
        n = len(text)
        text += text[:k]
        for i in range(n):
            self.tran[text[i:i+k],text[i+k]] += 1.
            self.kgrams[text[i:i+k]] += 1

    def order(self):                   # order k of Markov model
        return self.k

    def freq(self, kgram):             # number of occurrences of kgram in text
        assert len(kgram) == self.k    # (check if kgram is of length k)
        return self.kgrams[kgram]

    def freq2(self, kgram, c):   	# number of times that character c follows kgram
        assert len(kgram) == self.k     # (check if kgram is of length k)  
        return self.tran[kgram,c]  

    def rand(self, kgram):             # random character following given kgram
        assert len(kgram) == self.k    # (check if kgram is of length k.
        Z = sum([self.tran[kgram, alph] for alph in self.alph])
        return np.random.choice(self.alph, 1, p=np.array([self.tran[kgram, alph] for alph in self.alph])/Z)

    def gen(self, kgram, T):           # generate a String of length T characters
        assert len(kgram) == self.k    # by simulating a trajectory through the corresponding
        str = ''                       # Markov chain. The first k characters of the newly
        for _ in range(T):             # generated String should be the argument kgram.
             #print kgram, c           # check if kgram is of length k.
             c =  self.rand(kgram)[0]  # Assume that T is at least k.
             kgram = kgram[1:] + c     
             str += c			 
        return str

Some Results

m = MarkovModel('gagggagaggcgagaaa', 2)

generates the following MarkovChain where each state represents a 2-gram.



Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt


Generated output: random news item, using input as an order 7 model


Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(corrupted)  # replace unknown characters with most probable characters

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.  This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1 windows to maximize the log-likelihood:


Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.
Noisy :      it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.


2. Detecting authorship

This problem appeared as an assignment in the Cornell course cs1114 .

The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when
given a large sample size of text to train on, can then guess the author of an unknown

  • The algorithm to be implemented works based on the following idea: An author’s
    writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow – that is, which words he tends to use after other words.
  • To make things significantly simpler, we’re going to assume that the author always
    follows a given word with the same distribution of words. Of course, this isn’t true,
    since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.
  • In order to implement this model of an author’s writing style, we will use a Markov
    chain. A Markov chain is a set of states with the Markov property – that is, the
    probabilities of each state are independent from the probabilities of every other state. This behavior correctly models our assumption of word independence.
  • A Markov chain can be represented as a directed graph. Each node is a state (words,
    in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.
  • The edges, in this case, represent the probability that word j follows word i from the
    given author. This means, of course, that the sum of the weights of all edges leaving
    from each word must add up to 1.
  • We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.
  • We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some challenges for you to
    figure out.


Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We shall want to create a sparse array using the scipy csr sparse matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains 2 word frequencies (normalized by the total number of words in the document (including repeated words)).

Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous possibly small probabilities together, our
calculated likelihood will likely be extremely small. Thus, you should compare log(likelihood) instead. Keep in mind the possibility that the author may have used a word he has never used before. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.

Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. You will write this in a function classify text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best index.  The following figure shows how to detect an author k (A_k) of the test text t_1..t_n  using the transition matrix P_k with MLE :


Python Implementation

from np import log
def log0(x):
return 0 if x <= 0 else log(x)

def compute_text_likelihood(filename, T, dict_rev, histogram, index):

Compute the (log) likelihood L of a given string (in ‘filename’) given
a word transition probability T, dictionary ‘dict’, and histogram

text = word_tokenize(open(filename).read().replace(‘\n’, ‘ ‘).lower())
num_words = len(text)
text = [word for word in text if word in histogram] # keep only the words that are found in the training dataset
ll = log0(histogram[text[0]]) – log0(sum(histogram.values()))
for i in range(1, len(text)):
ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
return ll + (num_words – num_matches)*penalty

def classify_text(tmatrices, dict_revs, histograms, filename):

Return the index of the most likely author given the transition matrices, dictionaries, and a test file

for i in range(len(tmatrices)):
ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)

print i, ll

Training Dataset

The list of authors whose writings are there in the training dataset:

0. Austen
1. Carroll
2. Hamilton
3. Jay
4. Madison
5. Shakespeare
6. Thoreau
7. Twain


A few lines of excerpts from the training files (the
literature by several authors, taken from Project Gutenberg), the word clouds and a few states from the corresponding Markov Models Constructed for a few authors


Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with




Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep




Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)


So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.

My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.




Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became
head superintendent; had a couple of thousand men under me.

Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.

When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.






Classifying unknown texts from the Test Dataset

Each of the Markov models learnt from the training texts for each author are used to compute the log-likelihood of the unknown test text, the author with the maximum log-likelihood is chosen to be the likely author of the text.
Unknown Text1 

Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.

log-likelihood values computed for the probable authors

Author  LL
0           -3126.5812874
1           -4127.9155186
2           -7364.15782346
3           -9381.06336055
4           -7493.78440066
5           -4837.98005673
6           -3515.44028659
7           -3455.85716104

As can be seen from above the maximum likelihood value corresponds to the author 0, i.e., Austen.  Hence, the most probable author of the unknown text is Austen.



Unknown Text2 

Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
had been separated. But now, this thing had actually and unquestionably
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:

“Brother, go find your brother!”

He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each

log-likelihood values computed for the probable authors

Author  LL
0             -2779.02810424
1             -2738.09304225
2             -5978.83684489
3             -6551.16571407
4             -5780.39620942
5             -4166.34886511
6             -2309.25043697
7             -2033.00112729

As can be seen from above the maximum likelihood value corresponds to the author 7, i.e., Twain.  Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states corresponding to the Markov model for Twain trained from the training dataset.


Some Optimization: Implementing the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) Algorithms with Octave / Matlab


The following problems appeared in a project in the edX course 236862.1x: Introduction to the Fundamentals of Sparse Representations by Prof. Michael Elad from The Technion – Israel Institute of Technology.  The description of the problems are taken straightaway from the project.


Pursuit Algorithms

In this article we demonstrate the Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) algorithms by running them on a set of test signals and checking whether they provide the desired outcome for the P0 problem.

Some Theory 

Our goal is to solve the following problem:


which is also known as the P0 problem (since the objective function to be minimized is the L0 norm of a vector x):  a problem that seeks the sparsest solution to a linear system.  The concept of seeking the sparsest representation for data are of central importance to data processing in a universal way (e.g., in machine learning, image processing, computer vision, remote sensing), although the problem is NP-hard in general.

In order to solve it in practice, some approximation algorithm is needed to be used, either using greedy or relaxation based approaches, as shown in the next figure.


In this article the implementation  of the above two algorithms and some experimental results on approximately solving the P0 problem with some synthetic signal dataset will be described. The following figure describes how the experiments are to be done:



Data Construction

In this part we create a dictionary and synthetic signals that will serve for us in the experiments.  Here are the ingredients involved:

  • The matrix A is the “dictionary” (of size 50-by-100) under which the ideal signal is known to be sparse. We need to first construct this dictionary by drawing at random a matrix with i.i.d. Gaussian entries. Normalize each atom to a unit L2 norm.

  • Then we need to generate a sparse vector x0 of cardinality s. Let’s draw at random the  locations of the non-zero entries. Then, draw at random the value of each entry from a uniform distribution in the range [1, 3] and multiply each by a random sign.

The following figure shows the heatmap of the dictionary A randomly generated.


Greedy Pursuit

The greedy algorithm OMP is described in the following figure:



Basis Pursuit 


Even though the P0 problem is relaxed to P1, it’s still not ready to be solved by an LP Solver since, the objective function is still not linear. As explained in this thesis, the P1 problem can be converted to a LP by introducing a bunch of new variables and using the trick shown in the following figure.


The matlab /octave function to implement the BP algorithm using LP is shown below:


Analyzing the Results

In this part we compare the results obtained via the OMP and BP, by executing the following steps.

  • Plot the average relative ℓ2 error, obtained by the OMP and BP versus the cardinality.
  • Plot the average support recovery error, obtained by the OMP and BP versus the cardinality.
  • Discuss the presented graphs and explain the results.





As can be seen from the above results,

• The Linear Programming-based relaxed implementation of Basis Pursuit has higher accuracy (in terms of both L2 and support-probability errors) than the greedy Orthogonal Matching Pursuit, particularly when the cardinality of the true solution increases beyond cardinality 6.

• Both OMP and BP (LP) performs equally well (with almost zero L2 error) upto cardinality 6 and then OMP clearly performs worse than the BP (LP).

• Although the average L2 error for OMP increases upto 0.2 when the cardinality of the true solution increases to 15, whereas that of BP only increases very slightly.

• Similar pattern can be seen for the probability of error in support.

Mutual Coherence of A and theoretical guarantees for OMP and BP


where the mutual coherence of matrix A is defined as follows:


For our experiment, the mutual coherence μ(A)  = 0.45697. Hence, as per the theoretical guarantee, the OMP and BP are guaranteed to find the solution of P0 for s < 1.594164. Although in practice we observe that till s = 6 the solution found is quite accurate, since the theoretical bound shown above is for the worst-case only.