# Euler Method to solve Ordinary differential Equations with R

In this article, the simplest numeric method, namely the Euler method to solve 1st order ODEs will be demonstrated with examples (and then use it to solve 2nd order ODEs). Also, we shall see how to plot the phase lines (gradient fields) for an ODE and understand from examples how to qualitatively find a solution curve with the phaselines. All the problems are taken from the edx Course: MITx — 18.03x: Introduction to Differential Equations.

Let’s implement the above algorithm steps with the following R code snippet.


library(tidyverse)
library(latex2exp)
library(gridExtra)

Euler <- function(f, xn, yn, h, N, n=0) {
df <- NULL
An <- f(xn, yn)
while (n < N) {
df <- rbind(df, data.frame(n=n, Xn=xn, Yn=yn, An=An, hAn=h*An))
An <- f(xn, yn)
xn <- xn + h
yn <- yn + h*An
n <- n + 1
}
return(df)
}



Let’s define the function f next.


f <- function(x, y) {
return(y*sin(x))
}

df <- Euler(f, -3, 1, 0.1, 100)



The below table shows the solution curve computed by the Euler method for the first few iterations:

Let’s implement the following R functions to draw the isoclines and the phase lines for an ODE.


plot_isoclines <- function(f, rx, ry, rm, l, xgap, title) {
lx <- rx[1]
ux <- rx[2]
x <- seq(lx, ux, xgap)
idf <- NULL
for (m in rm) {
y <- f(x, m)
dx <- l / 2 / sqrt(1 + m^2)
idf % ggplot() +
geom_segment(aes(x = x0, y = y0, xend = x1, yend = y1, col = m)) +
xlim(rx[1], rx[2]) +
ylim(ry[1], ry[2]) +
scale_color_gradient2(low = "blue", mid = "green", high = "red") +
ggtitle(title) +
xlab('x') +
ylab('y') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
}



plot_phaselines <- function(f, rx, ry, dx, dy, l, ms=-1:1, title=NULL) {

lx <- rx[1]
ux <- rx[2]
ly <- ry[1]
uy <- ry[2]
x <- seq(lx, ux, dx)
y <- seq(ly, uy, dy)
pdf <- expand.grid(x = x, y = y)
pdf$m <- f(pdf$x, pdf$y) pdf$dx <- l / 2 / sqrt(1 + pdf$m^2) pdf$x0 <- pdf$x - pdf$dx
pdf$x1 <- pdf$x + pdf$dx pdf$y0 <- pdf$y - pdf$m*pdf$dx pdf$y1 <- pdf$y + pdf$m*pdf$dx pdf$m % ggplot() +
geom_segment(aes(x = x0, y = y0, xend = x1, yend = y1, col = m)) +
xlim(rx[1], rx[2]) +
ylim(ry[1], ry[2]) +
ggtitle(title) +
xlab('x') +
ylab('y') +
theme_bw() +
theme(plot.title = element_text(hjust = 0.5))
}



Now, let’s plot the 0 (the nullcline), 1, 2, 3 isoclines for the ODE: y’ = y.sin(x).


ifunc <- function(x, m) m / sin(x)

rx <- c(-4,4)
ry <- c(-3,3)
rm <- seq(-3,3,0.5)
l <- 0.2
xgap <- 0.05
ygap <- 0.05
plot_isoclines(ifunc, rx, ry, rm, l, xgap, TeX('m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))



xgap <- 0.25 #05
ygap <- 0.25 #05
plot_phaselines(f, rx, ry, xgap, ygap, l, -1:1, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))



Now, let’s solve the ODE with Euler’s method, using step size h = 0.1 with the initial value y(-3) = 1.


l <- 0.2
xgap <- 0.05 #05
ygap <- 0.05 #05
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -1:1, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))
df <- Euler(f, -3, 1, 0.1, 100)
p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)



The black curve in the following figure shows the solution path with Euler.

Let’s plot the phase lines for the ODE y’ = y2 – x, solve it with Euler and plot the solution curve again, using the following code.


f <- function(x, y) {
return(y^2-x)
}

rx <- c(-4,4)
ry <- c(-3,3)
rm <- seq(-3,3,0.5)
l <- 0.2
xgap <- 0.1
ygap <- 0.1
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -2:2, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y^2-x$'))
df <- Euler(f, -1, 0, 0.1, 100)
p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)



The black solution curve shows the trajectory of the solution obtained with Euler.

Finally, let’ plot the isoclines for the ODE y’ = y2 – x2 and solve it with Euler. Let’s try starting with different initial values and observe the solution curves in each case.


f <- function(x, y) {
return(y^2-x^2)
}

rx <- c(-4,4)
ry <- c(-4,4)
l <- 0.2
xgap <- 0.1
ygap <- 0.1
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -2:2, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y^2-x^2$, solving with Euler with diff. initial values'))
df <- Euler(f, -3.25, 3, 0.1, 100)
p <- p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)
df <- Euler(f, -1, 1, 0.1, 50)



As can be seen, the solution curve starting with different initial values lead to entirely different solution paths, it depends on the phase lines and isoclines it gets trapped into.

Finally, let’s notice the impact of the step size on the solution obtained by the Euler method, in terms of the error and the number of iterations taken to obtain the solution curve. As can be seen from the following animation, with bigger step size h=0.5, the algorithm converges to the solution path faster, but with much higher error values, particularly for the initial iterations. On the contrary, smaller step size h=0.1 taken many more iterations to converge, but the error rate is much smaller, the solution curve is pretty smooth and follows the isoclines without any zigzag movement.

## Spring-mass-dashpot system – the Damped Harmonic Oscillator

A spring is attached to a wall on its left and a cart on its right. A dashpot is attached to the cart on its left and the wall on its right. The cart has wheels and can roll either left or right.

The spring constant is labeled k, the mass is labeled m, and the damping coefficient of the dashpot is labeled b. The center of the cart is at the rest at the location x equals 0, and we take rightwards to be the positive x direction.

Consider the spring-mass-dashpot system in which mass is m (kg), and spring constant is k (Newton per meter). The position x(t) (meters) of the mass at (seconds) is governed by the DE

where m, b, k > 0. Let’s investigate the effect of the damping constant, on the solutions to the DE.

The next animation shows how the solution changes for different values of 𝑏, given 𝑚=1 and 𝑘=16, with couple of different initial values (c).

Now, let’s observe how the amplitude of the system response increases with the increasing frequency of the input sinusoidal force, achieving the peak at resonance frequency (for m=1, k=16 and b=√14, the resonance occurs at wr = 3)

## Solving the 2nd order ODE for the damped Harmonic Oscillator with Euler method

Now, let’s see how the 2nd order ODE corresponding to the damped Harmonic oscillator system can be numerically solved using the Euler method. In order to be able to use Euler, we need to represent the 2nd order ODE as a collection of 1st order ODEs and then solve those 1st order ODEs simultaneously using Euler, as shown in the following figure:

Let’s implement a numerical solver with Euler for a 2nd order ODE using the following R code snippet:


Euler2 <- function(f, tn, xn, un, h, N, n=0) {
df <- NULL
while (n < N) {
df <- rbind(df, data.frame(n=n, tn=tn, xn=xn, un=un))
tn <- tn + h
xn <- xn + h*un
un <- un + h*f(tn, xn, un)
n <- n + 1
}
return(df)
}



For the damped Harmonic Oscillator, we have the 2nd order ODE

from which we get the following two 1st order ODEs:

Let’s solve the damped harmonic oscillator with Euler method:


f <- function(t, x, u) {
return(-b*u/m - k*x/m)
}

m <- 1
b <- 1
k <- 16
df <- Euler2(f, 0, 1, 0, 0.1, 100)



Now, let’s plot the solution curves using the following code block:


par(mfrow=c(1,2))
plot(df$tn, df$xn, main="x(t) vs. t", xlab='t', ylab='x(t)', col='red', pch=19, cex=0.7,
xaxs="i", yaxs="i")
lines(df$tn, df$xn, col='red')
grid(10, 10)
plot(df$tn, df$un, main=TeX('$\\dot{x}(t)$ vs. t'), xlab='t', ylab=TeX('$\\dot{x}(t)$'), col='blue',
pch=19, cex=0.7, xaxs="i", yaxs="i")
lines(df$tn, df$un, col='blue')
grid(10, 10)



The following animation shows the solution curves obtained by joining the points computed with the Euler method for the damped Harmonic Oscillator:

# Named-Entity-Recognition (NER) on Twitter with Bi-directional LSTM with tensorflow in python

In this article, we shall discuss on how to use a recurrent neural network to solve Named Entity Recognition (NER) problem. NER is a common task in NLP systems. Given a text document, a NER  system aims at extracting the entities (e.g., persons, organizations, locations, etc.) from the text.  Here, a BiLSTM (bi-directional long short term memory) will be used to recognize the named entities from Twitter texts. This problem appears as an assignment in the Coursera course Natural Language Processing by National Research University Higher School of Economics, it’s a part of Advanced Machine Learning Specialization. The problem is taken from the assignment.

### Named Entity Recognition Problem

Let’s try to understand by a few examples. The following figure shows three examples of  Twitter texts from the training corpus that we are going to use, along with the NER tags corresponding to each of the tokens from the texts.

Let’s say we want to extract

• the person names
• the company names
• the location names
• the music artist names
• the tv show names

from the texts. Then a perfect NER model needs to generate the following sequence of tags, as shown in the next figure.

Where B- and I- prefixes stand for the beginning and inside of the entity, while O stands for out of tag or no tag. Markup with the prefix scheme is called BIO markup. This markup is introduced for distinguishing of consequent entities with similar types.

In this article we shall use a recurrent neural network (RNN), particularly, a Bi-Directional Long Short-Term Memory Networks (Bi-LSTMs), to predict the NER tags given the input text tokens. The BiLSTM model is needed to b e trained first, so that it can be used for prediction. The RNN architecture that is used for NER is shown below.

### Bi-directional LSTM

• provides a universal approach for sequence tagging
• several layers can be stacked + linear layers can be added on top
• is trained by cross-entropy loss coming from each position

The corpus to be used here contains tweets with NE tags. Every line of a file contains a pair of a token (word/punctuation symbol) and a tag, separated by a whitespace. Different tweets are separated by an empty line.

The function read_data reads a corpus from the file_path and returns two lists: one with tokens and one with the corresponding tags. A user’s nickname in a tweet needs to be replaced by the <USR> token and any URL with the<URL> token (assuming that a URL and a nickname are just strings which start with http:// or https:// in case of URLs and a @ symbol for nicknames).

import re
tokens = []
tags = []

tweet_tokens = []
tweet_tags = []
for line in open(file_path, encoding='utf-8'):
line = line.strip()
if not line:
if tweet_tokens:
tokens.append(tweet_tokens)
tags.append(tweet_tags)
tweet_tokens = []
tweet_tags = []
else:
token, tag = line.split() # Replace all urls with  token
# Replace all users with  token
token = re.sub('http[s]*://.*', '', token.lower())
token = re.sub('@.*', '', token)
tweet_tokens.append(token)
tweet_tags.append(tag)



And now we can load three separate parts of the dataset:

• train data for training the model;
• validation data for evaluation and hyperparameters tuning;
• test data for final evaluation of the model.
train_tokens, train_tags = read_data('data/train.txt') validation_tokens, validation_tags = read_data('data/validation.txt') test_tokens, test_tags = read_data('data/test.txt')


We should always understand what kind of data you deal with. For this purpose, let’s print the data running the following cell:

for i in range(3):
for token, tag in zip(train_tokens[i], train_tags[i]):
print('%s\t%s' % (token, tag))
print()

# rt	O
#  O
# :	O
# online	O
# ticket	O
# sales	O
# for	O
# ghostland	B-musicartist
# observatory	I-musicartist
# extended	O
# until	O
# 6	O
# pm	O
# est	O
# due	O
# to	O
# high	O
# demand	O
# .	O
# get	O
# them	O
# before	O
# they	O
# sell	O
# out	O
# ...	O
#
# apple	B-product
# macbook	I-product
# pro	I-product
# a1278	I-product
# 13.3	I-product
# "	I-product
# laptop	I-product
# -	I-product
# md101ll/a	I-product
# (	O
# june	O
# ,	O
# 2012	O
# )	O
# -	O
# full	O
# by	O
# ebay	B-company
#  O
#  O
#
# happy	O
# birthday	O
#  O
# !	O
# may	O
# allah	B-person
# s.w.t	O
# bless	O
# you	O
# with	O
# goodness	O
# and	O
# happiness	O
# .	O


### Prepare dictionaries

To train a neural network, we will use two mappings:

• {token} {token id}: address the row in embeddings matrix for the current token;
• {tag} {tag id}: one-hot-encoding ground truth probability distribution vectors for computing the loss at the output of the network.

Now let’s implement the function build_dict which will return {token or tag} {index} and vice versa.


from collections import defaultdict
def build_dict(tokens_or_tags, special_tokens):
"""
tokens_or_tags: a list of lists of tokens or tags
special_tokens: some special tokens
"""
# Create a dictionary with default value 0
tok2idx = defaultdict(lambda: 0)
idx2tok = []

# Create mappings from tokens (or tags) to indices and vice versa.
# At first, add special tokens (or tags) to the dictionaries.
# The first special token must have index 0.

# Mapping tok2idx should contain each token or tag only once.
# To do so, you should:
# 1. extract unique tokens/tags from the tokens_or_tags variable, which is not
#    occur in special_tokens (because they could have non-empty intersection)
# 2. index them (for example, you can add them into the list idx2tok
# 3. for each token/tag save the index into tok2idx).

idx = 0
for tok in special_tokens:
tok2idx[tok] = idx
idx2tok.append(tok)
idx += 1
tokens_or_tags = list(set([item for sublist in tokens_or_tags for item in sublist]) - set(special_tokens))
#for idx, tok in enumerate(tokens_or_tags):
for tok in tokens_or_tags:
tok2idx[tok] = idx
idx2tok.append(tok)
idx += 1



After implementing the function build_dict we can create dictionaries for tokens and tags. Special tokens in our case will be:

• <UNK> token for out of vocabulary tokens;
• <PAD> token for padding sentence to the same length when we create batches of sentences.

special_tokens = ['', '']
special_tags = ['O']

# Create dictionaries
token2idx, idx2token = build_dict(train_tokens + validation_tokens, special_tokens)
tag2idx, idx2tag = build_dict(train_tags, special_tags)


We can see from the below output that there are 21 tags for the named entities in the corpus.
len(idx2tag)
# 21
idx2tag
# ['O'
#  'B-geo-loc',
#  'I-geo-loc',
#  'B-movie',
#  'I-other',
#  'I-musicartist',
#  'I-company',
#  'I-movie',
#  'B-facility',
#  'B-product',
#  'I-tvshow',
#  'B-company',
#  'I-person',
#  'B-sportsteam',
#  'I-sportsteam',
#  'B-tvshow',
#  'B-person',
#  'B-musicartist',
#  'B-other',
#  'I-facility',
#  'I-product']



The next additional functions will be helpful for creating the mapping between tokens and ids for a sentence.


def words2idxs(tokens_list):
return [token2idx[word] for word in tokens_list]

def tags2idxs(tags_list):
return [tag2idx[tag] for tag in tags_list]

def idxs2words(idxs):
return [idx2token[idx] for idx in idxs]

def idxs2tags(idxs):
return [idx2tag[idx] for idx in idxs]



### Generate batches

Neural Networks are usually trained with batches. It means that weight updates of the network are based on several sequences at every single time. The tricky part is that all sequences within a batch need to have the same length. So we will pad them with a special <PAD> token. It is also a good practice to provide RNN with sequence lengths, so it can skip computations for padding parts. The following python generator function batches_generator can be used to generate batches.


def batches_generator(batch_size, tokens, tags,
shuffle=True, allow_smaller_last_batch=True):
"""Generates padded batches of tokens and tags."""

n_samples = len(tokens)
if shuffle:
order = np.random.permutation(n_samples)
else:
order = np.arange(n_samples)

n_batches = n_samples // batch_size
if allow_smaller_last_batch and n_samples % batch_size:
n_batches += 1

for k in range(n_batches):
batch_start = k * batch_size
batch_end = min((k + 1) * batch_size, n_samples)
current_batch_size = batch_end - batch_start
x_list = []
y_list = []
max_len_token = 0
for idx in order[batch_start: batch_end]:
x_list.append(words2idxs(tokens[idx]))
y_list.append(tags2idxs(tags[idx]))
max_len_token = max(max_len_token, len(tags[idx]))

# Fill in the data into numpy nd-arrays filled with padding indices.
x = np.ones([current_batch_size, max_len_token], dtype=np.int32) * token2idx['&amp;amp;amp;amp;lt;PAD&amp;amp;amp;amp;gt;']
y = np.ones([current_batch_size, max_len_token], dtype=np.int32) * tag2idx['O']
lengths = np.zeros(current_batch_size, dtype=np.int32)
for n in range(current_batch_size):
utt_len = len(x_list[n])
x[n, :utt_len] = x_list[n]
lengths[n] = utt_len
y[n, :utt_len] = y_list[n]
yield x, y, lengths



## Build a recurrent neural network

This is the most important part where we shall specify the network architecture based on TensorFlow building blocks. We shall create an LSTM network which will produce probability distribution over tags for each token in a sentence. To take into account both right and left contexts of the token, we will use Bi-Directional LSTM (Bi-LSTM). Dense layer will be used on top to perform tag classification.


import tensorflow as tf
import numpy as np
tf.compat.v1.disable_eager_execution()
class BiLSTMModel():
pass



First, we need to create placeholders to specify what data we are going to feed into the network during the execution time. For this task we will need the following placeholders:

• input_batch — sequences of words (the shape equals to [batch_size, sequence_len]);
• ground_truth_tags — sequences of tags (the shape equals to [batch_size, sequence_len]);
• lengths — lengths of not padded sequences (the shape equals to [batch_size]);
• dropout_ph — dropout keep probability; this placeholder has a predefined value 1;
• learning_rate_ph — learning rate; we need this placeholder because we want to change the value during training.

It could be noticed that we use None in the shapes in the declaration, which means that data of any size can be fed.


def declare_placeholders(self):
"""Specifies placeholders for the model."""

# Placeholders for input and ground truth output.
self.input_batch = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None, None], name='input_batch')
self.ground_truth_tags = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None, None], name='ground_truth_tags')

# Placeholder for lengths of the sequences.
self.lengths = tf.compat.v1.placeholder(dtype=tf.int32, shape=[None], name='lengths')

# Placeholder for a dropout keep probability. If we don't feed
# a value for this placeholder, it will be equal to 1.0.
self.dropout_ph = tf.compat.v1.placeholder_with_default(tf.cast(1.0, tf.float32), shape=[])

# Placeholder for a learning rate (tf.float32).
self.learning_rate_ph = tf.compat.v1.placeholder(dtype=tf.float32, shape=[])

BiLSTMModel.__declare_placeholders = classmethod(declare_placeholders)



Now, let us specify the layers of the neural network. First, we need to perform some preparatory steps:

• Create embeddings matrix with tf.Variable. Specify its name (embeddings_matrix), type (tf.float32), and initialize with random values.
• Create forward and backward LSTM cells. TensorFlow provides a number of RNN cells ready for you. We suggest that you use LSTMCell, but we can also experiment with other types, e.g. GRU cells. This blogpost could be interesting if you want to learn more about the differences.
• Wrap cells with DropoutWrapper. Dropout is an important regularization technique for neural networks. Specify all keep probabilities using the dropout placeholder that we created before.

After that, we can build the computation graph that transforms an input_batch:

• Look up embeddings for an input_batch in the prepared embedding_matrix.
• Pass the embeddings through Bidirectional Dynamic RNN with the specified forward and backward cells. Use the lengths placeholder here to avoid computations for padding tokens inside the RNN.
• Create a dense layer on top. Its output will be used directly in loss function.


def build_layers(self, vocabulary_size, embedding_dim, n_hidden_rnn, n_tags):
"""Specifies bi-LSTM architecture and computes logits for inputs."""

# Create embedding variable (tf.Variable) with dtype tf.float32
initial_embedding_matrix = np.random.randn(vocabulary_size, embedding_dim) / np.sqrt(embedding_dim)
embedding_matrix_variable = tf.Variable(initial_value=initial_embedding_matrix, dtype=tf.float32) ######### YOUR CODE HERE #############

# Create RNN cells (for example, tf.nn.rnn_cell.BasicLSTMCell) with n_hidden_rnn number of units
# and dropout (tf.nn.rnn_cell.DropoutWrapper), initializing all *_keep_prob with dropout placeholder.
forward_cell = tf.compat.v1.nn.rnn_cell.DropoutWrapper(tf.compat.v1.nn.rnn_cell.LSTMCell(n_hidden_rnn), self.dropout_ph, self.dropout_ph) ######### YOUR CODE HERE #############
backward_cell = tf.compat.v1.nn.rnn_cell.DropoutWrapper(tf.compat.v1.nn.rnn_cell.LSTMCell(n_hidden_rnn), self.dropout_ph, self.dropout_ph) ######### YOUR CODE HERE #############

# Look up embeddings for self.input_batch (tf.nn.embedding_lookup).
# Shape: [batch_size, sequence_len, embedding_dim].
embeddings = tf.nn.embedding_lookup(embedding_matrix_variable, self.input_batch) ######### YOUR CODE HERE #############

# Pass them through Bidirectional Dynamic RNN (tf.nn.bidirectional_dynamic_rnn).
# Shape: [batch_size, sequence_len, 2 * n_hidden_rnn].
# Also don't forget to initialize sequence_length as self.lengths and dtype as tf.float32.
(rnn_output_fw, rnn_output_bw), _ = tf.compat.v1.nn.bidirectional_dynamic_rnn(forward_cell, backward_cell, embeddings, sequence_length=self.lengths, dtype=tf.float32) ######### YOUR CODE HERE #############
rnn_output = tf.concat([rnn_output_fw, rnn_output_bw], axis=2)

# Dense layer on top.
# Shape: [batch_size, sequence_len, n_tags].
self.logits = tf.compat.v1.layers.dense(rnn_output, n_tags, activation=None)

BiLSTMModel.__build_layers = classmethod(build_layers)


To compute the actual predictions of the neural network, we need to apply softmax to the last layer and find the most probable tags with argmax.


def compute_predictions(self):
"""Transforms logits to probabilities and finds the most probable tags."""

# Create softmax (tf.nn.softmax) function
self.softmax_output = tf.nn.softmax(self.logits)

# Use argmax (tf.argmax) to get the most probable tags
# Don't forget to set axis=-1
# otherwise argmax will be calculated in a wrong way
self.predictions = tf.argmax(self.softmax_output, axis=-1)



During training we do not need predictions of the network, but we need a loss function. We will use cross-entropy loss, efficiently implemented in TF as cross entropy with logits. Note that it should be applied to logits of the model (not to softmax probabilities!). Also note, that we do not want to take into account loss terms coming from <PAD> tokens. So we need to mask them out, before computing mean.


"""Computes masked cross-entopy loss with logits."""

# Create cross entropy function function (tf.nn.softmax_cross_entropy_with_logits_v2)
ground_truth_tags_one_hot = tf.one_hot(self.ground_truth_tags, n_tags)
loss_tensor = tf.nn.softmax_cross_entropy_with_logits(ground_truth_tags_one_hot, self.logits)

# Create loss function which doesn't operate with &amp;amp;lt;PAD&amp;amp;gt; tokens (tf.reduce_mean)
# Be careful that the argument of tf.reduce_mean should be
# multiplication of mask and loss_tensor.

BiLSTMModel.__compute_loss = classmethod(compute_loss)



The last thing to specify is how we want to optimize the loss. Let’s use Adam optimizer with a learning rate from the corresponding placeholder. We shall also need to apply clipping to eliminate exploding gradients. It can be easily done with the function clip_by_norm.


def perform_optimization(self):
"""Specifies the optimizer and train_op for the model."""

# Pay attention that you need to apply this operation only for gradients
# because self.grads_and_vars also contains variables.
# list comprehension might be useful in this case.
clip_norm = tf.cast(1.0, tf.float32)

BiLSTMModel.__perform_optimization = classmethod(perform_optimization)



Now we have specified all the parts of your network. It can be noticed, that we didn’t deal with any real data yet, so what we have written is just recipes on how the network should function. Now we will put them to the constructor of our Bi-LSTM class to use it in the next section.


def init_model(self, vocabulary_size, n_tags, embedding_dim, n_hidden_rnn, PAD_index):
self.__declare_placeholders()
self.__build_layers(vocabulary_size, embedding_dim, n_hidden_rnn, n_tags)
self.__compute_predictions()
self.__perform_optimization()

BiLSTMModel.__init__ = classmethod(init_model)



## Train the network and predict tags

Session.run is a point which initiates computations in the graph that we have defined. To train the network, we need to compute self.train_op, which was declared in perform_optimization. To predict tags, we just need to compute self.predictions. Anyway, we need to feed actual data through the placeholders that we defined before.


def train_on_batch(self, session, x_batch, y_batch, lengths, learning_rate, dropout_keep_probability):
feed_dict = {self.input_batch: x_batch,
self.ground_truth_tags: y_batch,
self.learning_rate_ph: learning_rate,
self.dropout_ph: dropout_keep_probability,
self.lengths: lengths}

session.run(self.train_op, feed_dict=feed_dict)

BiLSTMModel.train_on_batch = classmethod(train_on_batch)



Let’s implement the function predict_for_batch by initializing feed_dict with input
x_batch and lengths and running the session for self.predictions.


def predict_for_batch(self, session, x_batch, lengths):

feed_dict = {self.input_batch: x_batch,
self.lengths: lengths}

predictions = session.run(self.predictions, feed_dict=feed_dict)
softmax_output = session.run(self.softmax_output, feed_dict=feed_dict)

return predictions, softmax_output

BiLSTMModel.predict_for_batch = classmethod(predict_for_batch)



We finished with necessary methods of our BiLSTMModel model and almost ready to start experimenting.

### Evaluation

To simplify the evaluation process let’s use the two functions:

• predict_tags: uses a model to get predictions and transforms indices to tokens and tags;
• eval_conll: calculates precision, recall and F1 for the results.
• The function precision_recall_f1() is implemented / used to compute these metrics with training and validation data.

def predict_tags(model, session, token_idxs_batch, lengths):
"""Performs predictions and transforms indices to tokens and tags."""

tag_idxs_batch, softmax_batch = model.predict_for_batch(session, token_idxs_batch, lengths)

tags_batch, tokens_batch, probs_batch = [], [], []
for tag_idxs, token_idxs, softmax_probs in zip(tag_idxs_batch, token_idxs_batch, softmax_batch):
tags, tokens, probs = [], [], []
for tag_idx, token_idx, softmax_prob in zip(tag_idxs, token_idxs, softmax_probs):
tags.append(idx2tag[tag_idx])
tokens.append(idx2token[token_idx])
probs.append(softmax_prob)
tags_batch.append(tags)
tokens_batch.append(tokens)
probs_batch.append(probs)
return tags_batch, tokens_batch, probs_batch

def eval_conll(model, session, tokens, tags, short_report=True):
"""Computes NER quality measures using CONLL shared task script."""

y_true, y_pred = [], []
for x_batch, y_batch, lengths in batches_generator(1, tokens, tags):
tags_batch, tokens_batch, probs_batch = predict_tags(model, session, x_batch, lengths)
if len(x_batch[0]) != len(tags_batch[0]):
raise Exception("Incorrect length of prediction for the input, "
"expected length: %i, got: %i" % (len(x_batch[0]), len(tags_batch[0])))
predicted_tags = []
ground_truth_tags = []
for gt_tag_idx, pred_tag, token in zip(y_batch[0], tags_batch[0], tokens_batch[0]):
ground_truth_tags.append(idx2tag[gt_tag_idx])
predicted_tags.append(pred_tag)

# We extend every prediction and ground truth sequence with 'O' tag
# to indicate a possible end of entity.
y_true.extend(ground_truth_tags + ['O'])
y_pred.extend(predicted_tags + ['O'])

results = precision_recall_f1(y_true, y_pred, print_results=True, short_report=short_report)
return results



## Run the experiment

Create BiLSTMModel model with the following parameters:

• vocabulary_size — number of tokens;
• n_tags — number of tags;
• embedding_dim — dimension of embeddings, recommended value: 200;
• n_hidden_rnn — size of hidden layers for RNN, recommended value: 200;
• PAD_index — an index of the padding token (<PAD>).

Set hyperparameters. We might want to start with the following recommended values:

• batch_size: 32;
• 4 epochs;
• starting value of learning_rate: 0.005
• learning_rate_decay: a square root of 2;
• dropout_keep_probability: try several values: 0.1, 0.5, 0.9.

However, we can conduct more experiments to tune hyperparameters, to obtain higher accuracy on the held-out validation dataset.


from tensorflow.python.framework import ops
ops.reset_default_graph()
#tf.reset_default_graph()

batch_size = 32
n_epochs = 5
learning_rate = 0.005
learning_rate_decay = np.sqrt(2)
dropout_keep_probability = 0.75 #0.5



Finally, we are ready to run the training!


from tensorflow.python.framework import ops
ops.reset_default_graph()

batch_size = 32
n_epochs = 5
learning_rate = 0.005
learning_rate_decay = np.sqrt(2)
dropout_keep_probability = 0.75 #0.5

# -------------------- Epoch 1 of 5 --------------------

# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 67542 phrases; correct: 134.
# precision:  0.20%; recall:  2.99%; F1:  0.37

# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 8340 phrases; correct: 22.
# precision:  0.26%; recall:  4.10%; F1:  0.50

# -------------------- Epoch 2 of 5 -------------------

# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 2448 phrases; correct: 873.
# precision:  35.66%; recall:  19.45%; F1:  25.17

# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 211 phrases; correct: 78.
# precision:  36.97%; recall:  14.53%; F1:  20.86

# -------------------- Epoch 3 of 5 --------------------

# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4915 phrases; correct: 2518.
# precision:  51.23%; recall:  56.09%; F1:  53.55

# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 447 phrases; correct: 175.
# precision:  39.15%; recall:  32.59%; F1:  35.57

# -------------------- Epoch 4 of 5 --------------------

# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4742 phrases; correct: 3559.
# precision:  75.05%; recall:  79.28%; F1:  77.11

# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 442 phrases; correct: 183.
# precision:  41.40%; recall:  34.08%; F1:  37.39

# -------------------- Epoch 5 of 5 --------------------

# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4683 phrases; correct: 3984.
# precision:  85.07%; recall:  88.75%; F1:  86.87

# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 463 phrases; correct: 190.
# precision:  41.04%; recall:  35.38%; F1:  38.00

# ...training finished.

The following figure shows how the precision / recall and F1 score changes on the training and validation datasets while training, for each of the entities.

Now let us see full quality reports for the final model on train, validation, and test sets.


print('-' * 20 + ' Train set quality: ' + '-' * 20)
train_results = eval_conll(model, sess, train_tokens, train_tags, short_report=False)

print('-' * 20 + ' Validation set quality: ' + '-' * 20)
validation_results = eval_conll(model, sess, validation_tokens, validation_tags, short_report=False)

print('-' * 20 + ' Test set quality: ' + '-' * 20)
test_results = eval_conll(model, sess, test_tokens, test_tags, short_report=False)

# Start training...
#
# -------------------- Epoch 1 of 5 --------------------
# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 67542 phrases; correct: 134.
#
# precision:  0.20%; recall:  2.99%; F1:  0.37
#
# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 8340 phrases; correct: 22.
#
# precision:  0.26%; recall:  4.10%; F1:  0.50
#
# -------------------- Epoch 2 of 5 --------------------
# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 2448 phrases; correct: 873.
#
# precision:  35.66%; recall:  19.45%; F1:  25.17
#
# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 211 phrases; correct: 78.
#
# precision:  36.97%; recall:  14.53%; F1:  20.86
#
# -------------------- Epoch 3 of 5 --------------------
# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4915 phrases; correct: 2518.
#
# precision:  51.23%; recall:  56.09%; F1:  53.55
#
# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 447 phrases; correct: 175.
#
# precision:  39.15%; recall:  32.59%; F1:  35.57
#
# -------------------- Epoch 4 of 5 --------------------
# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4742 phrases; correct: 3559.
#
# precision:  75.05%; recall:  79.28%; F1:  77.11
#
# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 442 phrases; correct: 183.
#
# precision:  41.40%; recall:  34.08%; F1:  37.39
#
# -------------------- Epoch 5 of 5 --------------------
# Train data evaluation:
# processed 105778 tokens with 4489 phrases; found: 4683 phrases; correct: 3984.
#
# precision:  85.07%; recall:  88.75%; F1:  86.87
#
# Validation data evaluation:
# processed 12836 tokens with 537 phrases; found: 463 phrases; correct: 190.
#
# precision:  41.04%; recall:  35.38%; F1:  38.00
#
# ...training finished.



The following animation shows the softmax probabilities corresponding to the NER tags predicted by the BiLSTM model on the tweets from the test corpus.

Also, visualized in another way, the following two animations show the tweet text, the probabilities for different NE tags, predicted by the BiLSTM NER model and the ground-truth tag.

### Conclusions

Nowadays, Bi-LSTM is one of the state of the art approaches for solving NER problem and it outperforms other classical methods. Despite the fact that we used small training corpora (in comparison with usual sizes of corpora in Deep Learning), our results are quite good. In addition, in this task there are many possible named entities and for some of them we have only several dozens of trainig examples, which is definately small. However, the implemented model outperforms classical CRFs for this task. Even better results could be obtained by some combinations of several types of methods, e.g. (refer to this paper).

# Simulating a Turing Machine with Python and executing programs on it

In this article, we shall implement a basic version of a Turing Machine in python and write a few simple programs to execute them on the Turing machine.  This article is inspired by the edX / MITx course Paradox and Infinity and few of the programs to be executed on the (simulated) Turing machine are taken from the course. Also, some programs are from this Cambridge tutorial.

### A few Definitions

Let’s start by defining a Turing machine (originally invented by Alan Turing) formally, as is shown in the following figure:

In this article, we shall assume that the tape alphabet can contain only the symbols  {0,1,⊔}, the head on the Turing machine moves in left (L) / right (R) direction (D) or doesn’t move (*), upon reading a symbol on the tape (i.,e., D can be one of {L,R,*}). Also, given a program along with a valid input, the Turing machine just halts (goes to the state halt) after writing the desired output on its tape.

### The python implementation for simulating a Turing Machine

The following python code shows a very simple implementation of a Turing machine.  It accepts a program as a string, with each transition function defined on a new line. The state halt is denoted by H.


N = 1000 # tape length, initialize to a large value

class TuringMachine:

'''
initialize the Turing Machine, read the program and input
'''
def __init__(self, program, input, state=0):
self.trf = {}
self.state = str(state)
self.tape = ''.join(['_']*N)
self.head = N // 2   # head is positioned in the middle
for line in program.splitlines():
s, a, r, d, s1 = line.split(' ')
self.trf[s,a] = (r, d, s1)

'''
step through a program
'''
def step(self):
if self.state != 'H':
action = self.trf.get((self.state, a))
if action:
r, d, s1 = action
if d != '*':
self.state = s1
print(self.tape.replace('_', ''), self.state)

'''
run a program
'''
def run(self, max_iter=9999):
iter = 0
while self.state != 'H' and iter < max_iter: # prevent infinite loop
self.step()
iter += 1
print(self.tape.replace('_', ''), self.state)



In the next section we shall show how a program on this Turing Machine will look like and how to run such a program, by instantiating the above class. We shall demonstrate with a few programs.

### 1. Binary Addition with Turing Machine

The following figure shows how to perform binary addition with a Turing machine, where the binary numbers to be added are input to the Turing machine and are separated by a single blank. The TM header is assumed to be positioned at the leftmost position of the first number, in the very beginning.

The program is loaded as a text file (program.txt) that looks like the following (here each line represents a transition function of the form 𝛿(𝑝,𝑋)=(𝑞,𝑌,D), where the 5 tuples are strictly in the order p, X, Y, D, q (the character _ represents a blank symbol on the tape):

0 0 0 r 0
0 1 1 r 0
0 _ _ r 1
1 0 0 r 1
1 1 1 r 1
1 _ _ l 2
2 0 1 l 2
2 1 0 l 3
2 _ _ r 5
3 0 0 l 3
3 1 1 l 3
3 _ _ l 4
4 0 1 r 0
4 1 0 l 4
4 _ 1 r 0
5 1 _ r 5
5 _ _ * H

Given the two binary numbers, the Turing Machine

• uses the second number as a counter
• decrements the second number by one
• increments the first number by one

till the second number becomes 0.

The following code shows how the above program can be run to add two input binary numbers 1101 (decimal 13) and 101 (decimal 5) to output the binary number 10010 (decimal 18). The final state of the machine is H (halt), as expected.

input = '1101_101'
tm = TuringMachine(program, input)
tm.run()
# 10010 H


The following animation shows how the binary numbers are added using the TM simulator.

### 2. Converting a Binary to a Unary number with TM

Let’s assume the TM tape contains a binary representation of a number n (n> 0) as a sequence of zeros and ones, and is otherwise blank. The reader positioned at the left-most member of the sequence. The following figure shows the program that replaces the original sequence with a sequence of n ones, where the original sequence names n in binary notation:

The following animation shows the result of running the program on the TM simulator, with the input 1010, the output obtained is a sequence of 10 ones.

### 3. Converting a Unary to a Binary number with TM

Let’s assume the TM tape contains a sequence of n ones (n> 0) and is otherwise blank. The reader positioned at the left-most member of the sequence. The following TM  program replaces the sequence of n ones, with a sequence of zeroes and ones that names n in binary notation:

The following animation shows the result of running the program on the TM simulator, with the input 11111111111 (eleven ones), the output obtained is the binary representation of 11.

### 4. Doubling the length of a sequence with TM

The following figure shows the program to be run to double the length of a sequence of ones, input to the TM

The following animation shows the result of running the program on the TM simulator, starting with five ones as input, obtaining ten ones on the tape as output.

### 5. Simulating the 4-state Busy Beaver with TM

The following figure shows the program to be run

The following animation shows the result of running the program on the TM simulator, as expected it outputs 13 ones on the tapes and halts in 107 steps.

### 6. Detecting a Palindrome with TM

The following program checks a string of symbols on the tape and returns a 1 if it is a palindrome and a 0 if it is not.

The following animation shows the result of running the program on the TM simulator with a palindrome.

The following animation shows the result of running the program on the TM simulator with a non-palindrome.

# Fourier Series and Differential Equations with some applications in R and Python

In this article, a few applications of Fourier Series in solving differential equations will be described. All the problems are taken from the edx Course: MITx – 18.03Fx: Differential Equations Fourier Series and Partial Differential Equations.

First a basic introduction to the Fourier series will be given and then we shall see how to solve the following ODEs / PDEs using Fourier series:

1. Find the steady state solution to un-damped / damped systems with pure / near resonance.
2. Solve the 4th order differential equation for beam bending system with boundary values, using theoretical and numeric techniques.
3. Solve the PDE for Heat / Diffusion using Neumann / Dirichlet boundary conditions
4. Solve Wave equation (PDE).
5. Remove noise from an audio file with Fourier transform.

## Some basics

Let f(t) be a periodic function with period L. Then the Fourier series of f(t)f(t) is given by the following superposition

The above formulae can be simplified as below, for even and odd functions

A few popular 2π periodic functions and their Fourier series are shown below:

### Computing the coefficients in R

Let’s first plot the square wave function using the following R code


library(ggplot2)
library(latex2exp)

Sq = function(t) {
ifelse(t > 0, ifelse(as.integer(t / pi) %% 2 == 0, 1, -1),
ifelse(as.integer(t / pi) %% 2 == 1, 1, -1))
}

t = seq(-3*pi, 3*pi, 0.01)
ggplot() + geom_line(aes(t, Sq(t)), size=1.5, col='red') + ylab('f(t)') +
scale_x_continuous(breaks=seq(-3*pi, 3*pi, pi),
labels=c(TeX('$-3\\pi$'),TeX('$-2\\pi$'),TeX('$-\\pi$'),0,TeX('$\\pi$'),TeX('$2\\pi$'),TeX('$3\\pi$'))) +
scale_y_continuous(breaks=c(-1,0,1), labels=c('-1', '0', '1')) +
ggtitle(TeX(paste('Square wave of period $2\\pi$'))) +
theme(plot.title = element_text(hjust = 0.5))




for (n in 1:6) {
b_n = round(2/pi * integrate(function(t) sin(n*t), 0, pi)$value, 15) print(paste(b_n, ifelse(n %% 2 == 0, 0, 4/n/pi))) } #[1] "1.27323954473516 1.27323954473516" #[1] "0 0" #[1] "0.424413181578388 0.424413181578388" #[1] "0 0" #[1] "0.254647908947032 0.254647908947033" #[1] "0 0"  R symbolic computation package rSymPy can be used to compute the Fourier coefficients as shown below: library(rSymPy) t = Var("t") n = Var("n") sympy("integrate(2*1*sin(n*t)/pi,(t,0,pi))") # sq [1] "-2*cos(pi*n)/(pi*n) + 2/(pi*n)"  The next animation shown how the first few terms in the Fourier series approximates the periodic square wave function. Notice from the above animation that the convergence of the Fourier series with the original periodic function is very slow near discontinuities (the square wave has discontinuities at points t=kπ, where kZ), which is known as Gibbs phenomenon. ### Computing the Fourier Coefficients of the Triangle Wave using Anti-derivatives Let’s aim at computing the Fourier coefficients for the 2π-periodic triangle wave by using the coefficients of the 2π-periodic square wave, using the anti-derivative property. First let’s write a few lines of code to plot the triangle wave. Tr = function(t) { i = as.integer(t / pi) ifelse(t > 0, ifelse(i %% 2 == 0, t - i*pi, -t + (i+1)*pi), ifelse(i %% 2 == 0, -t + i*pi, t - (i-1)*pi)) } t = seq(-3*pi, 3*pi, 0.01) ggplot() + geom_line(aes(t, Tr(t)), size=1.5, col='red') + ylab('f(t)') + scale_x_continuous(breaks=seq(-3*pi, 3*pi, pi), labels=c(TeX('$-3\\pi$'),TeX('$-2\\pi$'),TeX('$-\\pi$'),0,TeX('$\\pi$'),TeX('$2\\pi$'),TeX('$3\\pi$'))) + scale_y_continuous(breaks=c(-pi,0,pi), labels=c(TeX('$-\\pi$'), '0', TeX('$\\pi$'))) + ggtitle(TeX(paste('Triangle wave of period$2\\pi$'))) + theme(plot.title = element_text(hjust = 0.5))  ### computing the constant term a_0 = 1/pi*integrate(t, 0, pi)$value
print(paste(a_0, pi/2))
#[1] "1.5707963267949 1.5707963267949"

The next animation shows how the superposition of first few terms in the Fourier series approximates the triangle wave function.

### Computing the Fourier Coefficients of the Sawtooth function

Similarly, let’s visualize the sawtooth wave function using the following R code snippet:

St = function(t) {
i = floor((t + pi) / (2*pi))
t - i*2*pi
}

t = seq(-3*pi, 3*pi, 0.01)
ggplot() +
geom_line(aes(t, St(t)), size=1.5, col='red') +
scale_x_continuous(breaks=seq(-3*pi, 3*pi, pi),
labels=c(TeX('$-3\\pi$'),TeX('$-2\\pi$'),TeX('$-\\pi$'),0,TeX('$\\pi$'),TeX('$2\\pi$'),TeX('$3\\pi$'))) +
scale_y_continuous(breaks=c(-pi,0,pi), labels=c(TeX('-$\\pi$'), '0', TeX('$\\pi$'))) +
ylab('f(t)') +
ggtitle(TeX(paste('Sawtooth wave of period $2\\pi$'))) +
theme(plot.title = element_text(hjust = 0.5))


The following animation shows how the superposition of first few terms in the Fourier series of the Sawtooth wave approximate the function.

## Computing the Fourier series for the even periodic extension of a function

Let’s first plot the even periodic extension of the function f(t) = t2, -1 ≤ t ≤ 1, with the following R code.


Pr = function(t) {
i = as.integer(abs(t-ifelse(t&gt;0,-1,1)) / 2)
ifelse(t &gt; 0, (t-2*i)^2, (t+2*i)^2)
}

t = seq(-5, 5, 0.01)
ggplot() +
geom_line(aes(t, Pr(t)), size=1.5, col='red') +
scale_x_continuous(breaks=-5:5, labels=-5:5) +
ylab('f(t)') +
ggtitle(TeX(paste('Even periodic extension of the function $f(t) = t^2$,
$-1\\leq t \\leq 1$'))) +
theme(plot.title = element_text(hjust = 0.5))



Here is another function which is neither even nor odd, with period 2π

The following animation shows how the Fourier series of the function approximates the above function more closely as more and more terms are added.

## Solution of ODE with ERF and Resonance

Let’s solve the following differential equation (for a system without damping) and find out when the pure resonance takes place.

The next section shows how to find the largest gain corresponding to the following system with damping.

Let’s plot the amplitudes of the terms to see which term has the largest gain.
n = seq(1,15,2)
f = (1/sqrt((49-n^2)^2 + (0.1*n)^2)/n)
ggplot() + geom_point(aes(n, f))  +
geom_line(aes(n, f)) + ylab('c_n') +
scale_x_continuous(breaks=seq(1,15,2),
labels=as.character(seq(1,15,2)))


## Boundary Value Problems

As expected, since the right end of the beam is free, it will bend down when more loads are applied on the beam to the transverse direction of its length. The next animation shows how the beam bending varies with the value of q.

### Numerically solving a linear system to obtain the solution of the beam-bending system represented by the 4th-order differential equation in R

First create a near-tri-diagonal matrix A that looks like the following one, it takes care of the differential coefficients of the beam equation along with all the boundary value conditions.

Now, let’s solve the linear system using the following R code.

#Create a vector b that is zero for boundary conditions and -0.0000001 in every other entry.
b = rep(1,10)*(-1e-7)
# Create a vector v that solves Av = b.
v = solve(A, b)
#Create a column vector x of 10 evenly spaced points between 0 and 1 (for plotting)
x = seq(0,1,1/9)
#Plot v on the vertical axis, and x on the horizontal axis.
print(ggplot() + geom_line(aes(x,v), col='blue', size=2) + geom_point(aes(x, v), size=2) +
ggtitle(paste('Solution of the linear (beam bending) system')) + theme_bw())


As can be seen from the above figure, the output of the numerical solution agrees with the one obtained above theoretically.

## Heat / Diffusion Equation

The following animation shows how the temperature changes on the bar with time (considering only the first 100 terms for the Fourier series for the square wave).

Let’s solve the below diffusion PDE with the given Neumann BCs.

As can be seen from above, the initial condition can be represented as a 2-periodic triangle wave function (using even periodic extension), i.e.,

The next animation shows how the different points on the tube arrive at the steady-state solution over time.

The next figure shows the time taken for each point inside the tube, to approach within 1% the steady-state solution.

with the following

• L=5, the length of the bar is 5 units.
• Dirichlet BCs: u(0, t) =0, u(L,t) = sin(2πt/L), i.e., the left end of the bar is held at a constant temperature 0 degree (at ice bath) and the right end changes temperature in a sinusoidal manner.
• IC: u(x,0) = 0, i.e., the entire bar has temperature 0 degree.

The following R code implements the numerical method:


#Solve heat equation using forward time, centered space scheme

#Define simulation parameters
x = seq(0,5,5/19) #Spatial grid
dt = 0.06         #Time step
tMax = 20         #Simulation time
nu = 0.5          #Constant of proportionality

t = seq(0, tMax, dt) #Time vector

n = length(x)
m = length(t)

fLeft = function(t) 0                             #Left boundary condition
fRight = function(t) sin(2*pi*t/5)                #Right boundary condition
fInitial = function(x) matrix(rep(0,m*n), nrow=m) #Initial condition

#Run simulation
dx = x[2]-x[1]
r = nu*dt/dx^2

#Create tri-diagonal matrix A with entries 1-2r and r


The tri-diagonal matrix A is shown in the following figure


#Impose inital conditions
u = fInitial(x)
u[1,1] = fLeft(0)
u[1,n] = fRight(0)

for (i in 1:(length(t)-1)) {

u[i+1,] = A%*%(u[i,])   # Find solution at next time step
u[i+1,1] = fLeft(t[i])  # Impose left B.C.
u[i+1,n] = fRight(t[i]) # Impose right B.C.
}



The following animation shows the solution obtained to the heat equation using the numerical method (in R) described above,

## Wave Equation

The next figure shows how can a numerical method be used to solve the wave PDE

## Implementation in R


#Define simulation parameters
x = seq(0,1,1/99) #Spatial grid
dt = 0.5 #Time step
tMax = 200 #Simulation time
c = 0.015 #Wave speed

fLeft = function(t) 0 #Left boundary condition
fRight = f(t) 0.1*sin(t/10) #Right boundary condition
fPosInitial = f(x) exp(-200*(x-0.5)^2) #Initial position
fVelInitial = f(x) 0*x #Initial velocity

#Run simulation
dx = x[2]-x[1]
r = c*dt/dx

n = length(x)
t = seq(0,tMax,dt)  #Time vector
m = length(t) + 1


Now iteratively compute u(x,t) by imposing the following boundary conditions

1. u(0,t) = 0
2. u(L,t) = (1/10).sin(t/10)

along with the following initial conditions

1. u(x,0) = exp(-500.(x-1/2)2)
2. ∂u(x,0)/∂t = 0.x

as defined in the above code snippet.


#Create tri-diagonal matrix A here, as described in the algorithm

#Impose initial condition
u <- matrix(rep(0, n*m), nrow=m)
u[1,] = fPosInitial(x)
u[1,1] = fLeft(0)
u[1,n] = fRight(0)

for (i in 1:length(t)) {

#Find solution at next time step
if (i == 1) {
u[2,] = 1/2*(A%*%u[1,]) + dt*fVelInitial(x)
} else {
u[i+1,] = (A%*%u[i,])-u[i-1,]
}

u[i+1,1] = fLeft(t[i]) #Impose left B.C.
u[i+1,n] = fRight(t[i]) #Impose right B.C.
}



The following animation shows the output of the above implementation of the solution of wave PDE using R, it shows how the waves propagate, given a set of BCs and ICs.

## Some audio processing: De-noising an audio file with Fourier Transform

Let’s denoise an input noisy audio file (part of the theme music from Satyajit Ray’s famous movie পথের পাঁচালী, Songs of the road) using python scipy.fftpack module’s fft() implementation.

The noisy input file was generated and uploaded here.


from scipy.io import wavfile
import scipy.fftpack as fp
import numpy as np
import matplotlib.pylab as plt

# first fft to see the pattern then we can edit to see signal and inverse
# back to a better sound
Fs, y = wavfile.read('pather_panchali_noisy.wav') # can be found here: "https://github.com/sandipan/edx/blob/courses/pather_panchali_noisy.wav" data-mce-href="https://github.com/sandipan/edx/blob/courses/pather_panchali_noisy.wav".

# you may want to scale y, in this case, y was already scaled in between [-1,1]
Y = fp.fftshift(fp.fft(y))  #Take the Fourier series and take a symmetric shift
fshift = np.arange(-n//2, n//2)*(Fs/n)

plt.plot(fshift, Y.real) #plot(t, y); #plot the sound signal
plt.show()



You will obtain a figure like the following:


L = len(fshift) #Find the length of frequency values
Yfilt = Y

# find the frequencies corresponding to the noise here
print(np.argsort(Y.real)[::-1][:2])
# 495296 639296

Yfilt[495296] = Yfilt[639296] = 0  # find the frequencies corresponding to the spikes
plt.plot(fshift, Yfilt.real)
plt.show()

soundFiltered = fp.ifft(fp.ifftshift(Yfilt)).real
wavfile.write('pather_panchali_denoised.wav', Fs, soundNoisy)


The de-noised output wave file produced with the filtering can be found here
(use vlc media player to listen to input and output wave files)

# Solving Some Image Processing and Computer Vision Problems with Python libraries

In this article, a few image processing / computer vision problems and their solutions  with python libraries (scikit-image, cv2) will be discussed. Some of the problems are from the exercises from this book (available on amazon). This blog will be continued here.

## Removing Gaussian Noise from images by computing mean and median images with scikit-image

2. Create n (e.g, n=100) noisy images by adding i.i.d. Gaussian noise (with zero mean) to the original image, with scikit-image.
3. Compute the mean (median) of the noisy images.
4. Compare PSNR with the original image.
5. Vary n and compare the results.
from skimage import img_as_float
from skimage.util import random_noise
from skimage.measure import compare_psnr
import matplotlib.pylab as plt
import numpy as np

im = img_as_float(imread('../new images/parrot.jpg')) # original image
np.random.seed(0)
# generate n noisy images from the original image by adding Gaussian noise
n = 25
images = np.zeros((n, im.shape[0], im.shape[1], im.shape[2]))
sigma = 0.2
for i in range(n):
images[i,...] = random_noise(im, var=sigma**2)

im_mean = images.mean(axis=0)
im_median = np.median(images, axis=0)
plt.figure(figsize=(20,16))
plt.subplots_adjust(left=.02, right=.98, bottom=.001, top=.96, wspace=.05, hspace=.01)
plt.subplot(221), plt.imshow(im), plt.axis('off'), plt.title('Original image', size=20)
plt.subplot(222), plt.imshow(images[0]), plt.axis('off'), plt.title('Noisy PSNR: ' + str(compare_psnr(im, images[0])), size=20)
plt.subplot(223), plt.imshow(im_mean), plt.axis('off'), plt.title('Mean PSNR: ' + str(compare_psnr(im, im_mean)), size=20)
plt.subplot(224), plt.imshow(im_median), plt.axis('off'), plt.title('Median PSNR: ' + str(compare_psnr(im, im_median)), size=20)
plt.show()



The next figure shows the original image, a noisy image generated from it by adding Gaussian noise (with 0 mean) to it and the images obtained by taking mean / median over all the n noisy images generated. As can be seen, the Gaussian noise in the images gets cancelled out by taking mean / median.

with n = 25

with n=100


plt.hist(images[:,100,100,0], color='red', alpha=0.2, label='red')
plt.hist(images[:,100,100,1], color='green', alpha=0.2, label='green')
plt.hist(images[:,100,100,2], color='blue', alpha=0.2, label='blue')
plt.legend()
plt.grid()
plt.show()



The next figure shows how a pixel value (that can be considered a random variable) for a particular location in different noisy images follows approximately a Gaussian distribution.

Distribution of a pixel value at location (100,100) in the noisy images


ns = [25, 50, 100, 200]
# mean_psnrs contain the PSNR values for different n
plt.plot(ns, mean_psnrs, '.--', label='PSNR (mean)')
plt.plot(ns, median_psnrs, '.--', label='PSNR (median)')
plt.legend()
plt.xlabel('n'),  plt.ylabel('PSNR')
plt.show()



The following figure shows that the PSNR improves with large n (since by SLLN / WLLN, the sample mean converges to population mean 0 of the Gaussian noise). Also, for median the improvement in the image quality is higher for larger values of n.

## Tracking Pedestrians with HOG-SVM with OpenCV / scikit-image

2. Capture the video / extract frames from the video.
3. For each frame
1. Create HOG scale pyramid of the frame image.
2. At each scale, use a sliding window to extract the corresponding block from the frame, compute the HOG descriptor features.
3. Use cv2‘s HOGDescriptor_getDefaultPeopleDetector() – a pre-trained SVM classifier on the HOG descriptor to classify whether the corresponding block contains a pedestrian or not.
4. Run non-max-suppression to get rid of multiple detection of the same person.
5. Use cv2‘s  detectMultiScale() function to implement steps 3-4.

The code is adapted from the code here and here.


# HOG descriptor using default people (pedestrian) detector
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

# run detection, using a spatial stride of 4 pixels,
# a scale stride of 1.02, and zero grouping of rectangles
# (to demonstrate that HOG will detect at potentially
# multiple places in the scale pyramid)
(foundBoundingBoxes, weights) = hog.detectMultiScale(frame, winStride=(4, 4), padding=(8, 8), scale=1.02, finalThreshold=0, useMeanshiftGrouping=False)

# convert bounding boxes from format (x1, y1, w, h) to (x1, y1, x2, y2)
rects = np.array([[x, y, x + w, y + h] for (x, y, w, h) in foundBoundingBoxes])

# run non-max suppression on the boxes based on an overlay of 65%
nmsBoundingBoxes = non_max_suppression(rects, probs=None, overlapThresh=0.65)



cv2 functions are used to extract HOG descriptor features and pedestrian detection with SVM,  whereas scikit-image functions are used to visualize the HOG features. The animations below display the original video, what HOG sees and  the detected pedestrians after non-max suppression. Notice there are a few false positive detection.

Original Video

HOG-descriptor features video (what HOG sees)Original Video with detected Pedestrians

1. Capture video with webcam with cv2.VideoCapture().
2. For each frame, use the pre-trained Adaboost Cascade classifiers (the haarcascade_frontalface_default classifier for face detection and haarcascade_eye_tree_eyeglasses classifier for better detection of the eyes with glasses, from the corresponding xml files that come with cv2’s installation) using Haar-like features with cv2.CascadeClassifier().
3. First detect the face(s) with the detectMultiScale() function and draw a bounding box. Then detect the eyes inside a detected face with the same function.
4. The following python code snippet shows how to detect faces and eyes with cv2. The code is adapted from here.


gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
frame = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
# return bounding box of the face(s) if one is detected
for (x,y,w,h) in faces:
frame = cv2.rectangle(frame,(x,y),(x+w,y+h),(255,0,0),2)
roi_gray = gray[y:y+h, x:x+w]
roi_color = frame[y:y+h, x:x+w]
for (ex,ey,ew,eh) in eyes:
cv2.rectangle(roi_color,(ex,ey),(ex+ew,ey+eh),(0,255,0),2)



The next animation shows the results of face detection when scalefactor 1.03 was used to create the scale pyramid.  As can be seen, the eyes with the glasses on and some small faces from the photos are not detected at this scale.

The next animation shows the results of face detection when scalefactor 1.3 was used to create the scale pyramid.  As can be seen, the eyes with/without the glasses on as well as most of the small faces from the photos are detected at this scale most of the time.

## Semantic Segmentation with ENet / DeepLab (Deep Learning  model)

Input video and the segmented Output video

Input video and the segmented Output video

# Solving Some Image Processing Problems with Python libraries

In this article a few popular image processing problems along with their solutions are going to be discussed. Python image processing libraries are going to be used to solve these problems. Some of the problems are from the exercises from this book  (available on amazon).

## Image Transformations and Warping

### 0. Display RGB image color channels in 3D

1. A gray-scale image can be thought of a 2-D function f(x,y) of the pixel locations (x,y), that maps each pixel into its corresponding gray level (an integer in [0,255], e.g.,).
2. For an RGB image there are 3 such functions, f_R(x,y), f_G(x.y), f_B(x.y).
3. matplotlib’s 3-D plot functions can be used to plot each function.

The following python code shows how to plot the RGB channels separately in 3D:

import matplotlib.pylab as plt
from mpl_toolkits.mplot3d import Axes3D
def plot_3d(X, Y, Z, title, cmap):
# implement this function to plot the channel pixel values in 3D
plt.show()

Y = np.arange(im.shape[0])
X = np.arange(im.shape[1])
Z1 = im[...,0]
Z2 = im[...,1]
Z3 = im[...,2]
plot_3d(X, Y, Z1, cmap='Reds', title='3D plot for the Red Channel')
plot_3d(X, Y, Z2, cmap='Greens', title='3D plot for the Green Channel')
plot_3d(X, Y, Z3, cmap='Blues', title='3D plot for the Blue Channel')



The RGB image

### 1. Wave Transform

1. Use scikit-image’s warp() function to implement the wave transform.
2. Note that wave transform can be expressed with the following equations:

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:


def wave(xy):
xy[:, 1] += 20*np.sin(2*np.pi*xy[:, 0]/64)
return xy

from skimage.transform import warp
import matplotlib.pylab as plt
im = warp(im, wave)
plt.imshow(im)
plt.show()



The next figure shows the original mandrill input image and the output image obtained after applying the wave transform.

### 2. Swirl Transform

1. Use scikit-image’s warp() function to implement the swirl transform.
2. Note that swirl transform can be expressed with the following equations

We shall use the madrill image to implement the wave transform. The next python code fragment shows how to do it:


def swirl(xy, x0, y0, R):
r = np.sqrt((xy[:,1]-x0)**2 + (xy[:,0]-y0)**2)
a = np.pi * r / R
xy[:, 1] = (xy[:, 1]-x0)*np.cos(a) + (xy[:, 0]-y0)*np.sin(a) + x0
xy[:, 0] = -(xy[:, 1]-x0)*np.sin(a) + (xy[:, 0]-y0)*np.cos(a) + y0
return xy

im = warp(im, swirl, map_args={'x0':112, 'y0':112, 'R':512})
plt.imshow(im)
plt.axis('off')
plt.show()



The next figure shows the original mandrill input image and the output image obtained after applying the swirl transform.

Compare this with the output of the scikit-image swirl() function.

### 3. Very simple Face morphing with α-blending

1. Start from one face image (e.g., let image1 be the face of Messi) and end into another image (let image2 be the face of Ronaldo) iteratively, creating some intermediate images in between.
2. At each iteration create an image by using a linear combination of the two image numpy ndarrays given by

3. Iteratively increase α from 0 to 1.

The following code block shows how to implement it using matplotlib’s image and pylab modules.

im1 = mpimg.imread("../images/messi.jpg") / 255 # scale RGB values in [0,1]
i = 1
plt.figure(figsize=(18,15))
for alpha in np.linspace(0,1,20):
plt.subplot(4,5,i)
plt.imshow((1-alpha)*im1 + alpha*im2)
plt.axis('off')
i += 1
plt.show()


The next animation shows the simple face morphing:

There are more sophisticated techniques to improve the quality of morphing, but this is the simplest one.

### 4. Creating Instagram-like Gotham Filter

#### The Gotham filter

The Gotham filter is computed as follows (the steps taken from here), applying the following operations on an image, the corresponding python code, input and output images are shown along with the operations (with the following input image):

1. A mid-tone red contrast boost
from PIL import Image
import numpy as np
import matplotlib.pylab as plt
im = Image.open('../images/city.jpg') # pixel values in [0,255]
r, g, b = im.split()
red_levels = [0., 12.75, 25.5, 51., 76.5, 127.5, 178.5, 204., 229.5, 242.25, 255.]
r1 = Image.fromarray((np.reshape(np.interp(np.array(r).ravel(), np.linspace(0,255,len(red_levels)), red_levels), (im.height, im.width))).astype(np.uint8), mode='L')
plt.figure(figsize=(20,15))
plt.subplot(221)
plt.imshow(im)
plt.title('original', size=20)
plt.axis('off')
plt.subplot(222)
im1 = Image.merge('RGB', (r1, g, b))
plt.imshow(im1)
plt.axis('off')
plt.title('with red channel interpolation', size=20)
plt.subplot(223)
plt.hist(np.array(r).ravel(), normed=True)
plt.subplot(224)
plt.hist(np.array(r1).ravel(), normed=True)
plt.show()


2. Make the blacks a little bluer
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(im1)
plt.title('last image', size=20)
plt.axis('off')
b1 = Image.fromarray(np.clip(np.array(b) + 7.65, 0, 255).astype(np.uint8))
im1 = Image.merge('RGB', (r1, g, b1))
plt.subplot(122)
plt.imshow(im1)
plt.axis('off')
plt.title('with transformation', size=20)
plt.tight_layout()
plt.show()


3. A small sharpening
from PIL.ImageEnhance import Sharpness
plt.figure(figsize=(20,10))
plt.subplot(121)
plt.imshow(im1)
plt.title('last image', size=20)
plt.axis('off')
im2 = Sharpness(im1).enhance(3.0)
plt.subplot(122)
plt.imshow(im2)
plt.axis('off')
plt.title('with transformation', size=20)
plt.tight_layout()
plt.show()


4. A boost in blue channel for lower mid-tones
5. A decrease in blue channel for upper mid-tones
blue_levels = [0., 11.985, 30.09, 64.005, 81.09, 99.96, 107.1, 111.945, 121.125, 143.055, 147.9, 159.885, 171.105, 186.915, 215.985, 235.875, 255.]
b2 = Image.fromarray((np.reshape(np.interp(np.array(b1).ravel(), np.linspace(0,255,len(blue_levels)), blue_levels), (im.height, im.width))).astype(np.uint8), mode='L')
plt.figure(figsize=(20,15))
plt.subplot(221)
plt.imshow(im2)
plt.title('last image', size=20)
plt.axis('off')
plt.subplot(222)
im3 = Image.merge('RGB', (r1, g, b2))
plt.imshow(im3)
plt.axis('off')
plt.title('with blue channel interpolation', size=20)
plt.subplot(223)
plt.hist(np.array(b1).ravel(), normed=True)
plt.subplot(224)
plt.hist(np.array(b2).ravel(), normed=True)
plt.show()


The output image obtained after applying the Gotham filter is shown below:

## Down-sampling with anti-aliasing using Gaussian Filter

1. Start with a large gray-scale image and reduce the image size 16 times, by reducing both height and width by 4 times.
2. Select every 4th pixel in the x and the y direction from the original image to compute the values of the pixels in the smaller image.
3. Before down-sampling apply a Gaussian filter (to smooth the image) for anti-aliasing.
4. Compare the quality of the output image obtained by down-sampling without a Gaussian filter (with aliasing).

The next code block performs the above steps. Since the Gaussian blur is a low-pass filter, it removes the high frequencies from the original input image, hence it’s possible to achieve sampling rate above the Nyquist rate (by sampling theorem) to avoid aliasing.


from scipy.ndimage import gaussian_filter
print(im.shape)
plt.figure(figsize=(20,20))
plt.imshow(im)
plt.show()
plt.figure(figsize=(20,20))
im_blurred = gaussian_filter(im, sigma=2.5) #(5,5,1)
plt.imshow(im_blurred)
plt.show()
n = 4 # create and image 16 times smaller in size
w, h = im.shape[0] // n, im.shape[1] // n
im_small = np.zeros((w,h))
for i in range(w):
for j in range(h):
im_small[i,j] = im[n*i, n*j]
plt.figure(figsize=(20,20))
plt.imshow(im_small)
plt.show()
im_small = np.zeros((w,h))
for i in range(w):
for j in range(h):
im_small[i,j] = im_blurred[n*i, n*j]
plt.figure(figsize=(20,20))
plt.imshow(im_small)
plt.show()



Original Image

Image blurred with Gaussian Filter LPF

Down-sampled Image from the original image (with aliasing)

Down-sampled Image from the blurred image (with anti-aliasing)

## Some Applications of DFT

### 0. Fourier Transform of a Gaussian Kernel is another Gaussian Kernel

Also, the spread in the frequency domain  inversely proportional to the spread in the spatial domain (known as Heisenberg’s inequality). Here is the proof:

The following animation shows an example visualizing the Gaussian contours in spatial and corresponding frequency domains:

### 1. Using DFT to up-sample an image

1. Let’s use the lena gray-scale image.
2. First double the size of the by padding zero rows/columns at every alternate positions.
3. Use FFT followed by an LPF.
4. Finally use IFFT to get the output image.

The following code block shows the python code for implementing the steps listed above:


import numpy as np
import numpy.fft as fp
import matplotlib.pyplot as plt

im1 = np.zeros((2*im.shape[0], 2*im.shape[1]))
print(im.shape, im1.shape)
for i in range(im.shape[0]):
for j in range(im.shape[1]):
im1[2*i,2*j] = im[i,j]

return vector

# the LPF kernel
kernel = [[0.25, 0.5, 0.25], [0.5, 1, 0.5], [0.25, 0.5, 0.25]]
# enlarge the kernel to the shape of the image

plt.figure(figsize=(15,10))
plt.gray() # show the filtered result in grayscale

freq = fp.fft2(im1)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
freq_LPF = freq*freq_kernel # by the Convolution theorem
im2 = fp.ifft2(freq_LPF)
freq_im2 = fp.fft2(im2)

plt.subplot(2,3,1)
plt.imshow(im)
plt.title('Original Image', size=20)
plt.subplot(2,3,2)
plt.imshow(im1)
plt.subplot(2,3,3)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq))).astype(int), cmap='jet')
plt.title('Original Image Spectrum', size=20)
plt.subplot(2,3,4)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_kernel))).astype(int), cmap='jet')
plt.title('Image Spectrum of the LPF', size=20)
plt.subplot(2,3,5)
plt.imshow( (20*np.log10( 0.1 + fp.fftshift(freq_im2))).astype(int), cmap='jet')
plt.title('Image Spectrum after LPF', size=20)
plt.subplot(2,3,6)
plt.imshow(im2.astype(np.uint8)) # the imaginary part is an artifact
plt.title('Output Image', size=20)



The next figure shows the output. As can be seen from the next figure, the LPF removed the high frequency components from the Fourier spectrum of the padded image and with a subsequent inverse Fourier transform  we get a decent enlarged image.

### 2. Frequency Domain Gaussian Filter

1. Use an input image and use DFT to create the frequency 2D-array.
2. Create a small Gaussian 2D Kernel (to be used as an LPF) in the spatial domain and pad it to enlarge it to the image dimensions.
3. Use DFT to obtain the Gaussian Kernel in the frequency domain.
4. Use the Convolution theorem to convolve the LPF with the input image in the frequency domain.
5. Use IDFT to obtain the output image.
6. Plot the frequency spectrum of the image, the gaussian kernel and the image obtained after convolution in the frequency domain, in 3D.

The following code block shows the python code:


import matplotlib.pyplot as plt
from matplotlib import cm
from skimage.color import rgb2gray
import scipy.fftpack as fp

kernel = np.outer(signal.gaussian(im.shape[0], 10), signal.gaussian(im.shape[1], 10))
freq = fp.fft2(im)
assert(freq.shape == kernel.shape)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved = freq*freq_kernel # by the Convolution theorem
im_blur = fp.ifft2(convolved).real
im_blur = 255 * im_blur / np.max(im_blur)

# center the frequency response
plt.imshow( (20*np.log10( 0.01 + fp.fftshift(freq_kernel))).real.astype(int), cmap='coolwarm')
plt.colorbar()
plt.show()

plt.figure(figsize=(20,20))
plt.imshow(im, cmap='gray')
plt.show()

from mpl_toolkits.mplot3d import Axes3D
from matplotlib.ticker import LinearLocator, FormatStrFormatter
# ... code for 3D visualization of the spectrums


#### The original color temple image (time / spatial domain)

The temple image (frequency domain)

#### The smoothed temple image with the LPF (frequency domain)

If we set the standard deviation of the LPF Gaussian kernel to be 10 we get the following output as shown in the next figures. As can be seen, the frequency response value drops much quicker from the center.

#### The smoothed temple image with the LPF with higher s.d. (frequency domain)

The output image after convolution (spatial / time domain)

3. Using the inverse filter to restore a motion-blurred image

1. First create a motion blur kernel of a given shape.
2. Convolve the kernel with an input image in the frequency domain.
3. Get the motion-blurred image in the spatial domain with IDFT.
4. Compute the inverse filter kernel and convolve with the blurred image in the frequency domain.
5. Get the convolved image back in the spatial domain.
6. Plot all the images and kernels in the frequency domain.

The following code block shows the python code:


# create the motion blur kernel
size = 21
kernel = np.zeros((size, size))
kernel[int((size-1)/2), :] = np.ones(size)
kernel = kernel / size

freq = fp.fft2(im)
freq_kernel = fp.fft2(fp.ifftshift(kernel))
convolved1 = freq1*freq_kernel1
im_blur = fp.ifft2(convolved1).real
im_blur = im_blur / np.max(im_blur)

epsilon = 10**-6

freq = fp.fft2(im_blur)
freq_kernel = 1 / (epsilon + freq_kernel1)

convolved = freq*freq_kernel
im_restored = fp.ifft2(convolved).real
im_restored = im_restored / np.max(im_restored)

plt.figure(figsize=(18,12))
plt.subplot(221)
plt.imshow(im)
plt.title('Original image', size=20)
plt.axis('off')
plt.subplot(222)
plt.imshow(im_blur)
plt.title('Blurred image with motion blur kernel', size=20)
plt.axis('off')
plt.subplot(223)
plt.imshow(im_restored)
plt.title('Restored image with inverse filter', size=20)
plt.axis('off')
plt.subplot(224)
plt.imshow(im_restored - im)
plt.title('Diff restored &amp;amp;amp;amp;amp;amp; original image', size=20)
plt.axis('off')
plt.show()

# Plot the surface of the frequency responses here



Frequency response of the input image

(log) Frequency response of the motion blur kernel (LPF)

Input image convolved with the motion blur kernel (frequency domain)

(log) Frequency response of the inverse frequency filter kernel (HPF)

Motion-blurred image convolved with the inverse frequency filter kernel (frequency domain)

4. Impact of noise on the inverse filter

1. Add some random noise to the Lena image.
2. Blur the image with a Gaussian kernel.
3. Restore the image using inverse filter.

With the original image

Let’s first blur and apply the inverse filter on the noiseless blurred image. The following figures show the outputs:

(log) Frequency response of the input image

(log) Frequency response of the Gaussian blur kernel (LPF)

(log) Frequency response of the blurred image

(log) Frequency response of the inverse kernel (HPF)

Frequency response of the output image

Adding noise to the original image

The following python code can be used to add Gaussian noise to an image:


from skimage.util import random_noise
im = random_noise(im, var=0.1)



The next figures show the noisy lena image, the blurred image with a Gaussian Kernel and the restored image with the inverse filter. As can be seen, being a high-pass filter, the inverse filter enhances the noise, typically corresponding to high frequencies.

#### 5. Use a notch filter to remove periodic noise from the following half-toned car image.

1. Use DFT to obtain the frequency spectrum of the image.
2. Block the high frequency components that are most likely responsible fro noise.
3. Use IDFT to come back to the spatial domain.


from scipy import fftpack
F1 = fftpack.fft2((im).astype(float))
F2 = fftpack.fftshift(F1)
for i in range(60, w, 135):
for j in range(100, h, 200):
if not (i == 330 and j == 500):
F2[i-10:i+10, j-10:j+10] = 0
for i in range(0, w, 135):
for j in range(200, h, 200):
if not (i == 330 and j == 500):
F2[max(0,i-15):min(w,i+15), max(0,j-15):min(h,j+15)] = 0
plt.figure(figsize=(6.66,10))
plt.imshow( (20*np.log10( 0.1 + F2)).astype(int), cmap=plt.cm.gray)
plt.show()
im1 = fp.ifft2(fftpack.ifftshift(F2)).real
plt.figure(figsize=(10,10))
plt.imshow(im1, cmap='gray')
plt.axis('off')
plt.show()



Frequency response of the input image

Frequency response of the input image with blocked frequencies with notch

Output image

With a low-pass-filter (LPF):

Frequency response of the input image with blocked frequencies with LPF

Output image

## Histogram Matching with color images

As described here, here is the algorithm:

1. The cumulative histogram is computed for each image dataset, see the figure below.
2. For any particular value (xi) in the input image data to be adjusted has a cumulative histogram value given by G(xi).
3. This in turn is the cumulative distribution value in the reference (template) image  dataset, namely H(xj). The input data value xi is replaced by xj.

im = imread('images/lena.jpg')
im1 = np.zeros(im.shape).astype(np.uint8)
plt.figure(figsize=(20,10))
for i in range(3):
c = cdf(im[...,i])
c_t = cdf(im_t[...,i])
im1[...,i] = hist_matching(c, c_t, im[...,i]) # implement this function with the above algorithm
c1 = cdf(im1[...,i])
col = 'r' if i == 0 else ('g' if i == 1 else 'b')
plt.plot(np.arange(256), c, col + ':', label='input ' + col.upper(), linewidth=5)
plt.plot(np.arange(256), c_t, col + '--', label='template ' + col.upper(), linewidth=5)
plt.plot(np.arange(256), c1, col + '-', label='output ' + col.upper(), linewidth=2)
plt.title('CDF', size=20)
plt.legend(prop={'size': 15})
plt.show()

plt.figure(figsize=(10,10))
plt.imshow(im1[...,:3])
plt.axis('off')
plt.show()


Input image

Template Image

Output Image

The following figure shows how the histogram of the input image is matched with the histogram of the template image.

Another example:

Input image

Template Image

Output Image

## Mathematical Morphology

### 1. Automatically cropping an image

1. Let’s use the following image. The image has unnecessary white background  outside the molecule of the organic compound.
2. First convert the image to a binary image and compute the convex hull of the molecule object.
3. Use the convex hull image to find the bounding box for cropping.
4. Crop the original image with the bounding box.

The next python code shows how to implement the above steps:


from PIL import Image
from skimage.morphology import convex_hull_image
plt.imshow(im)
plt.title('input image')
plt.show()
im1 = 1 - rgb2gray(im)
threshold = 0.5
im1[im1  threshold] = 1
chull = convex_hull_image(im1)
plt.imshow(chull)
plt.title('convex hull in the binary image')
plt.show()
imageBox = Image.fromarray((chull*255).astype(np.uint8)).getbbox()
cropped = Image.fromarray(im).crop(imageBox)
cropped.save('L_2d_cropped.jpg')
plt.imshow(cropped)
plt.title('cropped image')
plt.show()



This can also be found here.

### 2. Opening and Closing are Dual operations in mathematical morphology

1. Start with a binary image and apply opening operation with some structuring element (e.g., a disk) on it to obtain an output image.
2. Invert the image (to change the foreground to background and vice versa) and apply closing operation on it with the same structuring element to obtain another output image.
3. Invert the second output image obtained and observe that it’s same as the first output image.
4. Thus applying opening operation to the foreground of a binary image is equivalent to applying closing operation to the background of the same image with the same structuring element.

The next python code shows the implementation of the above steps.


from skimage.morphology import binary_opening, binary_closing, disk
from skimage.util import invert
im[im  0.5] = 1
plt.gray()
plt.figure(figsize=(20,10))
plt.subplot(131)
plt.imshow(im)
plt.title('original', size=20)
plt.axis('off')
plt.subplot(1,3,2)
im1 = binary_opening(im, disk(12))
plt.imshow(im1)
plt.title('opening with disk size ' + str(12), size=20)
plt.axis('off')
plt.subplot(1,3,3)
im1 = invert(binary_closing(invert(im), disk(12)))
plt.imshow(im1)
plt.title('closing with disk size ' + str(12), size=20)
plt.axis('off')
plt.show()



As can be seen the output images obtained are exactly same.

## Floyd-Steinberg Dithering (to convert a grayscale to a binary image)

The next figure shows the algorithm for error diffusion dithering.


def find_closest_palette_color(oldpixel):
return int(round(oldpixel / 255)*255)

pixel = np.copy(im)
w, h = im.shape

for x in range(w):
for y in range(h):
oldpixel = pixel[x][y]
newpixel = find_closest_palette_color(oldpixel)
pixel[x][y] = newpixel
quant_error = oldpixel - newpixel
if x + 1 &amp;amp;amp;amp;amp;lt; w-1:
pixel[x + 1][y] = pixel[x + 1][y] + quant_error * 7 / 16
if x &amp;amp;amp;amp;amp;gt; 0 and y &amp;amp;amp;amp;amp;lt; h-1:
pixel[x - 1][y + 1] = pixel[x - 1][y + 1] + quant_error * 3 / 16
if y &amp;amp;amp;amp;amp;lt; h-1:
pixel[x ][y + 1] = pixel[x ][y + 1] + quant_error * 5 / 16
if x &amp;amp;amp;amp;amp;lt; w-1 and y &amp;amp;amp;amp;amp;lt; h-1:
pixel[x + 1][y + 1] = pixel[x + 1][y + 1] + quant_error * 1 / 16

plt.figure(figsize=(10,20))
plt.imshow(pixel, cmap='gray')
plt.axis('off')
plt.show()



The input image (gray-scale)

The output Image (binary)

The next animation shows how an another input grayscale image gets converted to output binary image using the error diffusion dithering.

## Sharpen a color image

1. First blur the image with an LPF (e.g., Gaussian Filter).
2. Compute the detail image as the difference between the original and the blurred image.
3. Now the sharpened image can be computed as a linear combination of the original image and the detail  image. The next figure illustrates the concept.

The next python code shows how this can be implemented in python:


from scipy import misc, ndimage
import matplotlib.pyplot as plt
import numpy as np

def rgb2gray(im):
return np.clip(0.2989 * im[...,0] + 0.5870 * im[...,1] + 0.1140 * im[...,2], 0, 1)

im_blurred = ndimage.gaussian_filter(im, (5,5,0))
im_detail = np.clip(im - im_blurred, 0, 1)
fig, axes = plt.subplots(nrows=2, ncols=3, sharex=True, sharey=True, figsize=(15, 15))
axes = axes.ravel()
axes[0].imshow(im)
axes[0].set_title('Original image', size=15)
axes[1].imshow(im_blurred)
axes[1].set_title('Blurred image, sigma=5', size=15)
axes[2].imshow(im_detail)
axes[2].set_title('Detail image', size=15)
alpha = [1, 5, 10]
for i in range(3):
im_sharp = np.clip(im + alpha[i]*im_detail, 0, 1)
axes[3+i].imshow(im_sharp)
axes[3+i].set_title('Sharpened image, alpha=' + str(alpha[i]), size=15)
for ax in axes:
ax.axis('off')
fig.tight_layout()
plt.show()



The next figure shows the output of the above code block. As cane be seen, the output gets more sharpened as the value of alpha gets increased.

The next animation shows how the image gets more and more sharpened with increasing alpha.

## Edge Detection with LOG and Zero-Crossing Algorithm by Marr and Hildreth

The following figure shows LOG filter and its DOG approximation.

In order to detect edges as a binary image, finding the zero-crossings in the LoG-convolved image was proposed by Marr and Hildreth. Identification of the edge pixels can be done by viewing the sign of the LoG-smoothed image by defining it as a binary image, the algorithm is as follows:

Algorithm to compute the zero-crossing

1. First convert the LOG-convolved image to a binary image, by replacing the pixel values by 1 for positive values and 0 for negative values.
2. In order to compute the zero crossing pixels, we need to simply look at the boundaries of the non-zero regions in this binary image.
3. Boundaries can be found by finding any non-zero pixel that has an immediate neighbor which is is zero.
4. Hence, for each pixel, if it is non-zero, consider its 8 neighbors, if any of the neighboring pixels is zero, the pixel can be identified as an edge.

The next python code and the output images / animations generated show how to detect the edges from the  zebra image with LOG + zero-crossings:


from scipy import ndimage, misc
import matplotlib.pyplot as plt
from skimage.color import rgb2gray
def any_neighbor_zero(img, i, j):
for k in range(-1,2):
for l in range(-1,2):
if img[i+k, j+k] == 0:
return True
return False
def zero_crossing(img):
img[img > 0] = 1
img[img  0 and any_neighbor_zero(img, i, j):
out_img[i,j] = 255
return out_img

fig = plt.figure(figsize=(25,15))
plt.gray() # show the filtered result in grayscale

for sigma in range(2,10, 2):
plt.subplot(2,2,sigma/2)
result = ndimage.gaussian_laplace(img, sigma=sigma)
plt.imshow(zero_crossing(result))
plt.axis('off')
plt.title('LoG with zero-crossing, sigma=' + str(sigma), size=30)

plt.tight_layout()
plt.show()



Original Input Image

Output with edges detected with LOG + zero-crossing at different sigma scales

With another input image

Output with edges detected with LOG + zero-crossing at different sigma scales

## Constructing the Gaussian Pyramid with scikit-image transform module’s reduce function and Laplacian Pyramid from the Gaussian Pyramid and the expand function

The Gaussian Pyramid can be computed with the following steps:

2. Iteratively compute the image at each level of the pyramid first by smoothing the image (with gaussian filter) and then downsampling it .
3. Stop at a level where the image size becomes sufficiently small (e.g., 1×1).

The Laplacian Pyramid can be computed with the following steps:

2. Iteratively compute the difference image in between the image at the current level and the image obtained by first upsampling and then smoothing the image (with gaussian filter) from the previous level of the Gaussian Pyramid.
3. Stop at a level where the image size becomes equal to the original image size.

The next python code shows how to create a Gaussian Pyramid from an image.

import numpy as np
import matplotlib.pyplot as plt
from skimage.color import rgb2gray
from skimage.transform import pyramid_reduce, pyramid_expand, resize

def get_gaussian_pyramid(image):
rows, cols, dim = image.shape
gaussian_pyramid = [image]
while rows&amp;amp;gt; 1 and cols &amp;amp;gt; 1:
image = pyramid_reduce(image, downscale=2)
gaussian_pyramid.append(image)
print(image.shape)
rows //= 2
cols //= 2
return gaussian_pyramid

def get_laplacian_pyramid(gaussian_pyramid):
laplacian_pyramid = [gaussian_pyramid[len(gaussian_pyramid)-1]]
for i in range(len(gaussian_pyramid)-2, -1, -1):
image = gaussian_pyramid[i] - resize(pyramid_expand(gaussian_pyramid[i+1]), gaussian_pyramid[i].shape)
laplacian_pyramid.append(np.copy(image))
laplacian_pyramid = laplacian_pyramid[::-1]
return laplacian_pyramid

gaussian_pyramid = get_gaussian_pyramid(image)
laplacian_pyramid = get_laplacian_pyramid(gaussian_pyramid)

w, h = 20, 12
for i in range(3):
plt.figure(figsize=(w,h))
p = gaussian_pyramid[i]
plt.imshow(p)
plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
plt.axis('off')
w, h = w / 2, h / 2
plt.show()

w, h = 10, 6
for i in range(1,4):
plt.figure(figsize=(w,h))
p = laplacian_pyramid[i]
plt.imshow(rgb2gray(p), cmap='gray')
plt.title(str(p.shape[0]) + 'x' + str(p.shape[1]), size=20)
plt.axis('off')
w, h = w / 2, h / 2
plt.show()


## Blending images with Gaussian and Laplacian pyramids

Here is the algorithm:

Blending the following input images A, B with mask image M

Input Image A (Goddess Durga)

Input Image B (Lord Shiva)

with the following python code creates the output image I shown below


# get the Gaussian and Laplacian pyramids, implement the functions
pyramidA = get_laplacian_pyramid(get_gaussian_pyramid(A))
pyramidB = get_laplacian_pyramid(get_gaussian_pyramid(B))
pyramidM = get_gaussian_pyramid(M)

pyramidC = []
for i in range(len(pyramidM)):
im = pyramidM[i]*pyramidA[i] + (1-pyramidM[i])*pyramidB[i]
pyramidC.append(im)
# implement the following function to construct an image from its laplacian pyramids
I = reconstruct_image_from_laplacian_pyramid(pyramidC)



Output Image I (Ardhanarishwar)
The following animation shows how the output image is formed:

Another blending (horror!) example (from prof. dmartin)

# Detection of a Human Object with HOG Descriptor Features using SVM (Primal QuadProg implementation using CVXOPT) in Python

In this article, first how to extract the HOG descriptor from an image will be discuss. Then how a support vector machine binary classifier can be trained on a dataset containing labeled images (using the extracted HOG descriptor features) and later how the SVM model can be used (along with a sliding window) to predict whether or not a human object exists in a test image will be described.  How SVM can be represented as a Primal Quadratic Programming problem and can be solved with CVXOPT that will also be discussed. This problem appeared as an assignment problem in this Computer Vision course from UCF.

## Problem 1: Compute HOG features

Let’s first Implement Histogram of Orientated Gradients (HOG). The dataset to be used is the INRIA Person Dataset from here. The dataset consists of positive and negative examples for training as well as testing images. Let us do the following:

i. Take 2003 positive training images of size 96×160
ii. Take 997 negative training images of size 96×160
iii. Compute HOG for positive and negative examples.
iv. Show the visualization of HOG for some positive and negative examples.

The Histograms of Oriented Gradients for Human Detection (HOG) is a very heavily cited paper by N. Dalal and B. Triggs from CVPR 2005. The following figure shows the  algorithm proposed by them can be used to compute the HOG features for a 96×160 image:

The next python code snippet shows some helper functions to compute the hog features:


import numpy as np
from scipy import signal
import scipy.misc

def s_x(img):
kernel = np.array([[-1, 0, 1]])
imgx = signal.convolve2d(img, kernel, boundary='symm', mode='same')
return imgx
def s_y(img):
kernel = np.array([[-1, 0, 1]]).T
imgy = signal.convolve2d(img, kernel, boundary='symm', mode='same')
return imgy

imgx = s_x(img)
imgy = s_y(img)
s = np.sqrt(imgx**2 + imgy**2)
theta = np.arctan2(imgx, imgy) #imgy, imgx)
theta[theta<0] = np.pi + theta[theta<0]
return (s, theta)


The following figures animations show some positive and negative training examples along with the HOG features computed using the algorithm.

### Positive Example 1

The next animation shows how the HOG features are computed using the above algorithm.

### Positive Example 2

The next animation shows how the HOG features are computed using the above algorithm.

### Positive Example 3

The next animation shows how the HOG features are computed using the above algorithm.

### Negative Example 1

The next animation shows how the HOG features are computed using the above algorithm.

## Problem 2: Use sklearn’s SVC and 80-20 validation to compute accuracy on the held-out training images dataset using the extracted HOG features.

Before implementing SVC on our own with primal quadratic programming solver, let’s use the scikit-learn SVC implementation (with linear kernel) to train a support vector classifier on the training positive and negative examples using the HOG features extracted  from the training images with 80-20 validation and compute accuracy of classification on the held-out images.

The following python code does exactly that, with the X matrix containing the 1620 HOG features extracted from each image and the corresponding label (pos/neg, depending on whether human is present or not), with 98.5% accuracy on the held-out dataset.


import time
from sklearn.metrics import accuracy_score
from sklearn.cross_validation import train_test_split
from sklearn.svm import SVC
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y, train_size=0.8, random_state=123)
timestamp1 = time.time()
clf = SVC(C=1, kernel='linear')
clf.fit(Xtrain, ytrain)
print("%d support vectors out of %d points" % (len(clf.support_vectors_), len(Xtrain)))
timestamp2 = time.time()
print "sklearn LinearSVC took %.2f seconds" % (timestamp2 - timestamp1)
ypred = clf.predict(Xtest)
print('accuracy', accuracy_score(ytest, ypred))


430 support vectors out of 2400 points
sklearn LinearSVC took 3.40 seconds
accuracy 0.985

The next figures show the confusion matrices for the prediction on the held-out dataset with the SVC model learnt.

## Problem 3: Implement SVM by solving the Primal form of the problem using Quadratic Programming

Let’s implement Support Vector Machine (SVM) using Quadratic Programming. We shall use python’s CVXOPT package for this purpose. Let’s do the following:

i. Try to understand each input term in cvxopt.solvers.qp.
ii. Formulate soft- margin primal SVM in term of inputs of cvxopt.solvers.qp
iii. Show ‘P’, ‘Q’, ‘G”, ‘h’, ‘A’ and ‘b’ Matrices.
iv. Obtain parameter vector ‘w’ and bias term ‘b’ using cvxopt.solvers.qp

To be done

## Problem 4: Detect Human in testing images using trained model (‘w’, ‘b’) from the last problem

Let’s use the coefficients learnt by the SVM model from the training dataset and do the following:

i. Take at least 5 testing images from Test/pos.
ii. Test the trained model over testing images. Testing can be performed using
w*feature vector + b.
iii. Use sliding window approach to obtain detection at each location in the image.
iv. Perform non-maximal suppression and choose the highest scored location.
v. Display the bounding box at the final detection.

To be done

# Few Machine Learning Problems (with Python implementations)

In this article a few machine learning problems from a few online courses will be described.

# 1. Fitting the distribution of heights data

This problem appeared as an assignment problem in the coursera course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself. This video explains this problem and the solution in details.

In this assessment the steepest descent will be used to fit a Gaussian model to the distribution of heights data that was first introduced in Mathematics for Machine Learning: Linear Algebra.

The algorithm is the same as Gradient descent but this time instead of descending a
pre-defined function, we shall descend the χ2 (chi squared) function which is both a function of the parameters that we are to optimize, but also the data that the model is
to fit to.

## Background

We are given a dataset with 100 data-points for the heights of people in a population, with x as heights (in cm.) and y as the probability that there is a person with that height, first few datapoints are shown in the following table:

x y
0 51.25 0.0
1 53.75 0.0
2 56.25 0.0
3 58.75 0.0
4 61.25 0.0
5 63.75 0.0
6 66.25 0.0
7 68.75 0.0
8 71.25 0.0
9 73.75 0.0

The dataset can be plotted as a histogram, i.e., a bar chart where each bar has a width representing a range of heights, and an area which is the probability of finding a person with a height in that range, using the following code.

import matplotlib.pylab as plt
plt.figure(figsize=(15,5))
plt.bar(x, y, width=3, color=greenTrans, edgecolor=green)
plt.xlabel('x')
plt.ylabel('y')
plt.show()


We can model that data with a function, such as a Gaussian, which we can specify with two parameters, rather than holding all the data in the histogram.  The Gaussian function is given as,

By definition χ2 as the squared difference of the data and model, i.e.,  χ|− (xμσ)|^2.

x an y are represented as vectors here, as these are lists of all of the data points, the
|abs-squared|^2 encodes squaring and summing of the residuals on each bar.

To improve the fit, we shall want to alter the parameters μ and σ, and ask how that changes the χ2. That is, we will need to calculate the Jacobian,

Let’s look at the first term, (χ2)/μ, using the multi-variate chain rule, this can be written as,

A similar expression for (χ2)/σ can be obtained as follows:

The Jacobians rely on the derivatives f/μ and f/σ. It’s pretty straightforward to implement the python functions dfdmu() and dfdsig() to compute the derivatives.

Next recall that steepest descent shall move around in parameter space proportional to the negative of the Jacobian, i.e.,

with the constant of proportionality being the aggression of the algorithm.

The following function computes the expression for the Jacobian.


def steepest_step (x, y, mu, sig, aggression) :
J = np.array([
-2*(y - f(x,mu,sig)) @ dfdmu(x,mu,sig),
-2*(y - f(x,mu,sig)) @ dfdsig(x,mu,sig)
])
step = -J * aggression
return step


We need to run a few rounds of steepest descent to fit the model. The next piece of code builds the model with steepest descent to fit the heights data:


# Do a few rounds of steepest descent.
for i in range(50) :
dmu, dsig = steepest_step(x, y, mu, sig, 2000)
mu += dmu
sig += dsig
p = np.append(p, [[mu,sig]], axis=0)



The following animations show the steepest descent path for the parameters and the model fitted to the data, respectively.

The data is shown in orange, the model in magenta, and where they overlap it’s shown in green.

χ2 is represented in the figure as the sum of the squares of the pink and orange bars.

This particular model has not been fit well with the initial guess – since there is not a strong overlap.

But gradually the model fits the data better as more and more iterations of steepest descent are run.

Note that the path taken through parameter space is not necessarily the most direct path, as with steepest descent we always move perpendicular to the contours.

# 2. Back-propagation

This problem also appeared as an assignment problem in the coursera online course Mathematics for Machine Learning: Multivariate Calculus. The description of the problem is taken from the assignment itself.

In this assignment, we shall train a neural network to draw a curve. The curve takes one input variable, the amount traveled along the curve from 0 to 1, and returns 2 outputs, the 2D coordinates of the position of points on the curve.

The below table shows the first few rows of the dataset. Here x is the input variable and y1, y2 are the output variables.

x y1 y2
0 0.00 0.500000 0.625000
1 0.01 0.500099 0.627015
2 0.02 0.500788 0.632968
3 0.03 0.502632 0.642581
4 0.04 0.506152 0.655407
5 0.05 0.511803 0.670852
6 0.06 0.519955 0.688198
7 0.07 0.530875 0.706641
8 0.08 0.544723 0.725326
9 0.09 0.561537 0.743386

The next figures show how the data looks:

To help capture the complexity of the curve, we shall use two hidden layers in our network with 6 and 7 neurons respectively.

We shall implement functions to calculate the Jacobian of the cost function, with respect to the weights and biases of the network. The code will form part of a stochastic steepest descent algorithm that will train our network.

## Feed forward

The following figure shows the feed-forward equations,

In the following python code (taken from the same assignment) defines functions to set up our neural network. Namely an activation function, σ(z), it’s derivative, σ(z), a function to initialize weights and biases, and a function that calculates each activation of the network using feed-forward.

In this assignment we shall use the logistic function as our activation function, rather than the more familiar tanh or relu.


sigma = lambda z : 1 / (1 + np.exp(-z))
d_sigma = lambda z : np.cosh(z/2)**(-2) / 4

# This function initialises the network with it's structure, it also resets any training already done.
def reset_network (n1 = 6, n2 = 7, random=np.random) :
global W1, W2, W3, b1, b2, b3
W1 = random.randn(n1, 1) / 2
W2 = random.randn(n2, n1) / 2
W3 = random.randn(2, n2) / 2
b1 = random.randn(n1, 1) / 2
b2 = random.randn(n2, 1) / 2
b3 = random.randn(2, 1) / 2

# This function feeds forward each activation to the next layer. It returns all weighted sums and activations.
def network_function(a0) :
z1 = W1 @ a0 + b1
a1 = sigma(z1)
z2 = W2 @ a1 + b2
a2 = sigma(z2)
z3 = W3 @ a2 + b3
a3 = sigma(z3)
return a0, z1, a1, z2, a2, z3, a3

# This is the cost function of a neural network with respect to a training set.
def cost(x, y) :
return np.linalg.norm(network_function(x)[-1] - y)**2 / x.size



## Backpropagation

Next we need to implement the functions for the Jacobian of the cost function with respect to the weights and biases. We will start with layer 3, which is the easiest, and work backwards through the layers.

The cost function C is the sum (or average) of the squared losses over all training examples:

The following python code shows how the J_W3 function can be implemented.


# Jacobian for the third layer weights.
def J_W3 (x, y) :
# First get all the activations and weighted sums at each layer of the network.
a0, z1, a1, z2, a2, z3, a3 = network_function(x)
# We'll use the variable J to store parts of our result as we go along, updating it in each line.
# Firstly, we calculate dC/da3, using the expressions above.
J = 2 * (a3 - y)
# Next multiply the result we've calculated by the derivative of sigma, evaluated at z3.
J = J * d_sigma(z3)
# Then we take the dot product (along the axis that holds the training examples) with the final partial derivative,
# i.e. dz3/dW3 = a2
# and divide by the number of training examples, for the average over all training examples.
J = J @ a2.T / x.size
# Finally return the result out of the function.
return J



The following python code snippet implements the Gradient Descent algorithm (where the parameter aggression represents the learning rate and noise acts as a regularization parameter here):


while iterations < max_iteration:
j_W1 = J_W1(x, y) * (1 + np.random.randn() * noise)
j_W2 = J_W2(x, y) * (1 + np.random.randn() * noise)
j_W3 = J_W3(x, y) * (1 + np.random.randn() * noise)
j_b1 = J_b1(x, y) * (1 + np.random.randn() * noise)
j_b2 = J_b2(x, y) * (1 + np.random.randn() * noise)
j_b3 = J_b3(x, y) * (1 + np.random.randn() * noise)

W1 = W1 - j_W1 * aggression
W2 = W2 - j_W2 * aggression
W3 = W3 - j_W3 * aggression
b1 = b1 - j_b1 * aggression
b2 = b2 - j_b2 * aggression
b3 = b3 - j_b3 * aggression



The next figures and animations show how the prediction curve (pink / orange) with the neural net approaches the original curve (green) as it’s trained longer.

With learning rate = 7

The next figures and animations visualize all the curves learnt at different iterations.

With learning rate = 1

We can change parameters of the steepest descent algorithm, e.g., how aggressive the learning step is, and how much noise to add. The next figure shows the actual and the predicted curves for a few different values of the paramaters.

We can compute the model error (sum of square deviation in between the actual and predicted outputs) for different values of the parameters. The next heatmap shows how the model error varies with the aggression and noise parameter values.

## 3. The Kernel Perceptron

This problem appeared in an assignment in the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta).

We need to implement the Kernel Perceptron algorithm to classify some datasets that are not linearly separable. The algorithm should allow kernels like the quadratic and RBF kernel.

The next figure describes the theory and the algorithm (in dual form) for Kernel Perceptron. As can be seen, the kernel trick can be used both at the training and the prediction time to avoid basis expansion (by replacing the dot products of the expanded feature vectors with a Mercer Kernel).

The datasets on which we are going to classify with the dual perceptron algorithm are 2-dimensional datasets, each datapoint with a label +1 or -1, the first few datapoints of a dataset is shown below:

X1 X2 y
0 1.0 1.0 1.0
1 2.0 1.0 1.0
2 3.0 1.0 1.0
3 4.0 1.0 1.0
4 5.0 1.0 1.0

The ground-truth of the data-points are represented by their color (red and black) and marker type (circle and triangle respectively).

The data-point that is mis-classified in a particular iteration is shown in blue.

When a mis-classified point is selected, the corresponding alpha value is up-voted, this is indicated by increase in the size of the data-point.

The decision boundary for the two classes are shown with green and magenta colors, respectively.

The following figures and animations show the classification of the datasets using kernel perceptron with RBF and quadratic kernels. The next python code snippet implements the kernel functions.


import numpy as np
def kernel(x, z, type, s):
if type == 'rbf':
return np.exp(-np.dot(x-z, x-z)/s**2)
return (1 + np.dot(x, z))**2
return np.dot(x, z)



## Results

The next figures / animations show the result of classification with a python implementation of the (Dual) Kernel Perceptron Algorithm.

Dataset 1

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

Dataset 2

Results with RBF Kernel

Dataset 3

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

• C is the setting of the soft-margin parameter C (default: 1.0)
• s (for the RBF kernel) is the scaling parameter s (default: 1.0)

Dataset 4

Results with RBF Kernel

Dataset 5

Kernel Perceptron algorithm does not converge on this dataset with quadratic kernel. The following animation shows the convergence of the algorithm and decision boundary found with gaussian kernel.

## Results with Kernel SVM Classifier (sklearn)

The following code and the figures show the decision boundaries and the support vectors (datapoints with larger size) learnt with sklearn SVC.


from sklearn.svm import SVC
x = data[:,0:2]
y = data[:,2]
clf = SVC(kernel=kernel, C=C, kernel = 'rbf', gamma=1.0/(s*s))
clf.fit(x,y)
clf.support_



Dataset 1

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 2

With polynomial kernel (degree=2, C=1)

With RBF kernel (C=10, σ = 10)

Dataset 3

With RBF kernel (C=10, σ = 10)

## 4. Models for handwritten digit classification

This problem is taken from a few assignments from the edX course Machine Learning Fundamentals by UCSD (by Prof. Sanjay Dasgupta). The problem description is taken from the course itself.

In this assignment we will build a few classifiers that take an image of a handwritten digit and outputs a label 0-9. We will start with a particularly simple strategy for this problem known as the nearest neighbor classifier, then a Gaussian generative model for classification will be built and finally an SVM model will be used for classification.

### The MNIST dataset

MNIST is a classic dataset in machine learning, consisting of 28×28 gray-scale images handwritten digits. The original training set contains 60,000 examples and the test set contains 10,000 examples. Here we will be working with a subset of this data: a training set of 7,500 examples and a test set of 1,000 examples.

The following figure shows the first 25 digits from the training dataset along with the labels.

Similarly, the following figure shows the first 25 digits from the test dataset along with the ground truth labels.

## Nearest neighbor for handwritten digit recognition

### Squared Euclidean distance

To compute nearest neighbors in our data set, we need to first be able to compute distances between data points. A natural distance function is Euclidean distance: for two vectors x∈ ℝ^d, their Euclidean distance is defined as

Often we omit the square root, and simply compute squared Euclidean distance.

For the purposes of nearest neighbor computations, the two are equivalent: for three vectors xy∈ ℝ^d, we have xyxz if and only if xy∥^2xz∥^2.

The following python function squared Euclidean distance.


## Computes squared Euclidean distance between two vectors.
def squared_dist(x,y):
return np.sum(np.square(x-y)



### Computing nearest neighbors

Now that we have a distance function defined, we can now turn to (1-) nearest neighbor classification, with the following naive implementation with 0 training / pre-processing time.

## Takes a vector x and returns the index of its nearest neighbor in train_data
def find_NN(x):
# Compute distances from x to every row in train_data
distances = [squared_dist(x,train_data[i,]) for i in range(len(train_labels))]
# Get the index of the smallest distance
return np.argmin(distances)

## Takes a vector x and returns the class of its nearest neighbor in train_data
def NN_classifier(x):
# Get the index of the the nearest neighbor
index = find_NN(x)
# Return its class
return train_labels[index]


The following figure shows a test example correctly classified by finding the nearest training example and another incorrectly classified.

### Processing the full test set

Now let’s apply our nearest neighbor classifier over the full data set.

Note that to classify each test point, our code takes a full pass over each of the 7500 training examples. Thus we should not expect testing to be very fast.


## Predict on each test data point (and time it!)
t_before = time.time()
test_predictions = [NN_classifier(test_data[i,]) for i in range(len(test_labels))]
t_after = time.time()

## Compute the error
err_positions = np.not_equal(test_predictions, test_labels)
error = float(np.sum(err_positions))/len(test_labels)

print("Error of nearest neighbor classifier: ", error)
print("Classification time (seconds): ", t_after - t_before)



(‘Error of nearest neighbor classifier: ‘, 0.046)
(‘Classification time (seconds): ‘, 41.04900002479553)

The next figure shows the confusion matrix for classification

### Faster nearest neighbor methods

Performing nearest neighbor classification in the way we have presented requires a full pass through the training set in order to classify a single point. If there are N training points in ℝ^d, this takes O(Nd) time.

Fortunately, there are faster methods to perform nearest neighbor look up if we are willing to spend some time pre-processing the training set. scikit-learnhas fast implementations of two useful nearest neighbor data structures: the ball tree and the k-d tree.


from sklearn.neighbors import BallTree

## Build nearest neighbor structure on training data
t_before = time.time()
ball_tree = BallTree(train_data)
t_after = time.time()

## Compute training time
t_training = t_after - t_before
print("Time to build data structure (seconds): ", t_training)

## Get nearest neighbor predictions on testing data
t_before = time.time()
test_neighbors = np.squeeze(ball_tree.query(test_data, k=1, return_distance=False))
ball_tree_predictions = train_labels[test_neighbors]
t_after = time.time()

## Compute testing time
t_testing = t_after - t_before
print("Time to classify test set (seconds): ", t_testing)



(‘Time to build data structure (seconds): ‘, 0.3269999027252197)
(‘Time to classify test set (seconds): ‘, 6.457000017166138)

similarly, with the KdTree data structure we have the following runtime:

(‘Time to build data structure (seconds): ‘, 0.2889997959136963)
(‘Time to classify test set (seconds): ‘, 7.982000112533569)

Next let’s use sklearn’s KNeighborsClassifier to compare with the runtimes.


from sklearn.neighbors import KNeighborsClassifier
neigh = KNeighborsClassifier(n_neighbors=1)
neigh.fit(train_data, train_labels)
predictions = neigh.predict(test_data)



(‘Training Time (seconds): ‘, 0.2999999523162842)
(‘Time to classify test set (seconds): ‘, 8.273000001907349)

The next figure shows the error rate on the test dataset with k-NearestNeighbor classifier with different values of k.

Training the 1-NN classifier on the entire training dataset with 60k images and testing on the entire testset with 10k images yields the the following results:

(‘Training Time (seconds): ‘, 19.694000005722046)
(‘Time to classify test set (seconds): ‘, 707.7590000629425)

with the following accuracy on the test dataset and the confusion matrix:

accuracy: 0.9691 (error 3.09%)

## Gaussian generative models for handwritten digit classification

Recall that the 1-NN classifier yielded a 3.09% test error rate on the MNIST data set of handwritten digits. We will now see that a Gaussian generative model does almost as well, while being significantly faster and more compact.

For this assignment we shall be using the entire MNIST dataset, the training dataset contains 60k images and the test dataset contains 10k images.

### Fit a Gaussian generative model to the training data

The following figure taken from the lecture videos from the same course describes the basic theory.

Let’s Define a function, fit_generative_model, that takes as input a training set (data x and labels y) and fits a Gaussian generative model to it. It should return the parameters of this generative model; for each label j = 0,1,...,9, we have:

• pi[j]: the frequency of that label
• mu[j]: the 784-dimensional mean vector
• sigma[j]: the 784×784 covariance matrix

This means that pi is 10×1, mu is 10×784, and sigma is 10x784x784.

We need to fit a Gaussian generative model. The parameters pi, mu and sigma are computed with corresponding maximum likelihood estimates (MLE) values: empirical count, mean and covariance matrix for each of the class labels from the data. However, now there is an added ingredient.

The empirical covariances are very likely to be singular (or close to singular), which means that we won’t be able to do calculations with them. Thus it is important to regularize these matrices. The standard way of doing this is to add cI to them, where c is some constant and I is the 784-dimensional identity matrix. (To put it another way, we compute the empirical covariances and then increase their diagonal entries by some constant c).

This modification is guaranteed to yield covariance matrices that are non-singular, for any c > 0, no matter how small. But this doesn’t mean that we should make c as small as possible. Indeed, c is now a parameter, and by setting it appropriately, we can improve the performance of the model. We will study regularization in greater detail over the coming weeks.

The following python code snippet shows the function:


def fit_generative_model(x,y):
k = 10 # labels 0,1,...,k-1
d = (x.shape)[1] # number of features
mu = np.zeros((k,d))
sigma = np.zeros((k,d,d))
pi = np.zeros(k)
c = 3500 # regularizer
for label in range(k):
indices = (y == label)
pi[label] = ... # empirical count
mu[label] = ... # empirical mean
sigma[label] = ... # empirical regularized covariance matrix
return mu, sigma, pi



Now let”s visualize the means of the Gaussians for the digits.

Time taken to fit the generative model (in seconds) : 2.60100007057

### Make predictions on test data

Now let’s see how many errors the generative model makes on the test set.
The model makes 438 errors out of 10000 (test accuracy: 95.562%)
Time taken to classify the test data (in seconds): 19.5959999561

The following figure shows the confusion matrix.

## SVM for handwritten digit classification

The entire training dataset from the MNIST dataset  is used to train the SVM model, the training dataset contains 60k images and the test dataset contains 10k images.

First let’s try linear SVM, the following python code:


from sklearn.svm import LinearSVC
clf = LinearSVC(C=C, loss='hinge')
clf.fit(train_data,train_labels)
score = clf.score(test_data,test_labels)



The following figure shows the training and test accuracies of LinearSVC with different values of the hyper-parameter C.

Next let’s try SVM with quadratic kernel, as can be seen it gives 98.06% accuracy on the test dataset with C=1.


from sklearn.svm import SVC
clf = SVC(C=1., kernel='poly', degree=2)
clf.fit(train_data,train_labels)
print clf.score(train_data,train_labels)
print clf.score(test_data,test_labels)



training accuracy: 1.0
test accuracy: 0.9806 (error: 1.94%)

The following figure shows the confusion matrix:

# Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

• First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
• Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
• Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
• The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

• Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
• The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
• Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
• The learning rate η is gradually decreased with iteration.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:


class SVMPegasos(LinearClassifier):
"""Implementation of SVM with SGD with PEGASOS Algorithm"""

def __init__(self, n_iter=10, lambda1=1):
self.n_iter = n_iter
self.lambda1 = lambda1

def fit(self, X, Y):
Y = list(Y)
self.find_classes(Y)
# convert all output values to +1 or -1
Yn = [sign(y, self.positive_class) for y in Y]
X = X.toarray()
m, n_features = X.shape[0], X.shape[1]
self.w = numpy.zeros( n_features )
for i in range(self.n_iter):
eta = 1. / (self.lambda1*(i+1))
j = numpy.random.choice(m, 1)[0]
x, y = X[j], Yn[j]
score = self.w.dot(x)
if y*score < 1:
self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
else:
self.w = (1 - eta*self.lambda1)*self.w



### Some Notes

• The optional projection step has been left out (the line in square brackets in the paper).
•  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
•  The number η is the step length in gradient descent.
• The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
• Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
• Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
• Classification accuracy obtained on the test split was 95%.




## 2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.

### Dataset 1

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
• Classification accuracy obtained on the test split was 95.8%.




### Dataset 2

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
• Classification accuracy obtained on the test split was 100%.




## 3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
• Classification accuracy obtained on the test split was 97.5%.




## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

• Create sklearn Pipeline to
• Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
• Use sklearn SelectKBest to choose top 5000 features
• Train / test the model
• Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
• Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
• Also Record the time taken to train the classifier.
• Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.


def classify_with_svm():
# split into training and test parts
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
random_state=0)
classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
tokenizer = lambda x: x)),
('fs', SelectKBest(k=5000)),
('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
t0 = time.time()
classifier.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time:', t1-t0, 'seconds.')
Yguess = classifier.predict(Xtest)
print('accuracy:', accuracy_score(Ytest, Yguess))



### Some Notes

• The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
• This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
• The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
• For Perceptron the entire training data was passed 10 times to the model while training.
• For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.

## Results

As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.

# Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and python respectively) and application on a few datasets are going to be discussed.  The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

## SVM in a nutshell

• Given a (training) dataset consisting of positive and negative class instances.
• Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
• If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
• In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
• A slack variable per training point is introduced to include the classification errors (for the miss-classified points in the training dataset) in the objective, this can also be thought of adding regularization.
• The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
• It is easier to solve the optimization problem in the dual rather than the primal space, since there are less number of variables.
• Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
• If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensions only by computing the (kernel) gram matrix /  dot-product of the datapoints (the matrix needs to positive semi-definite as per Mercer’s theorem).
• Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
• The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-liear decision boundary, SVs correspond to the non-zero values of the dual variable / the primal Lagrange multipler), that’s why the name SVM.
• Once the dual optimization problem is solved ,  the values of the primal variables are computed to construct the hyper-plane / decision surface.
• Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
• The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.

The following figure describes the soft-margin SVM in a more formal way.

The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).

The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).

The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

library(quadprog)
library(Matrix)
linear.kernel <- function(x1, x2) {
return (x1%*%x2)
}
svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
n.samples <- nrow(X)
n.features <- ncol(X)
# Gram matrix
K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
for (i in 1:n.samples){
for (j in 1:n.samples){
K[i,j] <- FUN(X[i,], X[j,])
}
}
Dmat <- outer(y,y) * K
Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to nearest pd matrix dvec <- rep(1, n.samples) if (!is.null(C)) { # soft-margin Amat <- rbind(y, diag(n.samples), -1*diag(n.samples)) bvec <- c(0, rep(0, n.samples), rep(-C, n.samples)) } else { # hard-margin Amat <- rbind(y, diag(n.samples)) bvec <- c(0, rep(0, n.samples)) } res <- solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1) a = res$solution # Lagrange multipliers
# Support vectors have non-zero Lagrange multipliers
# ...
}


The following python code snippet adapted from here and from  Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

import numpy as np
import cvxopt
def fit(X, y, kernel, C):
n_samples, n_features = X.shape
# Compute the Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
for j in range(n_samples):
K[i,j] = kernel(X[i], X[j])
# construct P, q, A, b, G, h matrices for CVXOPT
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
if C is None:      # hard-margin SVM
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
else:              # soft-margin SVM
G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1), np.identity(n_samples))))
h = cvxopt.matrix(np.hstack((np.zeros(n_samples), np.ones(n_samples) * C)))
# solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
# Lagrange multipliers
a = np.ravel(solution['x'])
# Support vectors have non zero lagrange multipliers
sv = a > 1e-5 # some small threshold
# ...


## Notes

1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
3. Although python CVXOPT will allow very small numerical errors in P matrix with a warning message, R quardprog will strictly require that the Dmat matrix is strictly positive definite, otherwise it will fail.
4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
6. Cases corresponding to the hard and soft-margin SVMs must be handled separately, otherwise it will lead to inconsistent system of solutions.

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, the 80-20 Validation on the dataset is used to

• First fit (train) the model on randomly selected 80% samples of the dataset.
• Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
• Different values of the hyper-parameter C and different kernels are used.
• For the polynomial kernel, polynomial of degree 3 is used and the RBF kernel with the standard deviation of 5 is used, although these hyper-parameters can be tuned too.

### Results

As can be seen from the results below,

• The points with blue circles are the support vectors.
• When the C value is low (close to hard-margin SVM), the model learnt tends to overfit the training data.
• When the C value is high (close to soft-margin SVM), the model learnt tends to be more generalizable (C acts as a regularizer).
• There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
• The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
• The decision boundaries are also shown.