Some Applications of Markov Chain in Python

In this article, a few simple applications of Markov chains to text-processing problems are discussed. These problems appeared as assignments in a few courses; the descriptions are taken directly from the courses themselves.

1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course COS 126. The assignment was developed by Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...
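Fitting an order-0 model amounts to counting letters and sampling from the resulting distribution. A minimal sketch in Python (variable names here are illustrative):

```python
from collections import Counter
import random

text = "gagggagaggcgagaaa"
counts = Counter(text)                     # letter frequencies
n = len(text)
probs = {c: counts[c] / n for c in counts}
# e.g. probs['a'] == 7/17, probs['c'] == 1/17, probs['g'] == 9/17

# sample 40 letters independently from the fitted model
letters = sorted(probs)
sample = "".join(random.choices(letters, weights=[probs[c] for c in letters], k=40))
```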

A Markov model of order 0 assumes that each letter is chosen independently. This independence does not match the statistical properties of English text, because there is a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed by 'e' than by 'u', while 'q' is more likely to be followed by 'u' than by 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.
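The conditional probabilities of an order-k model are just normalized follow-up counts; using the hypothetical "th" counts from the paragraph above:

```python
# Counts of characters following the 2-gram "th" (100 occurrences in total,
# as in the example above)
follow_th = {'e': 60, 'i': 25, 'a': 10, 'o': 5}
total = sum(follow_th.values())
probs = {c: f / total for c, f in follow_th.items()}
# probs == {'e': 0.6, 'i': 0.25, 'a': 0.1, 'o': 0.05}
```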

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”

Our task is to write a Python program that automates this laborious task in a more efficient way, since Shannon’s brute-force approach is prohibitively slow when the input text is large.

Markov model data type. Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:


  • Constructor. To implement the data type, create a symbol table whose keys will be String k-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet, so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).
  • Order. Return the order k of the Markov Model.
  • Frequency. There are two frequency methods.
    • freq(kgram) returns the number of times the k-gram was found in the original text.
    • freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
  • Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
  • Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N² time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
                next char       next char is 
kgram   freq    a   c   g        a    c    g
 aa      2      1   0   1       1/2   0   1/2  
 ag      5      3   0   2       3/5   0   2/5  
 cg      1      1   0   0        1    0    0
 ga      5      1   0   4       1/5   0   4/5  
 gc      1      0   0   1        0    0    1  
 gg      3      1   1   1       1/3  1/3  1/3  
        17      7   1   9

Taken from an example from the same assignment.

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.
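The counts in the table above, including the circular wrap-around, can be reproduced with a few lines of Python:

```python
from collections import defaultdict

text, k = "gagggagaggcgagaaa", 2
circ = text + text[:k]                  # wrap the first k characters around
kgrams, tran = defaultdict(int), defaultdict(int)
for i in range(len(text)):
    kgrams[circ[i:i+k]] += 1            # count each k-gram
    tran[(circ[i:i+k], circ[i+k])] += 1 # count each (k-gram, next char) pair

# matches the table: freq("ag") == 5, freq("ga") == 5, freq("gg") == 3,
# and e.g. "ga" is followed by 'g' 4 times and by 'a' once
```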

A Markov chain is a stochastic process in which the next state depends only on the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/5      1/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2

Taken from an example from the same assignment.

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g
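The sliding-window state update is the heart of this simulation; given the nine characters chosen along the trajectory above, the output string is assembled as follows:

```python
kgram, out = "ga", "ga"            # initial state and output (k = 2)
for c in "ggcgagaag":              # characters chosen along the trajectory above
    kgram = kgram[1:] + c          # next state: drop the oldest character, append the new one
    out += c
# out == "gaggcgagaag", a string of length T = 11; final state "ag"
```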

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

A Python Implementation

import numpy as np
from collections import defaultdict

class MarkovModel:

    def __init__(self, text, k):
        """Create a Markov model of order k from the given text.
        Assume that text has length at least k."""
        self.k = k
        self.tran = defaultdict(float)   # (kgram, next char) -> count
        self.alph = list(set(text))      # alphabet of the text
        self.kgrams = defaultdict(int)   # kgram -> count
        n = len(text)
        text += text[:k]                 # treat the text as circular
        for i in range(n):
            self.tran[text[i:i+k], text[i+k]] += 1.
            self.kgrams[text[i:i+k]] += 1

    def order(self):                     # order k of the Markov model
        return self.k

    def freq(self, kgram):               # number of occurrences of kgram in the text
        assert len(kgram) == self.k      # kgram must be of length k
        return self.kgrams[kgram]

    def freq2(self, kgram, c):           # number of times character c follows kgram
        assert len(kgram) == self.k      # kgram must be of length k
        return self.tran[kgram, c]

    def rand(self, kgram):               # random character following the given kgram
        assert len(kgram) == self.k      # kgram must be of length k
        counts = np.array([self.tran[kgram, c] for c in self.alph])
        return np.random.choice(self.alph, 1, p=counts / counts.sum())[0]

    def gen(self, kgram, T):
        """Generate a string of length T by simulating a trajectory through
        the corresponding Markov chain. The first k characters of the
        generated string are the argument kgram. Assume that T >= k."""
        assert len(kgram) == self.k
        chars = list(kgram)              # build a list to avoid quadratic string concatenation
        for _ in range(T - self.k):
            c = self.rand(kgram)
            kgram = kgram[1:] + c        # slide the k-gram window
            chars.append(c)
        return ''.join(chars)

Some Results

m = MarkovModel('gagggagaggcgagaaa', 2)

builds the following Markov chain, where each state represents a 2-gram.



Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt


Generated output: random news item, using the input as an order-7 model


Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(self, corrupted):   # replace unknown characters with the most probable characters

Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.  This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1 windows to maximize the log-likelihood:
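One way to sketch this decoding step, using standalone helper functions in place of the MarkovModel methods (the function names and structure here are illustrative, not the assignment's reference solution):

```python
from collections import defaultdict
from math import log

def build_counts(text, k):
    # k-gram and transition counts, treating the text as circular
    circ = text + text[:k]
    kgrams, tran = defaultdict(int), defaultdict(int)
    for i in range(len(text)):
        kgrams[circ[i:i+k]] += 1
        tran[(circ[i:i+k], circ[i+k])] += 1
    return kgrams, tran

def replace_unknown(corrupted, text, k, unknown='~'):
    # For every '~', try each candidate character c and score it by the
    # log-probability of the k+1 windows that contain it; keep the argmax.
    kgrams, tran = build_counts(text, k)
    alphabet = sorted(set(text))
    out = list(corrupted)
    for i, ch in enumerate(corrupted):
        if ch != unknown:
            continue
        best, best_ll = None, float('-inf')
        for c in alphabet:
            context = corrupted[i-k:i] + c + corrupted[i+1:i+k+1]
            ll = 0.0
            for j in range(k + 1):           # the k+1 windows covering position i
                kgram, nxt = context[j:j+k], context[j+k]
                if tran[(kgram, nxt)] == 0:  # transition never seen in training text
                    ll = float('-inf')
                    break
                ll += log(tran[(kgram, nxt)] / kgrams[kgram])
            if ll > best_ll:
                best, best_ll = c, ll
        if best is not None:
            out[i] = best
    return ''.join(out)
```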


Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.
Noisy :      it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.


2. Detecting authorship

This problem appeared as an assignment in the Cornell course CS 1114.

The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when
given a large sample of text to train on, can then guess the author of an unknown
piece of text.
  • The algorithm to be implemented works based on the following idea: An author’s
    writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow – that is, which words he tends to use after other words.
  • To make things significantly simpler, we’re going to assume that the author always
    follows a given word with the same distribution of words. Of course, this isn’t true,
    since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.
  • In order to implement this model of an author’s writing style, we will use a Markov
    chain. A Markov chain is a set of states with the Markov property – that is, the
    probability of moving to the next state depends only on the current state, not on the states that came before it. This property matches our simplifying assumption that each word depends only on the word immediately preceding it.
  • A Markov chain can be represented as a directed graph. Each node is a state (words,
    in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.
  • The edges, in this case, represent the probability that word j follows word i from the
    given author. This means, of course, that the sum of the weights of all edges leaving
    from each word must add up to 1.
  • We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.
  • We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some excerpts whose authors we need to
    figure out.


Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We create a sparse matrix using scipy’s csr_matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains the word frequencies (normalized by the total number of words in the document, including repeated words).
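A sketch of this construction step, assuming the text has already been split into a list of words (the function and variable names here are illustrative):

```python
import numpy as np
from scipy.sparse import csr_matrix

def build_transition_matrix(words):
    # Map each distinct word to an index.
    vocab = {w: i for i, w in enumerate(sorted(set(words)))}
    n = len(vocab)
    # Count word-pair transitions; duplicate (row, col) entries are summed.
    rows = [vocab[w] for w in words[:-1]]
    cols = [vocab[w] for w in words[1:]]
    T = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n))
    # Normalize each row so the outgoing probabilities sum to 1.
    row_sums = np.asarray(T.sum(axis=1)).ravel()
    row_sums[row_sums == 0] = 1.0          # avoid division by zero for dead-end words
    T = csr_matrix(T.multiply(1.0 / row_sums[:, None]))
    # Word frequencies, normalized by the total number of words (with repeats).
    freq = np.bincount([vocab[w] for w in words], minlength=n) / len(words)
    return T, vocab, freq
```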

Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous, possibly very small, probabilities together, the
calculated likelihood will likely be extremely small. Thus, we should compare log-likelihoods instead. Keep in mind the possibility that the author may have used a word he has never used before; our calculated likelihood should not eliminate an author completely because of this, so we shall impose a high (but finite) penalty whenever a word is missing from an author’s training vocabulary.

Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. We will write this in a function classify_text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best_index. The following figure shows how to detect the author k (A_k) of the test text t_1..t_n using the transition matrix P_k with MLE:


Python Implementation

import numpy as np
from nltk import word_tokenize

def log0(x):
    return 0 if x <= 0 else np.log(x)

def compute_text_likelihood(filename, T, dict_rev, histogram, penalty=-12.):
    """Compute the log-likelihood of the text in `filename` given a word
    transition probability matrix T, a word-to-index mapping dict_rev, and
    a word-frequency histogram. The penalty value here is illustrative."""
    text = word_tokenize(open(filename).read().replace('\n', ' ').lower())
    num_words = len(text)
    # keep only the words that are found in the training dataset
    text = [word for word in text if word in histogram]
    num_matches = len(text)
    ll = log0(histogram[text[0]]) - log0(sum(histogram.values()))
    for i in range(1, len(text)):
        ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
    # impose a high penalty for every word not seen in training
    return ll + (num_words - num_matches) * penalty

def classify_text(tmatrices, dict_revs, histograms, filename):
    """Return the index of the most likely author, given the transition
    matrices, dictionaries, and histograms for each author."""
    lls = [compute_text_likelihood(filename, tmatrices[i], dict_revs[i],
                                   histograms[i])
           for i in range(len(tmatrices))]
    return int(np.argmax(lls))

Training Dataset

The training dataset contains texts from the following authors:

0. Austen
1. Carroll
2. Hamilton
3. Jay
4. Madison
5. Shakespeare
6. Thoreau
7. Twain


Below are a few excerpt lines from the training files, along with the word clouds and a few states of the corresponding Markov models constructed for some of the authors.

Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)


Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with




Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep




Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)


So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.

My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.




Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became
head superintendent; had a couple of thousand men under me.

Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.

When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.






Classifying unknown texts from the Test Dataset

Each of the Markov models learnt from the training texts is used to compute the log-likelihood of the unknown test text; the author with the maximum log-likelihood is chosen as the likely author of the text.
Unknown Text1 

Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.

log-likelihood values computed for the probable authors

Author  LL
0           -3126.5812874
1           -4127.9155186
2           -7364.15782346
3           -9381.06336055
4           -7493.78440066
5           -4837.98005673
6           -3515.44028659
7           -3455.85716104

As can be seen above, the maximum log-likelihood value corresponds to author 0, i.e., Austen. Hence, the most probable author of the unknown text is Austen.



Unknown Text2 

Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
had been separated. But now, this thing had actually and unquestionably
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:

“Brother, go find your brother!”

He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each

log-likelihood values computed for the probable authors

Author  LL
0             -2779.02810424
1             -2738.09304225
2             -5978.83684489
3             -6551.16571407
4             -5780.39620942
5             -4166.34886511
6             -2309.25043697
7             -2033.00112729

As can be seen above, the maximum log-likelihood value corresponds to author 7, i.e., Twain. Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states of the Markov model for Twain, trained from the training dataset.



Classifying a Face as Happy/Unhappy and Face Recognition using a Deep Convolutional Network with Keras in Python

In this article, a couple of problems are going to be discussed. Both problems appeared as assignments in the Coursera course Convolutional Neural Networks (a part of the deeplearning specialization) by the Stanford Prof. Andrew Ng. The problem descriptions are taken from the course itself.

1. Classifying a Face Image as Happy/Unhappy

  • Given:
    • 600 64×64 RGB (labeled) training images with labels 0 (not happy) and 1 (happy).
    • 150 (unlabeled) test images (with the ground truths provided separately).
  • Train a deep convolutional neural network model for binary classification.
  • Use the model to predict the labels of the test images and evaluate the model using the ground truth.


Details of the “Happy” dataset:

Images are of shape (64,64,3)
Training: 600 pictures
Test: 150 pictures

It is now time to solve the “Happy” Challenge.

We need to start by loading the following required packages.

import numpy as np
from keras import layers
from keras.layers import Input, Dense, Activation, ZeroPadding2D, BatchNormalization, Flatten, Conv2D
from keras.layers import AveragePooling2D, MaxPooling2D, Dropout, GlobalMaxPooling2D, GlobalAveragePooling2D
from keras.models import Model
from keras.preprocessing import image
from keras.utils import layer_utils
from keras.utils.data_utils import get_file
from keras.applications.imagenet_utils import preprocess_input
from keras.utils.vis_utils import model_to_dot
from keras.utils import plot_model
import keras.backend as K
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow

Then let’s normalize and load the dataset.

X_train_orig, Y_train_orig, X_test_orig, Y_test_orig, classes = load_dataset()

# Normalize image vectors
X_train = X_train_orig/255.
X_test = X_test_orig/255.

# Reshape
Y_train = Y_train_orig.T
Y_test = Y_test_orig.T

print("number of training examples = " + str(X_train.shape[0]))
print("number of test examples = " + str(X_test.shape[0]))
print("X_train shape: " + str(X_train.shape))
print("Y_train shape: " + str(Y_train.shape))
print("X_test shape: " + str(X_test.shape))
print("Y_test shape: " + str(Y_test.shape))

number of training examples = 600
number of test examples = 150
X_train shape: (600, 64, 64, 3)
Y_train shape: (600, 1)
X_test shape: (150, 64, 64, 3)
Y_test shape: (150, 1)

Now let’s find the number of labeled happy and unhappy faces in the training dataset.

print(X_train[Y_train.ravel()==1].shape, X_train[Y_train.ravel()==0].shape)

(300, 64, 64, 3) (300, 64, 64, 3)

As can be seen, there are equal numbers of positive and negative examples in the training dataset. The following figures show a few samples drawn from each class.



Building a model in Keras

Keras is very good for rapid prototyping. In just a short time we shall be able to build a model that achieves outstanding results.

Let's implement a HappyModel() with the following architecture:

def HappyModel(input_shape):
    """
    Implementation of the HappyModel.

    Arguments:
    input_shape -- shape of the images of the dataset

    Returns:
    model -- a Model() instance in Keras
    """
    # Define the input placeholder as a tensor with shape input_shape. Think of
    # this as our input image!
    X_input = Input(input_shape)

    # Zero-Padding: pads the border of X_input with zeroes
    X = ZeroPadding2D((3, 3))(X_input)

    # CONV -> BN -> RELU Block applied to X
    X = Conv2D(32, (7, 7), strides=(1, 1), name='conv0')(X)
    X = BatchNormalization(axis=3, name='bn0')(X)
    X = Activation('relu')(X)

    # MAXPOOL
    X = MaxPooling2D((2, 2), name='max_pool')(X)

    # FLATTEN X (means convert it to a vector) + FULLYCONNECTED
    X = Flatten()(X)
    X = Dense(1, activation='sigmoid', name='fc')(X)

    # Create model. This creates our Keras model instance; we'll use this
    # instance to train/test the model.
    model = Model(inputs=X_input, outputs=X, name='HappyModel')

    return model


Step 1: Let’s first create the model.

happyModel = HappyModel((64,64,3))

Step 2: Compile the model to configure the learning process, keeping in view that the Happy Challenge is a binary classification problem.

happyModel.compile(optimizer="Adam", loss="binary_crossentropy", metrics=["accuracy"])

Step 3: Train the model. Choose the number of epochs and the batch size.

happyModel.fit(x=X_train, y=Y_train, epochs=20, batch_size=32)

Epoch 1/20
600/600 [==============================] – 6s – loss: 1.0961 – acc: 0.6750
Epoch 2/20
600/600 [==============================] – 7s – loss: 0.4198 – acc: 0.8250
Epoch 3/20
600/600 [==============================] – 8s – loss: 0.1933 – acc: 0.9250
Epoch 4/20
600/600 [==============================] – 7s – loss: 0.1165 – acc: 0.9567
Epoch 5/20
600/600 [==============================] – 6s – loss: 0.1224 – acc: 0.9500
Epoch 6/20
600/600 [==============================] – 6s – loss: 0.0970 – acc: 0.9667
Epoch 7/20
600/600 [==============================] – 7s – loss: 0.0639 – acc: 0.9850
Epoch 8/20
600/600 [==============================] – 7s – loss: 0.0841 – acc: 0.9700
Epoch 9/20
600/600 [==============================] – 8s – loss: 0.0934 – acc: 0.9733
Epoch 10/20
600/600 [==============================] – 7s – loss: 0.0677 – acc: 0.9767
Epoch 11/20
600/600 [==============================] – 6s – loss: 0.0705 – acc: 0.9650
Epoch 12/20
600/600 [==============================] – 7s – loss: 0.0548 – acc: 0.9783
Epoch 13/20
600/600 [==============================] – 7s – loss: 0.0533 – acc: 0.9800
Epoch 14/20
600/600 [==============================] – 7s – loss: 0.0517 – acc: 0.9850
Epoch 15/20
600/600 [==============================] – 7s – loss: 0.0665 – acc: 0.9750
Epoch 16/20
600/600 [==============================] – 7s – loss: 0.0273 – acc: 0.9917
Epoch 17/20
600/600 [==============================] – 7s – loss: 0.0291 – acc: 0.9933
Epoch 18/20
600/600 [==============================] – 6s – loss: 0.0245 – acc: 0.9917
Epoch 19/20
600/600 [==============================] – 7s – loss: 0.0376 – acc: 0.9883
Epoch 20/20
600/600 [==============================] – 7s – loss: 0.0440 – acc: 0.9917

Note that if we run fit() again, the model will continue to train with the parameters it has already learnt instead of re-initializing them.

Step 4: Test/evaluate the model.

preds = happyModel.evaluate(x=X_test, y=Y_test)
print("Loss = " + str(preds[0]))
print("Test Accuracy = " + str(preds[1]))

150/150 [==============================] – 0s

Loss = 0.167731122573
Test Accuracy = 0.94666667064

As can be seen, our model gets around 95% test accuracy in 20 epochs (and 99% train accuracy).

Test with my own image

Let’s test on my own image to see how well the model generalizes on unseen face images.

img_path = 'me_happy.png'
img = image.load_img(img_path, target_size=(64, 64))

x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
print(happyModel.predict(x))

[[ 1.]]             # Happy !

Model Summary




2.  Face Recognition with Deep Neural Net

Face recognition problems commonly fall into two categories:

  1. Face Verification – “is this the claimed person?”. For example, at some airports, one can pass through customs by letting a system scan the passport and then verify that the person carrying it is the correct person. A mobile phone that unlocks using your face also uses face verification. This is a 1:1 matching problem.
  2. Face Recognition – “who is this person?”. For example, this video of Baidu employees entering the office without needing to otherwise identify themselves is an example of face recognition. This is a 1:K matching problem.

FaceNet learns a neural network that encodes a face image into a vector of 128 numbers. By comparing two such vectors, we can then determine if two pictures are of the same person.

In this assignment, we shall:

  • Implement the triplet loss function
  • Use a pretrained model to map face images into 128-dimensional encodings
  • Use these encodings to perform face verification and face recognition

In this exercise, we will be using a pre-trained model which represents ConvNet activations using a “channels first” convention, as opposed to the “channels last” convention.

In other words, a batch of images will be of shape (m,n_C,n_H,n_W) instead of (m,n_H,n_W,n_C). Both of these conventions have a reasonable amount of traction among open-source implementations; there isn’t a uniform standard yet within the deep learning community.

Naive Face Verification

In Face Verification, we’re given two images and we have to tell if they are of the same person. The simplest way to do this is to compare the two images pixel-by-pixel: if the distance between the raw images is less than a chosen threshold, it may be the same person!
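A minimal sketch of this naive baseline; the threshold value here is arbitrary, chosen only for illustration:

```python
import numpy as np

def naive_verify(img1, img2, threshold=50.0):
    """Compare two images pixel-by-pixel: call them the same person when the
    L2 distance between the raw pixel arrays is below a chosen threshold.
    (This is the naive baseline; lighting and pose changes break it.)"""
    dist = np.linalg.norm(img1.astype(float) - img2.astype(float))
    return dist < threshold, dist
```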


Of course, this algorithm performs really poorly, since the pixel values change dramatically due to variations in lighting, orientation of the person’s face, even minor changes in head position, and so on.

We’ll see that rather than using the raw image, we can learn an encoding f(img) so that element-wise comparisons of this encoding gives more accurate judgments as to whether two pictures are of the same person.

Encoding face images into a 128-dimensional vector

Using a ConvNet to compute encodings

The FaceNet model takes a lot of data and a long time to train. So, following common practice in applied deep learning settings, let’s just load weights that someone else has already trained. The network architecture follows the Inception model from Szegedy et al. We are going to use an inception network implementation.

This network uses 96×96 dimensional RGB images as its input. Specifically, it inputs a face image (or a batch of m face images) as a tensor of shape (m, n_C, n_H, n_W) = (m, 3, 96, 96), and outputs a matrix of shape (m, 128) that encodes each input face image as a 128-dimensional vector.

Let’s create the model for face images.

FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print("Total Params:", FRmodel.count_params())

Total Params: 3743280

By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. We then use the encodings to compare two face images as follows:


By computing a distance between two encodings and thresholding, we can determine if the two pictures represent the same person.

So, an encoding is a good one if:

  • The encodings of two images of the same person are quite similar to each other
  • The encodings of two images of different persons are very different

The triplet loss function formalizes this, and tries to “push” the encodings of two images of the same person (Anchor and Positive) closer together, while “pulling” the encodings of two images of different persons (Anchor, Negative) further apart.


In the next part, we will call the pictures from left to right: Anchor (A), Positive (P), Negative (N).
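The triplet loss described above can be sketched in NumPy as follows. This is an illustration of the formula, not the assignment's code: the course implements the same quantity with TensorFlow ops, and the margin α = 0.2 here is just the common default.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Triplet loss over a batch of encodings of shape (m, 128):
    sum over the batch of max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0),
    pushing anchor-positive pairs together and anchor-negative pairs apart."""
    pos_dist = np.sum((anchor - positive) ** 2, axis=-1)  # ||f(A)-f(P)||^2
    neg_dist = np.sum((anchor - negative) ** 2, axis=-1)  # ||f(A)-f(N)||^2
    return np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0))
```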



Loading the trained model

FaceNet is trained by minimizing the triplet loss. But since training requires a lot of data and a lot of computation, we won’t train it from scratch here. Instead, we load a previously trained model using the following code; this might take a couple of minutes to run.

FRmodel.compile(optimizer='adam', loss=triplet_loss, metrics=['accuracy'])

Here is the summary of the very deep inception network:


The next figure shows what the pre-trained deep inception network looks like:


Here are some examples of distances between the encodings of three individuals:

Let’s now use this model to perform face verification and face recognition!

Applying the model

Face Verification

Let’s build a database containing one encoding vector for each person. To generate the encoding we use img_to_encoding(image_path, model) which basically runs the forward propagation of the model on the specified image.

Let’s build a database to map each person’s name to a 128-dimensional encoding of their face.

Now this can be used in an automated employee verification at the gate in an office in the following way: when someone shows up at the front door and swipes their ID card (thus giving us their name), we can look up their encoding in the database, and use it to check if the person standing at the front door matches the name on the ID.

Let’s implement the verify() function, which checks if the front-door camera picture (image_path) is actually the person called “identity”. We shall have to go through the following steps:

  • Compute the encoding of the image from image_path
  • Compute the distance between this encoding and the encoding of the identity image stored in the database
  • Open the door if the distance is less than the threshold 0.7; otherwise do not open.

As presented above, we are going to use the L2 distance (np.linalg.norm).

def verify(image_path, identity, database, model):
    """
    Function that verifies if the person on the "image_path" image is "identity".

    Arguments:
    image_path -- path to an image
    identity -- string, name of the person whose identity you'd like to verify.
                Has to be a resident of the Happy house.
    database -- python dictionary mapping names of allowed people (strings) to
                their encodings (vectors).
    model -- your Inception model instance in Keras

    Returns:
    dist -- distance between the image_path image and the image of "identity"
            in the database.
    door_open -- True if the door should open, False otherwise.
    """
    ### CODE HERE ###

    return dist, door_open

Younes is trying to enter the office and the camera takes a picture of him (“camera_0.jpg”). Let’s run the above verification algorithm on this picture and compare it with the one stored in the system (image_path):


verify("camera_0.jpg", "younes", database, FRmodel)

# output
It's younes, welcome home!
(0.67291224, True)

Benoit has been banned from the office and removed from the database. He stole Kian’s ID card and came back to the house to try to present himself as Kian. The front-door camera took a picture of Benoit (“camera_2.jpg”). Let’s run the verification algorithm to check if Benoit can enter.


verify("camera_2.jpg", "kian", database, FRmodel)

# output
It's not kian, please go away
(0.86543155, False)

Face Recognition

In this case, we need to implement a face recognition system that takes as input an image, and figures out if it is one of the authorized persons (and if so, who). Unlike the previous face verification system, we will no longer get a person’s name as another input.

Implement who_is_it(). We shall have to go through the following steps:

  • Compute the target encoding of the image from image_path
  • Find the encoding from the database that has the smallest distance to the target encoding:
  • Initialize the min_dist variable to a large enough number (100); it keeps track of the closest encoding to the input’s encoding.
  • Loop over the database dictionary’s names and encodings, using for (name, db_enc) in database.items().
  • Compute the L2 distance between the target “encoding” and the current “encoding” from the database.
  • If this distance is less than min_dist, then set min_dist to dist and identity to name.


def who_is_it(image_path, database, model):
    """
    Implements face recognition for the happy house by finding who is the
    person on the image_path image.

    Arguments:
    image_path -- path to an image
    database -- database containing image encodings along with the name of
                the person on the image
    model -- your Inception model instance in Keras

    Returns:
    min_dist -- the minimum distance between the image_path encoding and the
                encodings from the database
    identity -- string, the name prediction for the person on image_path
    """
    ### CODE HERE ###

    return min_dist, identity

Younes is at the front-door and the camera takes a picture of him (“camera_0.jpg”). Let’s see if our who_is_it() algorithm identifies Younes.

who_is_it("camera_0.jpg", database, FRmodel)

# output
it's younes, the distance is 0.672912
(0.67291224, 'younes')

We can change “camera_0.jpg” (picture of younes) to “camera_1.jpg” (picture of bertrand) and see the result.

who_is_it("camera_1.jpg", database, FRmodel)

# output
it's bertrand, the distance is 0.474829
(0.47482917, 'bertrand')

Here is the takeaway:

  • Face verification solves an easier 1:1 matching problem; face recognition addresses a harder 1:K matching problem.
  • The triplet loss is an effective loss function for training a neural network to learn an encoding of a face image.
  • The same encoding can be used for verification and recognition. Measuring distances between two images’ encodings allows you to determine whether they are pictures of the same person.



  • Florian Schroff, Dmitry Kalenichenko, James Philbin (2015). FaceNet: A Unified Embedding for Face Recognition and Clustering.
  • Yaniv Taigman, Ming Yang, Marc’Aurelio Ranzato, Lior Wolf (2014). DeepFace: Closing the Gap to Human-Level Performance in Face Verification.
  • The pretrained model we use is inspired by Victor Sy Wang’s implementation and was loaded using his code.
  • The implementation by Ng et al. also took a lot of inspiration from the official FaceNet GitHub repository.


EigenFaces and A Simple Face Detector with PCA/SVD in Python

In this article, a few problems will be discussed that are related to face reconstruction and rudimentary face detection using eigenfaces (we are not going to discuss more sophisticated face detection algorithms such as Viola-Jones or DeepFace).

1. Eigenfaces

This problem appeared as an assignment in the edX course Analytics for Computing (by Georgia Tech). The following description of the problem is taken straightaway from the assignment.

This is an application of principal components analysis (PCA) and k-means to the analysis of “familiar faces.”  We’ll also create a simple interactive visualization for exploring this dataset (using bokeh).


Solving the PCA problem

The following figure shows the basic algorithm to compute a PCA, the interactive visual demo of which appears here.


The dataset: Some familiar faces

The dataset consists of a bunch of images of people’s faces taken from the MIT Faces Recognition Project database. There are 3991 images in total. A randomly selected subset of 500 images is loaded; all the following experiments will be done on these images. They are shown below.


Preprocessing the images

To apply PCA, we’ll want to preprocess the images in various ways.

To begin with, it’s possible that the images come in all shapes and sizes. The following code figures out the largest height and width that are within the bounds of all the images.

import sys
import numpy as np
min_rows, min_cols = sys.maxsize, sys.maxsize
max_rows, max_cols = 0, 0
for (i, image) in enumerate(original_images):
    r, c = image.shape[0], image.shape[1]    
    min_rows = min(min_rows, r)
    max_rows = max(max_rows, r)
    min_cols = min(min_cols, c)
    max_cols = max(max_cols, c)
print("\n==> Least common image size:", min_rows, "x", min_cols, "pixels")
==> Least common image size: 128 x 128 pixels

  • Suppose the least common image size is r0×c0 pixels. Let’s crop each r×c image so that it is r0×c0 in size: if r > r0, crop out the extra rows, and if c > c0, center the columns of the image. Let’s store the output images in a 3-D Numpy array called images[:, :, :], where images[k, :, :] is the k-th image. The following code does exactly that.
def recenter(image, min_rows, min_cols):
    r, c = image.shape
    top, bot, left, right = 0, r, 0, c
    if r > min_rows:
        top = r - min_rows              # crop the extra rows from the top
    if c > min_cols:
        left = (c - min_cols) // 2      # center the columns
        right = left + min_cols
    return image[top:bot, left:right]

image0_recentered = recenter(image0, min_rows, min_cols)
fig, axs = plt.subplots(1, 2, figsize=(10, 5))
imshow_gray(image0, ax=axs[0])
imshow_gray(image0_recentered, ax=axs[1])

Let’s compute an “average” image, taken as the elementwise (pixelwise) mean over the images and display this “average face”.


Recall that PCA requires centered points. Let’s do that by subtracting the mean image from every image. The next figure shows a couple of images and the ones obtained after mean subtraction.
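The mean-subtraction step is a one-liner with NumPy broadcasting. A small random array stands in for the real `images` array here, just so the snippet runs on its own:

```python
import numpy as np

# `images` is the 3-D array of cropped faces built above; a random stand-in
# of the same layout (m, r0, c0) is used for illustration.
images = np.random.rand(10, 128, 128)

mean_face = images.mean(axis=0)   # the elementwise "average face"
centered = images - mean_face     # center every image for PCA
```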


From image set to a data matrix and back again

For PCA, we need a data matrix. Here is some code to convert our 3-D array of images into a 2-D data matrix, where we “flatten” each image into a 1-D vector by a simple reshape() operation.

# Create m x d data matrix
m = len(images)
d = min_rows * min_cols
X = np.reshape(images, (m, d))
# To get back to an image, just reshape it again
imshow_gray(np.reshape(X[int(len(X)/2), :], (min_rows, min_cols)))


Applying PCA

Let’s compute the SVD of X. Store the result in three arrays, U, Sigma, and VT, where U holds U, Sigma holds just the diagonal entries of Σ, and VT holds V^T.

U, Sigma, VT = np.linalg.svd(X, full_matrices=False)
# Sanity check on dimensions
print("X:", X.shape)
print("U:", U.shape)
print("Sigma:", Sigma.shape)
print("V^T:", VT.shape)
X: (500, 16384) 
U: (500, 500)
Sigma: (500,)
V^T: (500, 16384)

The following table and plot inspect the singular values, i.e., the entries of Σ stored in Sigma. The plot shows the singular values as dots, plotted at position x = i for the i-th singular value. To give a rough idea of how quickly the singular values decay, the plot includes a solid line showing the curve σ0/√(i+1).


Does the spectrum of these data decay quickly or slowly? How should that affect our choice of k, if we consider a k-truncated SVD?

The question is ill-defined and the answer is relative. In this case, a reasonable argument is that the spectrum actually decays somewhat slowly. Why? If we try fitting the singular values {σ_i} to functions of i, we’ll find that 1/√(i+1) is actually a pretty good fit. That is considered fairly slow decay; there would be significant compressibility if the curve dropped off exponentially (or at least superlinearly) instead.

Next, let’s plot the first few principal components. From what we computed above, each right singular vector has the same number of entries as there are pixels in an image. So, we could plot them as images by reshaping them. What do they appear to capture?


For example, the first (and also the third) principal component seems to capture a male face, the second (and fourth) one seems to capture a female face, and the fifth one captures a face with long hair.

Now let’s compute a new matrix Y, which is the original data matrix projected onto the first num_components principal components.

num_components = 5  # Number of principal components
Y = np.matmul(X, VT[:num_components,:].T)

The next plot visualizes the face datapoints projected on the first two principal components using bokeh, with the thumbnails as the face images.


Next, let’s run k-means on the projected data, Y[:m, :num_components], to try to identify up to num_clusters clusters.
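A minimal Lloyd's-algorithm stand-in for this clustering step (the notebook may well use a library routine such as scipy's or scikit-learn's; the naive evenly-strided initialization here is just for illustration):

```python
import numpy as np

def kmeans(Y, k, iters=50):
    """Plain Lloyd's algorithm on the projected data Y (shape m x num_components).
    Initialization: pick k evenly-strided rows of Y as starting centers."""
    centers = Y[::max(1, len(Y) // k)][:k].astype(float)
    labels = np.zeros(len(Y), dtype=int)
    for _ in range(iters):
        # assign each point to its nearest center
        dists = ((Y[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned points
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Y[labels == j].mean(axis=0)
    return labels, centers
```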

The next animation shows the interactive plot of the face data points projected on the first two principal components using bokeh, with the thumbnails as the face images.

For the 500 randomly selected faces, the green cluster seems to contain mostly the brighter faces, the red cluster mostly the darker faces, and the blue cluster mostly the younger population.


2. Face Reconstruction and A  Simple Face Detector

This problem appeared as projects in this CMU machine learning for signal processing course and also in this UW Computer Vision course. The description of the problem is taken straightaway from the course projects.

We are given a corpus of facial images [here] from the LFWCrop database. Each image in this corpus is 64 x 64 and grayscale. We need to learn a typical (i.e. Eigen) face from them. There are 1071 (.pgm) images in this dataset. A few of them are shown below.


Computing the EigenFaces

As described in the original paper from MIT and this case study from JHU, the following figure shows the theory / algorithm for how to efficiently compute the eigenfaces from the (training) images and then choose top k eigenfaces (eigenvectors).


The next 4 figures show the mean face, the percentage of variance captured by the eigenfaces, and the top 100 and top 400 eigenfaces computed from the images, respectively.


Projection on the EigenFaces and Reconstruction

Now, let’s project a face onto the face space to generate a vector of k coefficients, one for each of the k eigenfaces (for different values of k). Then let’s reconstruct the same face from the vector of coefficients computed.

The following figures and animations show how a given image can be reconstructed by projecting onto the first few dominant eigenfaces, and how the reconstruction error (residual) decreases quite fast as more and more eigenfaces are used for the projection.

  Original Face Image
  Reconstructed Face Image by projecting on the truncated eigenfaces space

The reconstruction error when the first k eigenfaces are used
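The projection-and-reconstruction step can be sketched as follows; this assumes flattened face vectors and unit-norm eigenfaces stored as rows (e.g. the rows of VT from an SVD), which is an assumption about the layout, not the course's exact code:

```python
import numpy as np

def reconstruct(face, mean_face, eigenfaces, k):
    """Project a flattened face onto the top-k eigenfaces and reconstruct it.
    Returns the reconstruction and the residual (reconstruction error) norm."""
    coeffs = eigenfaces[:k] @ (face - mean_face)     # k projection coefficients
    recon = mean_face + eigenfaces[:k].T @ coeffs    # map back to pixel space
    residual = np.linalg.norm(face - recon)          # reconstruction error
    return recon, residual
```

With an orthonormal basis, using all components reproduces the face exactly, and the residual shrinks as k grows.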


The same is shown for yet another image.


The next figure shows the images projected on and reconstructed from the first two eigenfaces. The similar images stay in close proximity whereas the dissimilar ones (w.r.t. projection on the first 2 dominant eigenvectors) are far apart in the 2D space.

Projection of a Human vs. a Non-Human Face (e.g. a Cat) on the EigenFaces Space

Now let’s reconstruct a non-human object (a cat) in the same space spanned by the top-k eigenfaces, for different values of k. As we can see from the figures below, the reconstruction error is much higher for the cat, as expected.




Face Detection

The following figure taken from the same JHU vision lab case study shows the threshold-based face detection algorithm with eigenfaces.


Now, we are also given four group photographs with multiple faces [here]. We need to use the Eigen face to detect the faces in these photos. One such group photo is shown below: the group photo of the rockstars from The Beatles.


The faces in the group photographs may have different sizes. We also need to account for these variations.

To detect faces in the image, we need to scan the group photo and identify all regions in it that “match” the pattern of the Eigen face the most. To “scan” the image to find matches against an N×M Eigen face, we must match every N×M region of the photo against the Eigen face.

The “match” between any N×M region of an image and an Eigen face is given by the normalized dot product between the Eigen face and the region of the image being evaluated. The normalized dot product between an N×M Eigen face and a corresponding N×M segment of the image is given by E·P/|P|, where E is the unrolled vector representation of the Eigen face, and P is the unrolled vector form of the N×M patch.
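The scanning step can be sketched as follows; match_scores is a hypothetical helper (not from the course) that slides the N×M window over the photo and computes E·P/|P| at every position:

```python
import numpy as np

def match_scores(photo, eigenface):
    """Score every N x M region of `photo` against the eigenface by the
    normalized dot product E . P / |P| (E = unrolled eigenface,
    P = unrolled patch); peaks in the score map indicate candidate faces."""
    N, M = eigenface.shape
    E = eigenface.ravel()
    H, W = photo.shape
    scores = np.zeros((H - N + 1, W - M + 1))
    for i in range(H - N + 1):
        for j in range(W - M + 1):
            P = photo[i:i+N, j:j+M].ravel()
            norm = np.linalg.norm(P)
            scores[i, j] = E @ P / norm if norm > 0 else 0.0
    return scores
```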

Scaling and Rotation

The Eigen face is fixed in size and can only be used to detect faces of approximately the same size as the Eigen face itself. On the other hand, faces in the group photos are of different sizes; they get smaller as the subject gets farther away from the camera.

The solution to this is to make many copies of the eigen face and match them all.

In order to make our detection system robust, we resize the Eigen face from 64×64 pixels to 32×32, 48×48, 96×96, and 128×128 pixels. Once we’ve scaled the Eigen face, we will have a total of five “typical” faces, one at each level of scaling. We need to scan the group pictures with all five of the Eigen faces. Each of them will give us a “match” score for each position on the image. If we simply locate the peaks in each of them, we may find all the faces. Sometimes multiple peaks will occur at the same position, or within a few pixels of one another; in these cases, we need to merge them (non-maximum suppression), since they probably all represent the same face.

Additional heuristics may also be required (appropriate setting of thresholds, comparison of peak values from different scaling factors, additional scaling, filtering by human body color thresholds etc.). These are for us to investigate.
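The peak-finding and merging step can be sketched as a greedy non-maximum suppression over the score map (the threshold and window size below are hypothetical parameters, not values from the assignment):

```python
import numpy as np

def find_faces(scores, threshold=0.8, window=8):
    """Greedy non-maximum suppression: repeatedly take the highest
    remaining peak above `threshold` and suppress all scores within
    `window` pixels of it, since they likely belong to the same face."""
    s = scores.copy()
    peaks = []
    while True:
        i, j = np.unravel_index(np.argmax(s), s.shape)
        if s[i, j] < threshold:
            break
        peaks.append((i, j, s[i, j]))
        # suppress the neighborhood of the detected peak
        s[max(0, i - window):i + window + 1,
          max(0, j - window):j + window + 1] = -np.inf
    return peaks
```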

The faces detected using window scanning and a threshold on the distance from the eigenfaces plane are shown below for the Beatles group image.


Face Morphing

Implement morphing. Given two images, compute an animation that shows one being transformed continuously into the other, using the eigenfaces representation. Also, try transforming an image “away” from another one, to exaggerate differences between two faces. Make a video from your morphed faces.

Let’s try to morph the first face to the second one.


Here is the algorithm that we are going to use:

  1. For the first face, first create a reconstruction using only a few (k) dominant eigenfaces.
  2. Iteratively reconstruct the first face using fewer and fewer eigenfaces and animate.
  3. Stop when the reconstruction of the first face uses only k eigenfaces.
  4. Now compute the coefficients of the second face using only those k eigenfaces.
  5. Next, start using the second face’s coefficients instead of the first face’s coefficients.
  6. Now iteratively increase the number of eigenfaces used to reconstruct the second face, until all the eigenfaces are used.
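The steps above can be sketched in NumPy as follows (the eigenfaces are assumed to be the rows of an orthonormal matrix and the images to be unrolled vectors; the function and variable names are illustrative):

```python
import numpy as np

def morph_frames(face1, face2, eigenfaces, mean_face, k=10):
    """Morph face1 into face2 via the eigenface coefficient space.
    `eigenfaces` is an (n, d) matrix whose rows are orthonormal."""
    # project each (mean-subtracted) face onto the eigenface basis
    c1 = eigenfaces @ (face1 - mean_face)
    c2 = eigenfaces @ (face2 - mean_face)
    n = len(eigenfaces)
    frames = []
    # steps 1-3: reconstruct face1 with n, n-1, ..., k eigenfaces
    for m in range(n, k - 1, -1):
        frames.append(mean_face + eigenfaces[:m].T @ c1[:m])
    # steps 4-6: switch to face2's coefficients and add eigenfaces back
    for m in range(k, n + 1):
        frames.append(mean_face + eigenfaces[:m].T @ c2[:m])
    return frames
```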

The following animation shows such morphing:


Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow in Python


This problem appeared as an assignment in the online Coursera course Convolutional Neural Networks by Prof. Andrew Ng. The description of the problem is taken straightaway from the assignment.

The Neural Style Transfer algorithm was created by Gatys et al. (2015); the paper can be found here.

In this assignment, we shall:

  • Implement the neural style transfer algorithm
  • Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we  shall optimize a cost function to get pixel values!

Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

  1. a “content” image (C) and
  2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).


Let’s see how we can do this.

Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low level features (at the earlier layers) and high level features (at the deeper layers). The following figure (taken from the google image search results) shows what a VGG-19 convolutional neural net looks like, without the last fully-connected (FC) layers.


We run the following code to load parameters from the pre-trained VGG-19 model serialized in a Matlab file. This takes a few seconds.

import pprint
model = load_vgg_model("imagenet-vgg-verydeep-19.mat")
pprint.pprint(model)

{'avgpool1': <tf.Tensor 'AvgPool_5:0' shape=(1, 150, 200, 64) dtype=float32>,
'avgpool2': <tf.Tensor 'AvgPool_6:0' shape=(1, 75, 100, 128) dtype=float32>,
'avgpool3': <tf.Tensor 'AvgPool_7:0' shape=(1, 38, 50, 256) dtype=float32>,
'avgpool4': <tf.Tensor 'AvgPool_8:0' shape=(1, 19, 25, 512) dtype=float32>,
'avgpool5': <tf.Tensor 'AvgPool_9:0' shape=(1, 10, 13, 512) dtype=float32>,
'conv1_1': <tf.Tensor 'Relu_16:0' shape=(1, 300, 400, 64) dtype=float32>,
'conv1_2': <tf.Tensor 'Relu_17:0' shape=(1, 300, 400, 64) dtype=float32>,
'conv2_1': <tf.Tensor 'Relu_18:0' shape=(1, 150, 200, 128) dtype=float32>,
'conv2_2': <tf.Tensor 'Relu_19:0' shape=(1, 150, 200, 128) dtype=float32>,
'conv3_1': <tf.Tensor 'Relu_20:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_2': <tf.Tensor 'Relu_21:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_3': <tf.Tensor 'Relu_22:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv3_4': <tf.Tensor 'Relu_23:0' shape=(1, 75, 100, 256) dtype=float32>,
'conv4_1': <tf.Tensor 'Relu_24:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_2': <tf.Tensor 'Relu_25:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_3': <tf.Tensor 'Relu_26:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv4_4': <tf.Tensor 'Relu_27:0' shape=(1, 38, 50, 512) dtype=float32>,
'conv5_1': <tf.Tensor 'Relu_28:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_2': <tf.Tensor 'Relu_29:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_3': <tf.Tensor 'Relu_30:0' shape=(1, 19, 25, 512) dtype=float32>,
'conv5_4': <tf.Tensor 'Relu_31:0' shape=(1, 19, 25, 512) dtype=float32>,
'input': <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.


For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.









How to ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.


First we need to compute the “content cost” using TensorFlow.

  • The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
  • When we minimize the content cost later, this will help make sure G
    has similar content as C.

def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar that we need to compute using equation 1 above.
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape a_C and a_G
    a_C_unrolled = tf.reshape(tf.transpose(a_C), (m, n_H * n_W, n_C))
    a_G_unrolled = tf.reshape(tf.transpose(a_G), (m, n_H * n_W, n_C))

    # compute the cost with tensorflow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2) / (4. * n_H * n_W * n_C)

    return J_content


Computing the style cost

For our running example, we will use the following style image (S). This painting was painted in the impressionist style by Claude Monet.



def gram_matrix(A):
    """
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    GA = tf.matmul(A, tf.transpose(A))
    return GA


def compute_layer_style_cost(a_S, a_G):
    """
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_C, n_H*n_W)
    a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
    a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))

    # Computing gram_matrices for both images S and G
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2) / (4. * (n_H * n_W * n_C)**2)

    return J_style_layer


  • The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results combining this representation from multiple different layers. This is in contrast to the content representation, where usually using just a single hidden layer is sufficient.
  • Minimizing the style cost will cause the image G to follow the style of the image S.
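Combining the per-layer style costs into an overall style cost is then a weighted sum. A small sketch (the layer names and equal weights below are the common choice in the assignment, stated here as an assumption):

```python
# hypothetical layer weights, following the common choice in the assignment
STYLE_LAYERS = [('conv1_1', 0.2), ('conv2_1', 0.2), ('conv3_1', 0.2),
                ('conv4_1', 0.2), ('conv5_1', 0.2)]

def compute_style_cost(layer_costs, style_layers=STYLE_LAYERS):
    """Combine per-layer style costs into one weighted sum.
    `layer_costs` maps a layer name to J_style_layer for that layer."""
    return sum(coeff * layer_costs[name] for name, coeff in style_layers)
```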


Defining the total cost to optimize

Finally, let’s create and implement a cost function that minimizes both the style and the content cost. The formula is:



def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """
    J = alpha * J_content + beta * J_style
    return J

  • The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
  • α and β are hyperparameters that control the relative weighting between content and style.


Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

  • Create an Interactive Session
  • Load the content image
  • Load the style image
  • Randomly initialize the image to be generated
  • Load the VGG19 model
  • Build the TensorFlow graph:
    • Run the content image through the VGG19 model and compute the content cost.
    • Run the style image through the VGG19 model and compute the style cost.
    • Compute the total cost.
    • Define the optimizer and the learning rate.
  • Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).

Now, we initialize the “generated” image as a noisy image created from the content_image. Initializing the pixels of the generated image to be mostly noise but still slightly correlated with the content image helps the content of the “generated” image match the content of the “content” image more rapidly. The following figure shows the noisy image:
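A sketch of this initialization (the noise range and noise_ratio value are hypothetical choices, not taken from the assignment):

```python
import numpy as np

def generate_noise_image(content_image, noise_ratio=0.6, seed=0):
    """Initialize the generated image as mostly uniform noise,
    slightly correlated with the content image."""
    rng = np.random.default_rng(seed)
    noise = rng.uniform(-20, 20, size=content_image.shape).astype(np.float32)
    # weighted blend of pure noise and the content image
    return noise_ratio * noise + (1 - noise_ratio) * content_image
```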


Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

  • Assign the content image to be the input to the VGG model.
  • Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
  • Set a_G to be the tensor giving the hidden layer activation for the same layer.
  • Compute the content cost using a_C and a_G.

Next, we need to compute the style cost and compute the total cost J by taking a linear combination of the two. Use alpha = 10 and beta = 40.

Then we are going to  set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the TensorFlow graph, assign the input image (the initial generated image) as the input of the VGG-19 model, and run the model to minimize the total cost J for a large number of iterations.
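The overall pixel-optimization loop can be illustrated framework-free. The toy NumPy stand-in below minimizes a quadratic total cost alpha·||x − c||² + beta·||x − s||² by gradient descent on the “pixels” x; the actual assignment instead runs TensorFlow’s Adam optimizer on the real content and style costs:

```python
import numpy as np

def optimize_pixels(c, s, alpha=10., beta=40., lr=0.005, iters=2000):
    """Gradient descent on x for J(x) = alpha*|x-c|^2 + beta*|x-s|^2.
    The minimizer is the weighted average (alpha*c + beta*s)/(alpha+beta),
    mirroring how G trades off content (C) against style (S)."""
    x = np.zeros_like(c)                     # "randomly initialized" image
    for _ in range(iters):
        grad = 2*alpha*(x - c) + 2*beta*(x - s)
        x -= lr * grad                       # update the generated image
    return x
```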


The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.



Style (Claude Monet’s The Poppy Field near Argenteuil)


















Style (Van Gogh’s The Starry Night)










Content (Victoria Memorial Hall)


Style (Van Gogh’s The Starry Night)




Content (Taj Mahal)


Style (Van Gogh’s Starry Night Over the Rhone)






Style (Claude Monet’s Sunset in Venice)




Content (Visva Bharati)

Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )




Content (Howrah Bridge)


Style (Van Gogh’s The Starry Night)





Content (Leonardo Da Vinci’s Mona Lisa)


Style (Van Gogh’s The Starry Night)


Content (My sketch: Rabindranath Tagore)


Style (Abanindranath Tagore’s Rabindranath in the role of  blind singer )


Content (me)


Style (Van Gogh’s Irises)













Style (Pablo Picasso’s Factory at Horta de Ebro)




The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.



Style (Van Gogh’s The Starry Night)



convolution layer 3_2 used


convolution layer 4_2 used


convolution layer 5_2 used



Some Optimization: Implementing the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) Algorithms with Octave / Matlab


The following problems appeared in a project in the edX course 236862.1x: Introduction to the Fundamentals of Sparse Representations by Prof. Michael Elad from The Technion – Israel Institute of Technology. The descriptions of the problems are taken straightaway from the project.


Pursuit Algorithms

In this article we demonstrate the Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) algorithms by running them on a set of test signals and checking whether they provide the desired outcome for the P0 problem.

Some Theory 

Our goal is to solve the following problem:


which is also known as the P0 problem (since the objective function to be minimized is the L0 norm of a vector x): a problem that seeks the sparsest solution to a linear system. The concept of seeking the sparsest representation for data is of central importance to data processing in a universal way (e.g., in machine learning, image processing, computer vision, remote sensing), although the problem is NP-hard in general.

In order to solve it in practice, an approximation algorithm needs to be used, based on either a greedy or a relaxation approach, as shown in the next figure.


In this article the implementation of the above two algorithms and some experimental results on approximately solving the P0 problem with a synthetic signal dataset will be described. The following figure describes how the experiments are to be done:



Data Construction

In this part we create a dictionary and synthetic signals that will serve us in the experiments. Here are the ingredients involved:

  • The matrix A is the “dictionary” (of size 50-by-100) under which the ideal signal is known to be sparse. We need to first construct this dictionary by drawing at random a matrix with i.i.d. Gaussian entries. Normalize each atom to a unit L2 norm.

  • Then we need to generate a sparse vector x0 of cardinality s. Let’s draw at random the  locations of the non-zero entries. Then, draw at random the value of each entry from a uniform distribution in the range [1, 3] and multiply each by a random sign.
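The data construction above can be sketched in Python/NumPy as follows (the project itself uses Octave/Matlab; the function name and seed are illustrative):

```python
import numpy as np

def build_data(n=50, m=100, s=4, seed=0):
    """Random dictionary with unit-norm atoms and an s-sparse signal."""
    rng = np.random.default_rng(seed)
    # dictionary with i.i.d. Gaussian entries, atoms normalized to unit L2 norm
    A = rng.standard_normal((n, m))
    A /= np.linalg.norm(A, axis=0)
    # sparse vector: s random locations, values uniform in [1, 3] times a random sign
    x0 = np.zeros(m)
    support = rng.choice(m, size=s, replace=False)
    x0[support] = rng.uniform(1, 3, size=s) * rng.choice([-1, 1], size=s)
    b = A @ x0                 # noiseless measurement
    return A, x0, b
```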

The following figure shows the heatmap of the dictionary A randomly generated.


Greedy Pursuit

The greedy algorithm OMP is described in the following figure:


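For reference, here is a NumPy sketch of OMP under the usual formulation (unit-norm atoms, noiseless b); the project’s own implementation is in Octave/Matlab:

```python
import numpy as np

def omp(A, b, s):
    """Orthogonal Matching Pursuit: greedily pick the atom most
    correlated with the residual, then re-fit by least squares."""
    r = b.copy()
    support = []
    for _ in range(s):
        # atom with the largest absolute correlation with the residual
        k = int(np.argmax(np.abs(A.T @ r)))
        if k not in support:
            support.append(k)
        # least-squares solution restricted to the current support
        xs, *_ = np.linalg.lstsq(A[:, support], b, rcond=None)
        r = b - A[:, support] @ xs
    x = np.zeros(A.shape[1])
    x[support] = xs
    return x
```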

Basis Pursuit 


Even though the P0 problem is relaxed to P1, it’s still not ready to be solved by an LP solver, since the objective function is still not linear. As explained in this thesis, the P1 problem can be converted to an LP by introducing a bunch of new variables and using the trick shown in the following figure.


The Matlab/Octave function to implement the BP algorithm using LP is shown below:


Analyzing the Results

In this part we compare the results obtained via the OMP and BP, by executing the following steps.

  • Plot the average relative ℓ2 error, obtained by the OMP and BP versus the cardinality.
  • Plot the average support recovery error, obtained by the OMP and BP versus the cardinality.
  • Discuss the presented graphs and explain the results.





As can be seen from the above results,

• The Linear-Programming-based relaxed implementation of Basis Pursuit has higher accuracy (in terms of both the L2 and the support-recovery errors) than the greedy Orthogonal Matching Pursuit, particularly when the cardinality of the true solution increases beyond 6.

• Both OMP and BP (LP) perform equally well (with almost zero L2 error) up to cardinality 6, beyond which OMP clearly performs worse than BP (LP).

• The average L2 error for OMP increases up to 0.2 as the cardinality of the true solution increases to 15, whereas that of BP increases only very slightly.

• A similar pattern can be seen for the probability of error in the support.

Mutual Coherence of A and theoretical guarantees for OMP and BP


where the mutual coherence of matrix A is defined as follows:


For our experiment, the mutual coherence μ(A) = 0.45697. Hence, as per the theoretical guarantee, OMP and BP are guaranteed to find the solution of P0 for s < 1.594164, i.e., only for s = 1. In practice, however, we observe that up to s = 6 the solutions found are quite accurate, since the theoretical bound shown above is for the worst case only.
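The mutual coherence (the largest absolute normalized inner product between distinct atoms) and the resulting bound s < (1 + 1/μ)/2 can be computed as follows (a NumPy sketch; the project uses Octave/Matlab):

```python
import numpy as np

def mutual_coherence(A):
    """mu(A) = max over i != j of |a_i^T a_j| / (||a_i|| ||a_j||)."""
    An = A / np.linalg.norm(A, axis=0)   # normalize the atoms
    G = np.abs(An.T @ An)                # absolute normalized Gram matrix
    np.fill_diagonal(G, 0)               # ignore the diagonal (self-products)
    return G.max()

def coherence_bound(A):
    """OMP/BP are guaranteed to recover x0 when its cardinality s < this bound."""
    return 0.5 * (1 + 1 / mutual_coherence(A))
```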


Data Science with Python: Exploratory Analysis with Movie-Ratings and Fraud Detection with Credit-Card Transactions

The following problems are taken from the projects / assignments in the edX course Python for Data Science (UCSanDiagoX) and the coursera course Applied Machine Learning in Python (UMich).


1. Exploratory Analysis to Find Trends in Average Movie Ratings for different Genres


● The IMDB Movie Dataset (MovieLens 20M) is used for the analysis.
● The dataset is downloaded from here .
● This dataset contains 20 million ratings and 465,000 tag applications applied to 27,000 movies by 138,000 users and was released in 4/2015.
● The csv files movies.csv and ratings.csv are used for the analysis.



● Understand the trend in average ratings for different movie genres over the years (from 1995 to 2015) and the correlation between the trends for different genres (7 genres are considered: Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical).

● This will give us an insight into how people’s liking for the different movie genres changes over time, and into the strength of association between the trends for different genres, insights possibly useful for critics.

Research Questions

The answer to the following research questions will be searched for, using exploratory analysis and visualization:

1) The trend in average ratings for different movie genres: How do the average ratings for a few different movie genres (namely, Animation, Comedy, Romance, Thriller, Horror, Sci-Fi and Musical) change over time (across the years 1995 to 2015)? How can the average ratings for different genres be compared among themselves?

2) Correlation between the trends for different genres: How are the trends for the genres correlated? For example, are the average ratings for Comedy and Sci-Fi movies positively associated with each other? What is the strength of association?


Pre-processing the dataset

The next table shows the first few rows of the movies dataset, loaded in a pandas DataFrame.

movieId title genres
0 1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
1 2 Jumanji (1995) Adventure|Children|Fantasy
2 3 Grumpier Old Men (1995) Comedy|Romance
3 4 Waiting to Exhale (1995) Comedy|Drama|Romance
4 5 Father of the Bride Part II (1995) Comedy
print(movies.shape)
(27278, 3)
With an explode() function implementation similar to this stackoverflow post, the DataFrame is transformed to the one shown below.
movies.genres = movies.genres.str.split('|')
movies = explode(movies, ['genres'])
movieId title genres
0 1 Toy Story (1995) Adventure
1 1 Toy Story (1995) Animation
2 1 Toy Story (1995) Children
3 1 Toy Story (1995) Comedy
4 1 Toy Story (1995) Fantasy
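One possible implementation of such an explode() helper is sketched below (newer pandas versions provide an equivalent DataFrame.explode built in; the signature here mirrors the call above):

```python
import numpy as np
import pandas as pd

def explode(df, cols):
    """Expand a column of lists into one row per list element,
    repeating the remaining columns accordingly."""
    col = cols[0] if isinstance(cols, list) else cols
    lens = df[col].str.len()           # number of elements per row
    out = pd.DataFrame({c: np.repeat(df[c].values, lens)
                        for c in df.columns if c != col})
    out[col] = np.concatenate(df[col].values)
    return out[df.columns]             # keep the original column order
```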

The next table shows the first few rows of the ratings dataset, again loaded with pandas.

userId movieId rating timestamp
0 1 2 3.5 1112486027
1 1 29 3.5 1112484676
2 1 32 3.5 1112484819
3 1 47 3.5 1112484727
4 1 50 3.5 1112484580

The input tables are pre-processed using the following code to get the data in the desired format, ready for the analysis.

import time
ratings['timestamp'] = ratings['timestamp'].apply(lambda x: time.strftime('%Y', time.localtime(x)))
userId movieId rating timestamp
0 1 2 3.5 2005
1 1 29 3.5 2005
2 1 32 3.5 2005
3 1 47 3.5 2005
4 1 50 3.5 2005
movies = movies.drop('title', axis=1)
movieId genres
0 1 Adventure
1 1 Animation
2 1 Children
3 1 Comedy
4 1 Fantasy
ratings = ratings.merge(movies, left_on='movieId', right_on='movieId', how='inner')
userId movieId rating timestamp genres
0 1 2 3.5 2005 Adventure
1 1 2 3.5 2005 Children
2 1 2 3.5 2005 Fantasy
3 5 2 3.0 1996 Adventure
4 5 2 3.0 1996 Children
(53177059, 5)
ratings = ratings.loc[ratings['genres'].isin(['Sci-Fi', 'Animation', 'Comedy', 'Romance', 'Thriller', 'Horror', 'Musical'])]
mean_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
mean_ratings.rename(columns={'timestamp': 'year'}, inplace=True)
sd_ratings = ratings.groupby(['timestamp', 'genres'], as_index=False)['rating'].aggregate(np.std)
sd_ratings.rename(columns={'timestamp': 'year'}, inplace=True)
sd_ratings.head()
year genres rating
0 1995 Comedy 0.000000
1 1995 Romance NaN
2 1995 Thriller 1.414214
3 1996 Animation 0.953820
4 1996 Comedy 1.010918



Let’s first visualize (visualization done using seaborn) the distributions of the movie-ratings across different genres, as shown in the following figure.


The next figure shows the trends of the average ratings by users for different genres across different years.  The vertical error bars represent the standard deviations of the average ratings (ratings for different movies averaged over users) for the same genres (s.d. across the movies belonging to the same genre).

ratings2 = ratings.groupby(['movieId', 'timestamp', 'genres'], as_index=False)['rating'].aggregate(np.mean)
movieId year genres rating
0 1 1996 Animation 4.132947
1 1 1996 Comedy 4.132947
2 1 1997 Animation 3.875221
3 1 1997 Comedy 3.875221
4 1 1998 Animation 3.883929


The next figure shows the trends of the ratings (averaged over users and movies for each genre) for different genres across different years.


ratings3 = ratings.groupby(['userId', 'genres'], as_index=False)['rating'].aggregate(np.mean)
userId genres rating
0 1 Animation 3.650000
1 1 Comedy 3.731707
2 1 Horror 3.744444
3 1 Musical 3.666667
4 1 Romance 3.954545
pivot = ratings3.pivot(index='userId', columns='genres', values='rating')
genres Animation Comedy Horror Musical Romance Sci-Fi Thriller
1 3.650000 3.731707 3.744444 3.666667 3.954545 3.712500 3.761905
2 3.000000 3.900000 3.555556 3.000000 3.833333 4.608696 4.263158
3 3.750000 4.057692 3.937500 4.000000 4.062500 4.000000 4.260000
4 4.000000 3.545455 NaN 4.000000 3.500000 3.000000 3.461538
5 4.666667 4.083333 3.000000 4.375000 3.937500 4.600000 4.333333

The next figures show the trends for different genres for different sub-windows of time and with the variations (inter-quartile ranges)  in average ratings for each genre.


The next figures show how correlated the trends in average ratings for the different genres are.
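These correlations can be computed by pivoting the yearly genre averages into columns and using pandas’ built-in corr() (a sketch assuming the mean_ratings frame built earlier, with columns year, genres, rating):

```python
import pandas as pd

def genre_trend_correlations(mean_ratings):
    """Pivot to one column per genre, then compute the pairwise
    Pearson correlation between the yearly rating trends."""
    trends = mean_ratings.pivot(index='year', columns='genres', values='rating')
    return trends.corr()
```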


Some Findings

● There is a decreasing trend in the average ratings for all 7 genres during 1995-98; the ratings then become stable during 1999-2007, and then increase again.

● Horror movies always have the lowest average ratings. Sci-Fi and Comedy movies also get low average ratings.

● Musical, Animation and Romance movies get the highest average ratings.

● Sci-Fi and Animation movies show very similar trends; they again become popular during 2009-2013.

● Trends in the average ratings of Romance and Horror movies show positive association between them.


2. Fraud Detection with Classification


In this project, we aim to build machine learning models to automatically detect frauds in credit card transactions. Several supervised binary classification models will be trained using 75-25 validation on this credit card transaction dataset from Kaggle. Given a transaction instance, a model will predict whether it is fraud or not. Different models will then be evaluated on a held-out subset of this data by measuring how effectively they predict instances of credit card fraud. The model with the best recall value (the one which is able to detect the highest number of true frauds) will be selected for prediction. We found that the k-Nearest Neighbors (k=3) and LogisticRegression models perform the best when only the recall (sensitivity) is concerned, whereas the RandomForest model gives pretty high specificity and sensitivity at the same time.



Using the credit card transaction dataset, we want to train a few machine learning models that can predict whether an unseen transaction in the future is likely to be fraud or not. This automatic prediction / detection of fraud can immediately raise an alarm and the transaction could be stopped before it completes. This can also help by expediting the process of manual inspection / examination / investigation of the predicted fraudulent transactions, thereby ensuring safety for the financial institution / bank / customers.


● This dataset from Kaggle is used for credit card fraud detection. The dataset has been collected and analyzed during a research collaboration of Worldline and the Machine Learning Group of ULB (Université Libre de Bruxelles) on big data mining and fraud detection. More details on current and past projects on related topics are available here and here.

● The dataset contains transactions made by credit cards in September 2013 by European cardholders. It presents transactions that occurred over two days, with 492 frauds out of 284,807 transactions. The dataset is highly unbalanced; the positive class (frauds) accounts for 0.172% of all transactions. Each row in the dataset creditcard.csv corresponds to a credit card transaction. The dataset contains 284,807 rows and 30 columns.

● Features (Columns) include an integer variable Time (which represents the seconds elapsed between each transaction and the first transaction in the dataset), 28 confidential variables V1 through V28 (these are numerical input variables which are the result of a PCA transformation – unfortunately, due to confidentiality issues, the original features and more background information about the data could not be provided), as well as the variable Amount which is the amount of the transaction.

● The target is stored in the Class column, where a value of 1 corresponds to an instance of fraud and 0 corresponds to an instance of not fraud.

● The next table shows the first few rows.


Data Preparation and Cleaning

● The main problem with this dataset is that it is highly unbalanced: the positive class (frauds) accounts for only 0.172% of all transactions. There are only 492 frauds out of 284,807 transactions: too many negative instances and too few positive (fraud) instances.

● In order to mitigate this high imbalance ratio, so that the models can see enough fraud examples while training, the following techniques were used.

  1. Oversampling: the Synthetic Minority Oversampling Technique (SMOTE) is used to generate new fraud (minority class) samples with interpolation and k-nearest neighbors. Using this technique, the number of positive examples was increased to 5000 samples.
  2. Under-sampling: the NearMiss-1 algorithm is used to select samples from the majority (non-fraud) class for which the average distance to the k nearest samples of the minority class is the smallest. Using this technique, the number of majority (non-fraud) examples was decreased to 10000 samples.

● The input feature Time is not that relevant for predicting fraud, which is why that feature is dropped from the input.


Research Questions

Predict whether a given transaction is fraudulent or not.

● Given a credit card transaction (represented by the values of the 30 input features), the goal is to answer the following question: is the transaction fraud?

● More mathematically, given the labelled data, we want to learn a function f :

from the data. Also we shall be interested to compute the conditional probability Prob(fraud|transaction).

● We want to use the function learnt to predict new transactions (not seen while learning the function ).

● We want to evaluate how correctly we can find frauds from the unseen data and find which model performs the best (model selection).



● The dataset is first split (with sklearn) into random train (75%) and test (25%) subsets. Both subsets roughly maintain the ratio of majority vs. minority class.

Split the data into train and test

from sklearn.model_selection import train_test_split

df = pd.read_csv(path + 'creditcard.csv')
df.drop('Time', axis=1, inplace=True)

X = df.iloc[:,:-1]
y = df.iloc[:,-1]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

● The training dataset is highly imbalanced (only 372 fraud instances out of 213,607 total instances) w.r.t. class-ratio, so it is balanced using SMOTE (oversampling: increasing the number of fraud instances to 5000) and NearMiss-1 (under-sampling: decreasing the number of non-fraud instances to 10000). The test dataset is not touched.

Preprocessing: Resample to balance the dataset
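A minimal sketch of the SMOTE interpolation idea in NumPy (in practice the resampling is done with a library such as imbalanced-learn, which also provides NearMiss; the parameters below are illustrative):

```python
import numpy as np

def smote_oversample(X_min, n_new, k=5, rng=None):
    """Generate n_new synthetic minority samples by interpolating
    between each sample and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(rng)
    n = len(X_min)
    # pairwise distances among the minority samples
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :k]   # k nearest neighbors of each point
    synth = []
    for _ in range(n_new):
        i = rng.integers(n)
        j = nbrs[i, rng.integers(min(k, n - 1))]
        u = rng.random()
        # new point lies on the segment between the sample and its neighbor
        synth.append(X_min[i] + u * (X_min[j] - X_min[i]))
    return np.vstack(synth)
```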


● A few sklearn models (kNN, SVM, LogisticRegression, RandomForest, DecisionTree, AdaBoost, NaiveBayes) are then trained separately on the training dataset, and every time a model is learnt, it is used to predict the class of the hitherto-unseen test dataset.

● Given the class imbalance ratio, one of the recommended measures for model evaluation is the Area Under the Precision-Recall Curve (AUPRC), since confusion-matrix accuracy is not meaningful for unbalanced classification. For example, training a dummy classifier that classifies everything as the majority class of the training data gives an accuracy of 0.996643 but a very poor recall of 0.008333, as shown below.

Using X_train, X_test, y_train, and y_test (as defined above), let’s train a dummy classifier that classifies everything as the majority class of the training data, and compute the accuracy and recall of this classifier.

from sklearn.metrics import recall_score, precision_score
from sklearn.dummy import DummyClassifier

clf = DummyClassifier(random_state=0)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
accuracy = float(sum(y_pred == y_test)) / len(y_test)
recall = recall_score(y_test, y_pred)
print(accuracy, recall)
# (0.9966433527148114, 0.0083333333333333332)

● Particularly we shall be interested in high Recall, since ideally we want all the fraud instances to be predicted correctly as fraud instances by the model, with zero False Negatives.



● The next figure shows the prediction evaluation results on the test dataset using the python sklearn Support Vector Classifier with RBF kernel. As can be seen, using training with resampling, the recall becomes quite high (~91.7%), although the accuracy and precision drops.

from sklearn.svm import SVC
from sklearn.metrics import precision_recall_curve, roc_curve, auc, confusion_matrix

clf = SVC(random_state=0, C=1e9, gamma=1e-07, probability=True)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)
precision, recall, _ = precision_recall_curve(y_test, y_pred)
fpr, tpr, _ = roc_curve(y_test, y_prob[:,1], pos_label=1)
print(auc(fpr, tpr))
cnf_matrix = confusion_matrix(y_test, y_pred)
plot_confusion_matrix(cnf_matrix, classes=['0','1'], title='Confusion matrix, without normalization')


● The next figure shows the prediction evaluation results on the test dataset using the python sklearn LogisticRegression classifier. As can be seen, using training with resampling, the recall becomes even higher (~94.2%), although the accuracy and precision drops further.


● The next figure again shows the prediction recall values on the test dataset using the sklearn LogisticRegression classifier, but this time using GridSearch with different hyper-parameter values (for C and regularization).

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

parameters = {'penalty': ['l1', 'l2'], 'C': [0.01, 0.1, 1, 10, 100]}
lr = LogisticRegression()
clf = GridSearchCV(lr, parameters, scoring='recall'), y_train)
print(np.reshape(clf.cv_results_['mean_test_score'], (5, 2)))

● As can be seen, C = 100 with L1 penalty obtains the best recall on the test dataset among the values searched on the grid.

● Also, there are 120 fraud instances in the test dataset, out of which all but 7 are detected correctly by the best Logistic Regression model. The next table shows a few instances, highlighted in red, for which the model failed to predict fraud.

● The next figure shows visualizations of the classification decision functions learnt from the training dataset and the class labels predicted for the test dataset, projected on the variables V1 and V2, for different classifiers.


● For every classifier, the (specificity, sensitivity) metrics on the test dataset are also computed and shown.

● As can be seen, the RandomForest classifier does a decent job of keeping both specificity and sensitivity quite high (89% and 88% respectively).

● The k-NearestNeighbors (with k = 3) and LogisticRegression classifiers do the best in terms of recall (95% and 94% respectively), but their specificity is quite poor.



Limitations and Future Work

● The models are learnt from this particular credit card fraud dataset and hence may not generalize to other fraud datasets.

● The dataset being highly imbalanced, other methods (such as different variants of SMOTE and ADASYN oversampling, or Tomek’s-link / Edited Nearest Neighbor variants such as AllKNN) might be tried as a pre-processing step to create a more balanced dataset. Also, the resampling here produced a final training dataset with a 1:2 ratio of minority to majority instances; different ratios (e.g., 1:1 or 2:3) could be tried.
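As a sketch of the kind of resampling described here, random undersampling of the majority class down to a 1:2 minority:majority ratio can be done with plain numpy (synthetic data below; the post itself used other resampling variants):

```python
import numpy as np

# Synthetic imbalanced data: 50 minority (1) vs. 950 majority (0) instances.
rng = np.random.RandomState(0)
X = rng.randn(1000, 3)
y = np.array([1] * 50 + [0] * 950)

min_idx = np.where(y == 1)[0]
maj_idx = np.where(y == 0)[0]

# Keep all minority instances and a random 2x-sized majority subset.
keep_maj = rng.choice(maj_idx, size=2 * len(min_idx), replace=False)
idx = np.concatenate([min_idx, keep_maj])

X_res, y_res = X[idx], y[idx]
print(len(min_idx), len(keep_maj))  # 50 100
```

The resulting training set has the 1:2 ratio mentioned above; trying 1:1 or 2:3 only changes the `size=` argument.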

● 75-25 hold-out validation is used to evaluate the classifier models; instead, k-fold (e.g., k=10) cross-validation could be used to obtain more generalizable models.

● Hyper-parameter fine-tuning could be done for all models; here it was done only for LogisticRegression, with grid search. For probabilistic classifiers (e.g., Logistic Regression), we could also try classification thresholds other than the default.
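A small sketch of such threshold tuning, on hypothetical predicted probabilities (not from the actual model): lowering the threshold trades precision for recall.

```python
import numpy as np

# Made-up true labels and predicted probabilities for illustration only.
y_true = np.array([0, 0, 1, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.45, 0.6, 0.3])

def recall_at(threshold):
    # Classify as fraud whenever the predicted probability clears the threshold.
    y_pred = (y_prob >= threshold).astype(int)
    tp = ((y_pred == 1) & (y_true == 1)).sum()
    return tp / (y_true == 1).sum()

print(recall_at(0.5), recall_at(0.3))  # 0.25 1.0
```

With the default 0.5 threshold only one of four frauds is caught; at 0.3 all are, at the cost of more false alarms.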

● We assumed the transactions to be independent of each other (not time dependent) and ignored the Time feature; one could instead use Time to build a sequential model for prediction.

● The entire dataset might be used with data augmentation for the minority fraud class and deep learning models (CNN or RNN if the Time feature is used) can be learnt and later used for prediction to improve specificity and sensitivity (recall) simultaneously.

● We could try to find the patterns in the input features for a probable fraud instance.



● As can be seen, the LogisticRegression and k-NearestNeighbors (k=3) classifiers obtain the best recall on the held-out dataset (although their precision / specificity is quite poor). Hence, if recall is the only metric of concern, we can go ahead with either of these models for fraud detection / prediction of a fraud instance.
● If precision / specificity is also a secondary concern (e.g., we may not want too many false positives / alarms) along with the primary concern of recall, we can go ahead with the RandomForest model.



● Andrea Dal Pozzolo, Olivier Caelen, Reid A. Johnson and Gianluca Bontempi. Calibrating Probability with Undersampling for Unbalanced Classification. In Symposium on Computational Intelligence and Data Mining (CIDM), IEEE, 2015.

Credit card transaction dataset from Kaggle.


Some Deep Learning with Python, TensorFlow and Keras


The following problems are taken from a few assignments from the Coursera courses Introduction to Deep Learning (by Higher School of Economics) and Neural Networks and Deep Learning (by Prof. Andrew Ng, The problem descriptions are taken straightaway from the assignments.


1. Linear models, Optimization

In this assignment a linear classifier will be implemented and it will be trained using stochastic gradient descent with numpy.

Two-dimensional classification

To make things more intuitive, let’s solve a 2D classification problem with synthetic data.



As we can see, the data above isn’t linearly separable. Hence we should add features (or use a non-linear model). Note that the decision boundary between the two classes has the form of a circle, so we can add quadratic features to make the problem linearly separable. The idea is illustrated in the image below:


Here are some test results for the implemented expand function, which is used to add the quadratic features:

# simple test on random numbers

dummy_X = np.array([[0, 0], [1, 0], [2.61, -1.28], [-0.59, 2.1]])

# call expand function
dummy_expanded = expand(dummy_X)

# what it should have returned:   x0       x1       x0^2     x1^2     x0*x1    1
dummy_expanded_ans = np.array([[ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ],
                               [ 1.    ,  0.    ,  1.    ,  0.    ,  0.    ,  1.    ],
                               [ 2.61  , -1.28  ,  6.8121,  1.6384, -3.3408,  1.    ],
                               [-0.59  ,  2.1   ,  0.3481,  4.41  , -1.239 ,  1.    ]])
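A minimal expand() consistent with the expected output above (the exact course implementation may differ in detail):

```python
import numpy as np

# Map each 2D point (x0, x1) to the features [x0, x1, x0^2, x1^2, x0*x1, 1].
def expand(X):
    x0, x1 = X[:, 0], X[:, 1]
    return np.stack([x0, x1, x0**2, x1**2, x0 * x1, np.ones_like(x0)], axis=1)

dummy_X = np.array([[0, 0], [1, 0], [2.61, -1.28], [-0.59, 2.1]])
dummy_expanded = expand(dummy_X)
print(dummy_expanded)
```

Each row reproduces the corresponding row of the expected answer table above.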


Logistic regression

To classify objects, we will estimate the probability that an object belongs to class ‘1’. To predict this probability we will pass the output of the linear model through the logistic function:


def probability(X, w):
    """
    Given input features and weights,
    return predicted probabilities of y==1 given x, P(y=1|x), see description above
    :param X: feature matrix X of shape [n_samples,6] (expanded)
    :param w: weight vector w of shape [6] for each of the expanded features
    :returns: an array of predicted probabilities in [0,1] interval
    """
    return 1. / (1 + np.exp(, w)))

In logistic regression the optimal parameters w are found by cross-entropy minimization:


def compute_loss(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute the scalar loss function using the formula above.
    """
    return -np.mean(y*np.log(probability(X, w)) + (1-y)*np.log(1-probability(X, w)))

Since we train our model with gradient descent, we should compute gradients. To be specific, we need the following derivative of the loss function with respect to each weight:


Here is the derivation:
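With σ denoting the logistic function and n the number of samples, the loss defined above and its gradient work out to:

```latex
L(w) = -\frac{1}{n}\sum_{i=1}^{n}\Bigl[\,y_i\log\sigma(x_i^\top w)
       + (1-y_i)\log\bigl(1-\sigma(x_i^\top w)\bigr)\Bigr],
\qquad
\nabla_w L = \frac{1}{n}\,X^\top\bigl(\sigma(Xw) - y\bigr)
```

The simplification uses σ′(z) = σ(z)(1 − σ(z)), and matches the vectorized compute_grad below.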


def compute_grad(X, y, w):
    """
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute vector [6] of derivatives of L over each weight.
    """
    return, w) - y, X) / X.shape[0]


In this section we’ll use the functions we wrote to train our classifier using stochastic gradient descent. We shall try changing hyper-parameters like the batch size and the learning rate to find the best settings.

Mini-batch SGD

Stochastic gradient descent just takes a random example on each iteration, calculates a gradient of the loss on it and makes a step:


w = np.array([0, 0, 0, 0, 0, 1]) # initialize

eta = 0.05 # learning rate
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    w = w - eta*dw

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD where batch-size=4.


SGD with momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in image below. It does this by adding a fraction α of the update vector of the past time step to the current update vector.
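In symbols, with momentum coefficient α, learning rate η, and mini-batch gradient g_t, the update is:

```latex
\nu_t = \alpha\,\nu_{t-1} + \eta\,g_t,
\qquad
w_t = w_{t-1} - \nu_t
```

This matches the code below, where nu accumulates the velocity across iterations.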



eta = 0.05 # learning rate
alpha = 0.9 # momentum
nu = np.zeros_like(w)
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    nu = alpha*nu + eta*dw
    w = w - nu

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + momentum  where batch-size=4. As can be seen, the loss function drops much faster, leading to a faster convergence.



We also need to implement the RMSProp algorithm, which uses an exponential moving average of squared gradients to adapt the learning rate per weight, as follows:
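In symbols, with decay α, learning rate η, mini-batch gradient g_t, and a small ε for numerical stability:

```latex
G_t = \alpha\,G_{t-1} + (1-\alpha)\,g_t^{2},
\qquad
w_t = w_{t-1} - \frac{\eta\,g_t}{\sqrt{G_t + \varepsilon}}
```

The code below implements exactly this update with element-wise operations.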


eta = 0.05 # learning rate
alpha = 0.9 # momentum
G = np.zeros_like(w)
eps = 1e-8
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    G = alpha*G + (1-alpha)*dw**2
    w = w - eta*dw / np.sqrt(G + eps)

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + RMSProp where batch-size=4. As can be seen again, the loss function drops much faster, leading to a faster convergence.


2. Planar data classification with a neural network with one hidden layer, an implementation from scratch

In this assignment a neural net with a single hidden layer will be trained from scratch. We shall see a big difference between this model and the one implemented using logistic regression.

We shall learn how to:

  • Implement a 2-class classification neural network with a single hidden layer
  • Use units with a non-linear activation function, such as tanh
  • Compute the cross entropy loss
  • Implement forward and backward propagation


The following figure visualizes a “flower” 2-class dataset that we shall work on; the colors indicate the class labels. We have m = 400 training examples.


Simple Logistic Regression

Before building a full neural network, let’s first see how logistic regression performs on this problem. We can use sklearn’s built-in functions for that, by running the code below to train a logistic regression classifier on the dataset.

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV(), Y.T.ravel())

We can now plot the decision boundary of the model and accuracy with the following code.

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")

# Print accuracy
LR_predictions = clf.predict(X.T)
print('Accuracy of logistic regression: %d ' % float((, LR_predictions) +, 1 - LR_predictions)) / float(Y.size) * 100) +
      '% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)

Interpretation: The dataset is not linearly separable, so logistic regression doesn’t perform well. Hopefully a neural network will do better. Let’s try this now!


Neural Network model

Logistic regression did not work well on the “flower dataset”. We are going to train a Neural Network with a single hidden layer, by implementing the network with python numpy from scratch.

Here is our model:


The general methodology to build a Neural Network is to:

1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

Defining the neural network structure

Define three variables and the function layer_sizes:

- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 4) 
- n_y: the size of the output layer
def layer_sizes(X, Y):
    """
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
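A minimal sketch of layer_sizes(), assuming (as in the shapes above) that examples are stored column-wise:

```python
import numpy as np

# Sketch of layer_sizes(); n_h is fixed to 4 as the assignment specifies.
def layer_sizes(X, Y):
    n_x = X.shape[0]   # size of the input layer
    n_h = 4            # size of the hidden layer
    n_y = Y.shape[0]   # size of the output layer
    return n_x, n_h, n_y

X = np.zeros((2, 400))   # 2 features, 400 examples
Y = np.zeros((1, 400))   # 1 output label per example
print(layer_sizes(X, Y))  # (2, 4, 1)
```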

Initialize the model’s parameters

Implement the function initialize_parameters().


  • Make sure the parameters’ sizes are right. Refer to the neural network figure above if needed.
  • We will initialize the weights matrices with random values.
    • Use: np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
  • We will initialize the bias vectors as zeros.
    • Use: np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
def initialize_parameters(n_x, n_h, n_y):
    """
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    Returns:
    params -- python dictionary containing the parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """
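A sketch of initialize_parameters() following the tips above (small random weights, zero biases; the seed is arbitrary):

```python
import numpy as np

# Sketch: random weight matrices scaled by 0.01, zero bias vectors.
def initialize_parameters(n_x, n_h, n_y):
    np.random.seed(2)
    return {"W1": np.random.randn(n_h, n_x) * 0.01,
            "b1": np.zeros((n_h, 1)),
            "W2": np.random.randn(n_y, n_h) * 0.01,
            "b2": np.zeros((n_y, 1))}

params = initialize_parameters(2, 4, 1)
print(params["W1"].shape, params["b2"].shape)  # (4, 2) (1, 1)
```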

The Loop

Implement forward_propagation().


  • Look above at the mathematical representation of the classifier.
  • We can use the function sigmoid().
  • We can use the function np.tanh(). It is part of the numpy library.
  • The steps we have to implement are:
    1. Retrieve each parameter from the dictionary “parameters” (which is the output of initialize_parameters()) by using parameters[".."].
    2. Implement Forward Propagation. Compute Z[1], A[1], Z[2] and A[2] (the vector of all the predictions on all the examples in the training set).
  • Values needed in the backpropagation are stored in “cache“. The cache will be given as an input to the backpropagation function.
def forward_propagation(X, parameters):
    """
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing the parameters (output of initialization function)
    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """

Implement compute_cost() to compute the value of the cost J.  There are many ways to implement the cross-entropy loss.


def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing the parameters W1, b1, W2 and b2
    Returns:
    cost -- cross-entropy cost given equation (13)
    """
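One of the many possible implementations of the cross-entropy cost is the mean over the m examples, sketched here:

```python
import numpy as np

# Sketch of compute_cost(): mean binary cross-entropy over the m examples.
def compute_cost(A2, Y, parameters=None):
    m = Y.shape[1]
    logprobs = Y * np.log(A2) + (1 - Y) * np.log(1 - A2)
    return float(-np.sum(logprobs) / m)

# For two examples with predicted probability 0.5 each, the cost is log(2).
A2 = np.array([[0.5, 0.5]])
Y = np.array([[1, 0]])
cost = compute_cost(A2, Y)
print(cost)
```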

Using the cache computed during forward propagation, we can now implement backward propagation.

Implement the function backward_propagation().

Instructions: Backpropagation is usually the hardest (most mathematical) part in deep learning. The following figure is taken from the lecture slide on backpropagation. We’ll want to use the six equations on the right of this slide, since we are building a vectorized implementation.


def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    Returns:
    grads -- python dictionary containing the gradients with respect to different parameters
    """

Implement the update rule. Use gradient descent. We have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

General gradient descent rule: θ = θ − α (∂J/∂θ), where α is the learning rate and θ represents a parameter.

Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.


def update_parameters(parameters, grads, learning_rate = 1.2):
    """
    Updates parameters using the gradient descent update rule given above
    parameters -- python dictionary containing our parameters
    grads -- python dictionary containing our gradients
    Returns:
    parameters -- python dictionary containing our updated parameters
    """
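A sketch of update_parameters() applying the rule above to each of (W1, b1, W2, b2); the toy values below are made up:

```python
import numpy as np

# Sketch: one gradient descent step per parameter, theta = theta - alpha * dtheta.
def update_parameters(parameters, grads, learning_rate=1.2):
    return {k: parameters[k] - learning_rate * grads["d" + k]
            for k in parameters}

params = {"W1": np.ones((4, 2)), "b1": np.zeros((4, 1)),
          "W2": np.ones((1, 4)), "b2": np.zeros((1, 1))}
grads = {"dW1": np.ones((4, 2)), "db1": np.zeros((4, 1)),
         "dW2": np.ones((1, 4)), "db2": np.zeros((1, 1))}
new = update_parameters(params, grads, learning_rate=0.5)
print(new["W1"][0, 0])  # 0.5
```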

Integrate previous parts in nn_model()

Build the neural network model in nn_model().

Instructions: The neural network model has to use the previous functions in the right order.

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    """
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- number of iterations in the gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """


Use the model to predict by building predict(). Use forward propagation to predict results.


def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X
    parameters -- python dictionary containing our parameters
    X -- input data of size (n_x, m)
    Returns:
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """

It is time to run the model and see how it performs on a planar dataset. Run the following code to test our model with a single hidden layer of n_h hidden units.

# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(X, Y, n_h = 4, num_iterations = 10000, print_cost=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.288083
Cost after iteration 2000: 0.254385
Cost after iteration 3000: 0.233864
Cost after iteration 4000: 0.226792
Cost after iteration 5000: 0.222644
Cost after iteration 6000: 0.219731
Cost after iteration 7000: 0.217504
Cost after iteration 8000: 0.219471
Cost after iteration 9000: 0.218612


# Print accuracy
predictions = predict(parameters, X)
print ('Accuracy: %d' % float((,predictions.T) +,1-predictions.T))/float(Y.size)*100) + '%')
 Accuracy: 90%

Accuracy is really high compared to Logistic Regression. The model has learnt the leaf patterns of the flower! Neural networks are able to learn even highly non-linear decision boundaries, unlike logistic regression.

Now, let’s try out several hidden layer sizes. We can observe different behaviors of the model for various hidden layer sizes. The results are shown below.

Tuning hidden layer size

Accuracy for 1 hidden units: 67.5 %
Accuracy for 2 hidden units: 67.25 %
Accuracy for 3 hidden units: 90.75 %
Accuracy for 4 hidden units: 90.5 %
Accuracy for 5 hidden units: 91.25 %
Accuracy for 20 hidden units: 90.0 %
Accuracy for 50 hidden units: 90.25 %



  • The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
  • The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fit the data well without incurring noticeable overfitting.
  • We shall also learn later about regularization, which lets us use very large models (such as n_h = 50) without much overfitting.


3. Getting deeper with Keras


  • Tensorflow is a powerful and flexible tool, but coding large neural architectures with it is tedious.
  • There are plenty of deep learning toolkits that work on top of it, like Slim, TFLearn, Sonnet, and Keras.
  • The choice is a matter of taste and of the particular task.
  • We’ll be using Keras to predict handwritten digits with the mnist dataset.
  • The following figure shows 225 sample images from the dataset.




The pretty keras

Using only the following few lines of code, we can train a simple deep neural net with 3 dense hidden layers with ReLU activation, and dropout 0.5 after each dense layer.

import keras
from keras.models import Sequential
import keras.layers as ll

model = Sequential(name="mlp")
model.add(ll.InputLayer([28, 28]))

# network body: Flatten, then three Dense(128) + ReLU + Dropout(0.5)
# blocks (reconstructed to match the model summary below)
model.add(ll.Flatten())
for _ in range(3):
    model.add(ll.Dense(128))
    model.add(ll.Activation('relu'))
    model.add(ll.Dropout(0.5))

# output layer: 10 neurons for each class with softmax
model.add(ll.Dense(10, activation='softmax'))

# categorical_crossentropy is our good old crossentropy
# but applied for one-hot-encoded vectors
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])

The following shows the summary of the model:

Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 28, 28)            0         
flatten_12 (Flatten)         (None, 784)               0         
dense_35 (Dense)             (None, 128)               100480    
activation_25 (Activation)   (None, 128)               0         
dropout_22 (Dropout)         (None, 128)               0         
dense_36 (Dense)             (None, 128)               16512     
activation_26 (Activation)   (None, 128)               0         
dropout_23 (Dropout)         (None, 128)               0         
dense_37 (Dense)             (None, 128)               16512     
activation_27 (Activation)   (None, 128)               0         
dropout_24 (Dropout)         (None, 128)               0         
dense_38 (Dense)             (None, 10)                1290      
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0

Model interface

Keras models follow Scikit-learn‘s interface of fit/predict with some notable extensions. Let’s take a tour.

# fit(X,y) ships with a neat automatic logging.
#          Highly customizable under the hood., y_train,
          validation_data=(X_val, y_val), epochs=13);
 Train on 50000 samples, validate on 10000 samples
Epoch 1/13
50000/50000 [==============================] - 14s - loss: 0.1489 - acc: 0.9587 - val_loss: 0.0950 - val_acc: 0.9758
Epoch 2/13
50000/50000 [==============================] - 12s - loss: 0.1543 - acc: 0.9566 - val_loss: 0.0957 - val_acc: 0.9735
Epoch 3/13
50000/50000 [==============================] - 11s - loss: 0.1509 - acc: 0.9586 - val_loss: 0.0985 - val_acc: 0.9752
Epoch 4/13
50000/50000 [==============================] - 11s - loss: 0.1515 - acc: 0.9577 - val_loss: 0.0967 - val_acc: 0.9752
Epoch 5/13
50000/50000 [==============================] - 11s - loss: 0.1471 - acc: 0.9596 - val_loss: 0.1008 - val_acc: 0.9737
Epoch 6/13
50000/50000 [==============================] - 11s - loss: 0.1488 - acc: 0.9598 - val_loss: 0.0989 - val_acc: 0.9749
Epoch 7/13
50000/50000 [==============================] - 11s - loss: 0.1495 - acc: 0.9592 - val_loss: 0.1011 - val_acc: 0.9748
Epoch 8/13
50000/50000 [==============================] - 11s - loss: 0.1434 - acc: 0.9604 - val_loss: 0.1005 - val_acc: 0.9761
Epoch 9/13
50000/50000 [==============================] - 11s - loss: 0.1514 - acc: 0.9590 - val_loss: 0.0951 - val_acc: 0.9759
Epoch 10/13
50000/50000 [==============================] - 11s - loss: 0.1424 - acc: 0.9613 - val_loss: 0.0995 - val_acc: 0.9739
Epoch 11/13
50000/50000 [==============================] - 11s - loss: 0.1408 - acc: 0.9625 - val_loss: 0.0977 - val_acc: 0.9751
Epoch 12/13
50000/50000 [==============================] - 11s - loss: 0.1413 - acc: 0.9601 - val_loss: 0.0938 - val_acc: 0.9753
Epoch 13/13
50000/50000 [==============================] - 11s - loss: 0.1430 - acc: 0.9619 - val_loss: 0.0981 - val_acc: 0.9761

As we can see, with a simple model without any convolution layers we could obtain more than 97.5% accuracy on the validation dataset.

The following figures show the weights learnt at different layers.


Some Tips & tricks to improve accuracy

Here are some tips on what we can do to improve accuracy:

  • Network size
    • More neurons,
    • More layers, (docs)
    • Nonlinearities in the hidden layers
      • tanh, relu, leaky relu, etc
    • Larger networks may take more epochs to train, so don’t discard the net just because it didn’t beat the baseline in 5 epochs.
  • Early Stopping
    • Training for 100 epochs regardless of anything is probably a bad idea.
    • Some networks converge over 5 epochs, others – over 500.
    • Way to go: stop when the validation score has not improved for, say, 10 epochs past its maximum.
  • Faster optimization
    • rmsprop, nesterov_momentum, adam, adagrad and so on.
      • Converge faster and sometimes reach better optima
      • It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
  • Regularize to prevent overfitting
  • Data augmentation – getting a 5x larger dataset for free is a great deal
    • Zoom-in+slice = move
    • Rotate+zoom(to remove black stripes)
    • any other perturbations
    • Simple way to do that (if we have PIL/Image):
      • from scipy.misc import imrotate,imresize
      • and a bit of slicing
    • Stay realistic. There’s usually no point in flipping dogs upside down as that is not the way we usually see them.
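Since scipy.misc.imrotate/imresize mentioned above have been removed from modern SciPy, here is a minimal shift-based augmentation sketch using plain numpy instead:

```python
import numpy as np

# Sketch: translate an image by (dx, dy) pixels with wrap-around,
# a cheap perturbation of the kind listed above.
def shift(img, dx, dy):
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

img = np.zeros((28, 28))
img[10, 10] = 1.0          # a single bright pixel
aug = shift(img, 1, 2)     # moves it to row 12, column 11
print(aug[12, 11])  # 1.0
```

Applying a handful of such shifts and small rotations to each training digit yields the "free" dataset enlargement described above.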

Some Data Processing and Analysis with Python

The following problems appeared as assignments in the edX course Analytics for Computing (by Gatech). The descriptions are taken from the assignments.

1. Association rule mining

First we shall implement the basic pairwise association rule mining algorithm.

Problem definition

Let’s say we have a fragment of text in some language. We wish to know whether there are association rules among the letters that appear in a word. In this problem:

  • Words are “receipts”
  • Letters within a word are “items”

We want to know whether there are association rules of the form a ⟹ b, where a and b are letters, for a given language, by calculating for each rule its confidence, conf(a ⟹ b), which is an estimate of the conditional probability of b given a, i.e., Pr[b|a].

Sample text input

Let’s carry out this analysis on a “dummy” text fragment, which graphic designers refer to as the lorem ipsum:

latin_text= """
Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam
rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto beatae vitae dicta
sunt, explicabo. Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur aut odit aut fugit, sed
quia consequuntur magni dolores eos, qui ratione voluptatem sequi nesciunt, neque porro quisquam est,
qui dolorem ipsum, quia dolor sit amet consectetur adipisci[ng] velit, sed quia non numquam [do] eius
modi tempora inci[di]dunt, ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi
consequatur? Quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, quam nihil molestiae
consequatur, vel illum, qui dolorem eum fugiat, quo voluptas nulla pariatur?

At vero eos et accusamus et iusto odio dignissimos ducimus, qui blanditiis praesentium voluptatum
deleniti atque corrupti, quos dolores et quas molestias excepturi sint, obcaecati cupiditate non
provident, similique sunt in culpa, qui officia deserunt mollitia animi, id est laborum et dolorum
fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est
eligendi optio, cumque nihil impedit, quo minus id, quod maxime placeat, facere possimus, omnis voluptas
assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut officiis debitis aut rerum
necessitatibus saepe eveniet, ut et voluptates repudiandae sint et molestiae non recusandae. Itaque
earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur
aut perferendis doloribus asperiores repellat.
"""

Data cleaning

Like most data in the real world, this dataset is noisy. It has both uppercase and lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters. For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but we should ignore case and ignore repetition within a word.

For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique letters {e, o, r}. That is, we treat the word as a set, keeping only the unique letters. This itemset has three possible item pairs: {e,o}, {e,r}, and {o,r}.

  • We need to start by “cleaning up” (normalizing) the input, with all characters converted to lowercase and all non-alphabetic, non-space characters removed.
  • Next, we need to convert each word into an itemset like the following examples:

consequatur –> {‘a’, ‘e’, ‘c’, ‘s’, ‘n’, ‘o’, ‘u’, ‘t’, ‘q’, ‘r’}
voluptatem –> {‘l’, ‘a’, ‘e’, ‘o’, ‘p’, ‘u’, ‘m’, ‘t’, ‘v’}
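A sketch of this normalization step (the helper names normalize_string and gen_itemsets are illustrative, not necessarily the assignment's):

```python
import re

# Lowercase, drop non-alphabetic and non-space characters.
def normalize_string(s):
    return re.sub(r"[^a-z ]", "", s.lower().replace("\n", " "))

# One itemset (set of unique letters) per word.
def gen_itemsets(s):
    return [set(word) for word in normalize_string(s).split()]

print(gen_itemsets("Sed ut error"))  # [{'s','e','d'}, {'u','t'}, {'e','o','r'}]
```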

Implementing the basic algorithm

The following algorithm is implemented:



First, all item-pairs within an itemset are enumerated, and a table that tracks the counts of those item-pairs is updated in place.

  • Now, given tables of item-pair counts and individual item counts, as well as a confidence threshold, the rules that meet the threshold are returned.
  • The returned rules should be in the form of a dictionary whose key is the tuple (a, b), corresponding to the rule a ⇒ b, and whose value is the confidence of the rule, conf(a ⇒ b).
  • The following functions were implemented to compute the association rules.
    from collections import defaultdict
    from itertools import combinations 
    def update_pair_counts(pair_counts, itemset):
        """Updates a dictionary of pair counts for all pairs of items
        in a given itemset."""
        assert type(pair_counts) is defaultdict
        for item in combinations(itemset, 2):
            pair_counts[item] += 1
            pair_counts[item[::-1]] += 1
        return pair_counts
    def update_item_counts(item_counts, itemset):
        for item in itemset:
            item_counts[item] += 1
        return item_counts
    def filter_rules_by_conf (pair_counts, item_counts, threshold):
        rules = {} # (item_a, item_b) -> conf (item_a => item_b)
        for (item_a, item_b) in pair_counts:
            conf = pair_counts[(item_a, item_b)] / float(item_counts[item_a])
            if conf >= threshold:
                rules[(item_a, item_b)] = conf
        return rules
    def find_assoc_rules(receipts, threshold):
        pair_counts = defaultdict(int)
        item_counts = defaultdict(int)
        for receipt in receipts:
            update_pair_counts(pair_counts, receipt)
            update_item_counts(item_counts, receipt)
        return filter_rules_by_conf(pair_counts, item_counts, threshold)


  • For the Latin string, latin_text, the function find_assoc_rules() was used to compute the rules whose confidence is at least 0.75; the following rules were obtained:

    conf(q => u) = 1.000
    conf(x => e) = 1.000
    conf(x => i) = 0.833
    conf(h => i) = 0.833
    conf(v => t) = 0.818
    conf(r => e) = 0.800
    conf(v => e) = 0.773
    conf(f => i) = 0.750
    conf(b => i) = 0.750
    conf(g => i) = 0.750


  • Now, let’s look at rules common to the above Latin text and the English text obtained by a translation of the lorem ipsum text, shown below:
    english_text = """
    But I must explain to you how all this mistaken idea of denouncing of a pleasure and praising pain was
    born and I will give you a complete account of the system, and expound the actual teachings of the great
    explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids
    pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure
    rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or
    pursues or desires to obtain pain of itself, because it is pain, but occasionally circumstances occur in
    which toil and pain can procure him some great pleasure. To take a trivial example, which of us
    ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any
    right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or
    one who avoids a pain that produces no resultant pleasure?
    On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and
    demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the
    pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty
    through weakness of will, which is the same as saying through shrinking from toil and pain. These
    cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is
    untrammeled and when nothing prevents our being able to do what we like best, every pleasure is to
    be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or
    the obligations of business it will frequently occur that pleasures have to be repudiated and
    annoyances accepted. The wise man therefore always holds in these matters to this principle of
    selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to
    avoid worse pains.
    """


  • Again, for the English string, english_text, the function find_assoc_rules() was used to compute the rules whose confidence is at least 0.75, producing the rules shown below.

    conf(z => a) = 1.000
    conf(j => e) = 1.000
    conf(z => o) = 1.000
    conf(x => e) = 1.000
    conf(q => e) = 1.000
    conf(q => u) = 1.000
    conf(z => m) = 1.000
    conf(z => r) = 1.000
    conf(z => l) = 1.000
    conf(z => e) = 1.000
    conf(z => d) = 1.000
    conf(z => i) = 1.000
    conf(k => e) = 0.778
    conf(q => n) = 0.750


  • Let’s consider any rule with a confidence of at least 0.75 to be a “high-confidence rule”. The common_high_conf_rules are all the high-confidence rules appearing in both the Latin text and the English text. All such rules are shown below:

    q => u
    x => e
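A minimal sketch of how these common rules could be computed: since find_assoc_rules() returns dictionaries keyed by (a, b) pairs, the common rules are just the intersection of the two key sets (the abridged rule values below are copied from the results above; the variable names are assumptions).

```python
# Rule dictionaries as returned by find_assoc_rules() (abridged).
latin_rules = {('q', 'u'): 1.0, ('x', 'e'): 1.0, ('x', 'i'): 0.833}
english_rules = {('q', 'u'): 1.0, ('x', 'e'): 1.0, ('z', 'a'): 1.0}

common_high_conf_rules = sorted(set(latin_rules) & set(english_rules))
print(common_high_conf_rules)  # → [('q', 'u'), ('x', 'e')]
```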


  • The following table and figure show the high-confidence rules for the Latin and the English texts.
    index rule confidence language
    0 z=> o 1.000000 English
    1 z=> l 1.000000 English
    2 z=> m 1.000000 English
    3 q=> u 1.000000 English
    4 q=> e 1.000000 English
    5 x=> e 1.000000 English
    6 z=> e 1.000000 English
    7 j=> e 1.000000 English
    8 z=> a 1.000000 English
    9 z=> d 1.000000 English
    10 q=> u 1.000000 Latin
    11 z=> i 1.000000 English
    12 x=> e 1.000000 Latin
    13 z=> r 1.000000 English
    14 x=> i 0.833333 Latin
    15 h=> i 0.833333 Latin
    16 v=> t 0.818182 Latin
    17 r=> e 0.800000 Latin
    18 k=> e 0.777778 English
    19 v=> e 0.772727 Latin
    20 g=> i 0.750000 Latin
    21 q=> n 0.750000 English
    22 f=> i 0.750000 Latin
    23 b=> i 0.750000 Latin


Putting it all together: Actual baskets!

Let’s take a look at some real data from this link. The first few lines of the transaction data are shown below:

citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product
whole milk,butter,yogurt,rice,abrasive cleaner
other vegetables,UHT-milk,rolls/buns,bottled beer,liquor (appetizer)
pot plants
whole milk,cereals
tropical fruit,other vegetables,white bread,bottled water,chocolate
citrus fruit,tropical fruit,whole milk,butter,curd,yogurt,flour,bottled water,dishes
chicken,tropical fruit
butter,sugar,fruit/vegetable juice,newspapers
fruit/vegetable juice
packaged fruit/vegetables
specialty bar
other vegetables
butter milk,pastry
whole milk
tropical fruit,cream cheese ,processed cheese,detergent,newspapers
tropical fruit,root vegetables,other vegetables,frozen dessert,rolls/buns,flour,sweet spreads,salty snack,waffles,candy,bathroom cleaner
bottled water,canned beer
other vegetables
brown bread,soda,fruit/vegetable juice,canned beer,newspapers,shopping bags
yogurt,beverages,bottled water,specialty bar

  • Our task is to mine this dataset for pairwise association rules to produce a final dictionary, basket_rules, that meet these conditions:
  1. The keys are pairs (a, b), where a and b are item names.
  2. The values are the corresponding confidence scores, conf(a ⇒ b).
  3. Only include rules a ⇒ b where item a occurs at least MIN_COUNT times and conf(a ⇒ b) is at least THRESHOLD.
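A self-contained sketch of this mining step. The toy baskets below stand in for the real grocery data, the inlined pair counting mirrors the helpers used earlier, and the MIN_COUNT / THRESHOLD values are illustrative, not those used for the results that follow.

```python
from collections import defaultdict
from itertools import permutations

MIN_COUNT = 2     # item `a` must appear in at least this many baskets
THRESHOLD = 0.75  # minimum confidence (illustrative values)

# Toy baskets standing in for the grocery data; the real code would
# parse one comma-separated basket per line of the input file.
baskets = [{'milk', 'bread'}, {'milk', 'bread', 'eggs'}, {'eggs'}]

pair_counts = defaultdict(int)
item_counts = defaultdict(int)
for basket in baskets:
    for a, b in permutations(basket, 2):
        pair_counts[(a, b)] += 1
    for item in basket:
        item_counts[item] += 1

basket_rules = {(a, b): pair_counts[(a, b)] / item_counts[a]
                for (a, b) in pair_counts
                if item_counts[a] >= MIN_COUNT
                and pair_counts[(a, b)] / item_counts[a] >= THRESHOLD}
print(basket_rules)  # only the milk <=> bread rules survive the filter
```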

The result is shown below:

Found 19 rules whose confidence exceeds 0.5.
Here they are:

conf(honey => whole milk) = 0.733
conf(frozen fruits => other vegetables) = 0.667
conf(cereals => whole milk) = 0.643
conf(rice => whole milk) = 0.613
conf(rubbing alcohol => whole milk) = 0.600
conf(cocoa drinks => whole milk) = 0.591
conf(pudding powder => whole milk) = 0.565
conf(jam => whole milk) = 0.547
conf(cream => sausage) = 0.538
conf(cream => other vegetables) = 0.538
conf(baking powder => whole milk) = 0.523
conf(tidbits => rolls/buns) = 0.522
conf(rice => other vegetables) = 0.520
conf(cooking chocolate => whole milk) = 0.520
conf(frozen fruits => whipped/sour cream) = 0.500
conf(specialty cheese => other vegetables) = 0.500
conf(ready soups => rolls/buns) = 0.500
conf(rubbing alcohol => butter) = 0.500
conf(rubbing alcohol => citrus fruit) = 0.500

2. Simple string processing with Regex

Phone numbers

  • Write a function to parse US phone numbers written in the canonical “(404) 555-1212” format, i.e., a three-digit area code enclosed in parentheses followed by a seven-digit local number in three-hyphen-four digit format.
  • It should also ignore all leading and trailing spaces, as well as any spaces that appear between the area code and local numbers.
  • However, it should not accept any spaces in the area code (e.g., in ‘(404)’) nor should it in the local number.
  • It should return a triple of strings, (area_code, first_three, last_four).
  • If the input is not a valid phone number, it should raise a ValueError.
The following function implements the regex parser.
import re

def parse_phone(s):
    # Raw string avoids invalid-escape warnings; the trailing $ rejects
    # any non-space garbage after the number.
    pattern = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*$")
    m = pattern.match(s)
    if not m:
        raise ValueError('not a valid phone number!')
    return m.groups()

# print(parse_phone('(404) 201-2121'))
  • Implement an enhanced phone number parser that can handle any of these patterns.
    • (404) 555-1212
    • (404) 5551212
    • 404-555-1212
    • 404-5551212
    • 404555-1212
    • 4045551212
  • As before, it should not be sensitive to leading or trailing spaces. Also, for the patterns in which the area code is enclosed in parentheses, it should not be sensitive to the number of spaces separating the area code from the remainder of the number.

The following function implements the enhanced regex parser.

import re

def parse_phone2(s):
    pattern = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-?(\d{4})\s*$")
    m = pattern.match(s)
    if not m:
        pattern2 = re.compile(r"\s*(\d{3})-?(\d{3})-?(\d{4})\s*$")
        m = pattern2.match(s)
        if not m:
            raise ValueError('not a valid phone number!')
    return m.groups()
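As a quick sanity check, the two-pattern strategy can be exercised against all six accepted formats. The sketch below inlines the patterns with raw strings and a trailing $ anchor so that trailing garbage is rejected rather than silently ignored:

```python
import re

# Same two-pattern strategy as parse_phone2 above, inlined for a check.
pattern_paren = re.compile(r"\s*\((\d{3})\)\s*(\d{3})-?(\d{4})\s*$")
pattern_plain = re.compile(r"\s*(\d{3})-?(\d{3})-?(\d{4})\s*$")

inputs = ['(404) 555-1212', '(404) 5551212', '404-555-1212',
          '404-5551212', '404555-1212', '4045551212']
for s in inputs:
    m = pattern_paren.match(s) or pattern_plain.match(s)
    assert m and m.groups() == ('404', '555', '1212')
print('all six formats parse to', ('404', '555', '1212'))
```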

3. Tidy data and Pandas

“Tidying data” is all about cleaning up tabular data for analysis purposes.

Definition: Tidy datasets. More specifically, Wickham defines a tidy data set as one that can be organized into a 2-D table such that

  1. each column represents a variable;
  2. each row represents an observation;
  3. each entry of the table represents a single value, which may come from either categorical (discrete) or continuous spaces.

Definition: Tibbles. if a table is tidy, we will call it a tidy table, or tibble, for short.

Apply functions to data frames

Given the following pandas DataFrame (the first few rows are shown in the next table), compute the prevalence, which is the ratio of cases to the population, using the apply() function, without modifying the original DataFrame.

country year cases population
0 Afghanistan ’99 745 19987071
1 Brazil ’99 37737 172006362
2 China ’99 212258 1272915272
3 Afghanistan ’00 2666 20595360
4 Brazil ’00 80488 174504898
5 China ’00 213766 1280428583

The next function does exactly that.

def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    H = G.copy()
    H['prevalence'] = H.apply(lambda row: row['cases'] / row['population'], axis=1)
    return H

Tibbles and Bits

Now let’s start creating and manipulating tibbles.

Write a function, canonicalize_tibble(X), that, given a tibble X, returns a new copy Y of X in canonical order. We say Y is in canonical order if it has the following properties.

  1. The variables appear in sorted order by name, ascending from left to right.
  2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
  3. The row labels (Y.index) go from 0 to n-1, where n is the number of observations.

The following code does exactly that:

def canonicalize_tibble(X):
    # Enforce Property 1:
    var_names = sorted(X.columns)
    Y = X[var_names].copy()
    Y = Y.sort_values(by=var_names, ascending=True)
    Y.reset_index(drop=True, inplace=True)
    return Y

Basic tidying transformations: Implementing Melting and Casting

Given a data set and a target set of variables, there are at least two common issues that require tidying.


First, values often appear as columns. Table 4a is an example. To tidy up, we want to turn columns into rows:


Because this operation takes columns into rows, making a “fat” table taller and skinnier, it is sometimes called melting.

To melt the table, we need to do the following.

  1. Extract the column values into a new variable. In this case, columns "1999" and "2000" of table4 need to become the values of the variable, "year".
  2. Convert the values associated with the column values into a new variable as well. In this case, the values formerly in columns "1999" and "2000" become the values of the "cases" variable.

In the context of a melt, let’s also refer to "year" as the new key variable and
"cases" as the new value variable.

Implement the melt operation as a function,

def melt(df, col_vals, key, value):

It should take the following arguments:

  • df: the input data frame, e.g., table4 in the example above;
  • col_vals: a list of the column names that will serve as values;
  • key: name of the new variable, e.g., year in the example above;
  • value: name of the column to hold the values.

The next function implements the melt operation:

def melt(df, col_vals, key, value):
    assert type(df) is pd.DataFrame
    other_cols = list(set(df.columns.tolist()) - set(col_vals))
    df2 = pd.DataFrame()
    for col in col_vals:
        df1 = pd.DataFrame(df[col].tolist(), columns=[value])
        df1[key] = col
        for col1 in other_cols:
            df1[col1] = df[col1]
        # DataFrame.append was removed in pandas 2.0; use pd.concat instead.
        df2 = pd.concat([df2, df1], ignore_index=True)
    df2 = df2[other_cols + [key, value]]
    return df2

with the following output

=== table4a ===
country 1999 2000
0 Afghanistan 745 2666
1 Brazil 37737 80488
2 China 212258 213766
=== melt(table4a) ===
country year cases
0 Afghanistan 1999 745
1 Brazil 1999 37737
2 China 1999 212258
3 Afghanistan 2000 2666
4 Brazil 2000 80488
5 China 2000 213766
=== table4b ===
country 1999 2000
0 Afghanistan 19987071 20595360
1 Brazil 172006362 174504898
2 China 1272915272 1280428583
=== melt(table4b) ===
country year population
0 Afghanistan 1999 19987071
1 Brazil 1999 172006362
2 China 1999 1272915272
3 Afghanistan 2000 20595360
4 Brazil 2000 174504898
5 China 2000 1280428583


The second most common issue is that an observation might be split across multiple rows. Table 2 is an example. To tidy up, we want to merge rows:


Because this operation is the moral opposite of melting, and “rebuilds” observations from parts, it is sometimes called casting.

Melting and casting are Wickham’s terms from his original paper on tidying data. In his more recent writing, on which this tutorial is based, he refers to melting as gathering and to casting as spreading.

The signature of a cast is similar to that of melt. However, we only need to know the key, which is the column of the input table containing new variable names, and the value, which is the column containing the corresponding values.

Implement a function to cast a data frame into a tibble, given a key column containing new variable names and a value column containing the corresponding cells.

Observe that we are asking your cast() to accept an optional parameter, join_how, that may take the values 'outer' or 'inner' (with 'outer' as the default).

The following function implements the casting operation:

def cast(df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble,
    given the key column and value column.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']
    fixed_vars = df.columns.difference([key, value])
    tibble = pd.DataFrame(columns=fixed_vars)  # empty frame
    fixed_vars = fixed_vars.tolist()
    cols = []
    for k, df1 in df.groupby(df[key]):
        tibble = tibble.merge(df1[fixed_vars + [value]],
                              on=fixed_vars, how=join_how)
        cols.append(str(k))
    tibble.columns = fixed_vars + cols
    return tibble

with the following output:

=== table2 ===
country year type count
0 Afghanistan 1999 cases 745
1 Afghanistan 1999 population 19987071
2 Afghanistan 2000 cases 2666
3 Afghanistan 2000 population 20595360
4 Brazil 1999 cases 37737
5 Brazil 1999 population 172006362
6 Brazil 2000 cases 80488
7 Brazil 2000 population 174504898
8 China 1999 cases 212258
9 China 1999 population 1272915272
10 China 2000 cases 213766
11 China 2000 population 1280428583
=== tibble2 = cast (table2, "type", "count") ===
country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583

Separating variables

Consider the following table.


country year rate
0 Afghanistan 1999 745/19987071
1 Afghanistan 2000 2666/20595360
2 Brazil 1999 37737/172006362
3 Brazil 2000 80488/174504898
4 China 1999 212258/1272915272
5 China 2000 213766/1280428583

In this table, the rate variable combines what had previously been the cases
and population data. This example is an instance in which we might want to separate a column into two variables.

Write a function that takes a data frame (df) and separates an existing column (key) into new variables (given by the list of new variable names, into).  How will the separation happen? The caller should provide a function, splitter(x), that given a value returns a list containing the components.


The following code implements the function:

import re

def default_splitter(text):
    """Splits a string of the form "<int>/<int>" into its two integer
    components, returned as a list _of strings_.
    E.g., the call
      default_splitter('745/19987071')
    will return ['745', '19987071'].
    """
    fields = list(re.match(r'(\d+)/(\d+)', text).groups())
    return fields

def separate(df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns, the key,
    into new variables.
    """
    assert type(df) is pd.DataFrame
    assert key in df.columns
    new_cols = df[key].apply(
        lambda s: pd.Series({into[i]: splitter(s)[i] for i in range(len(into))}))
    return df.merge(new_cols, left_index=True, right_index=True).drop(key, axis=1)

with the following output:

=== table3 ===
country year rate
0 Afghanistan 1999 745/19987071
1 Afghanistan 2000 2666/20595360
2 Brazil 1999 37737/172006362
3 Brazil 2000 80488/174504898
4 China 1999 212258/1272915272
5 China 2000 213766/1280428583
=== tibble3 = separate (table3, ...) ===
country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583

Some Computational Photography: Image Quilting (Texture Synthesis) with Dynamic Programming and Texture Transfer (Drawing with Textures) in Python


The following problems appeared as a programming assignment in the Computational Photography course (CS445) at UIUC. The description of the problem is taken from the assignment itself. In this article, a Python implementation of the problems will be described, rather than the MATLAB implementation expected by the course.


The Problems

  • The goal of this assignment is to implement the image quilting algorithm for
    texture synthesis and transfer, described in this SIGGRAPH 2001 paper by Efros
    and Freeman.
  • Texture synthesis is the creation of a larger texture image from a small sample.
  • Texture transfer is giving an object the appearance of having the
    same texture as a sample while preserving its basic shape.
  • For texture synthesis, the main idea is to sample patches and lay them down in overlapping patterns, such that the overlapping regions are similar.
  • The overlapping regions may not match exactly, which will result in noticeable
    edge artifacts. To fix this, we need to compute a path along pixels with similar intensities through the overlapping region and use it to select, for each pixel, which overlapping patch to draw it from.
  • Texture transfer is achieved by encouraging sampled patches to have similar appearance to a given target image, as well as matching overlapping regions of already sampled patches.
  • In this project, we need to apply important techniques such as template matching, finding seams, and masking. These techniques are also useful for image stitching, image completion, image retargeting and blending.


Randomly Sampled Texture

First let’s randomly sample square patches of size patchsize from sample in order to create an output image of size outsize. Start from the upper-left corner, and tile samples until the image is full. If the patches don’t fit evenly into the output image, we can leave black borders at the edges. This is the simplest but least effective method. Save a result from a sample image to compare to the next two methods.
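A minimal sketch of this baseline, for a grayscale sample (the function and parameter names are illustrative, not the course’s reference code):

```python
import numpy as np

def quilt_random(sample, outsize, patchsize):
    """Tile an outsize x outsize output with random patchsize patches
    drawn from `sample` (grayscale here for brevity; RGB is analogous).
    A sketch of the simplest baseline described above."""
    rng = np.random.default_rng(0)
    out = np.zeros((outsize, outsize), dtype=sample.dtype)
    H, W = sample.shape
    for i in range(0, outsize - patchsize + 1, patchsize):
        for j in range(0, outsize - patchsize + 1, patchsize):
            # Pick a random top-left corner inside the sample.
            r = rng.integers(0, H - patchsize + 1)
            c = rng.integers(0, W - patchsize + 1)
            out[i:i+patchsize, j:j+patchsize] = sample[r:r+patchsize, c:c+patchsize]
    return out

sample = np.arange(100, dtype=float).reshape(10, 10)
out = quilt_random(sample, outsize=25, patchsize=5)
print(out.shape)  # → (25, 25)
```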


Overlapping Patches

Let’s start by sampling a random patch for the upper-left corner. Then sample new patches to overlap with existing ones. For example, the second patch along the top row will overlap by patchsize pixels in the vertical direction and overlap pixels in the horizontal direction. Patches in the first column will overlap by patchsize pixels in the horizontal direction and overlap pixels in the vertical direction. Other patches will have two overlapping regions (on the top and left) which should both be taken into account. Once the cost of each patch has been computed, randomly choose one patch whose cost is less than a threshold determined by some tolerance value.

As described in the paper, the size of the block is the only parameter controlled by the user and it depends on the properties of a given texture; the block must be big enough to capture the relevant structures in the texture, but small enough so that the interaction between these structures is left up to the algorithm. The overlap size is taken to be one-sixth of the block size (B/6) in general.
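The overlap cost and the tolerance-based choice described above can be sketched as follows (function names and the exact tolerance rule are assumptions based on the paper’s description):

```python
import numpy as np

def ssd_cost(patch, template, mask):
    """Sum of squared differences between `patch` and `template`,
    restricted to the overlap region given by the boolean `mask`."""
    return float(((patch - template) ** 2 * mask).sum())

def choose_patch(costs, tol=0.1):
    """Pick a random candidate whose cost is within (1 + tol) of the
    best cost (plus a tiny epsilon so a zero best cost still admits
    candidates), following the paper's tolerance idea."""
    costs = np.asarray(costs, dtype=float)
    minc = costs.min()
    ok = np.flatnonzero(costs <= minc * (1 + tol) + 1e-9)
    rng = np.random.default_rng(0)
    return int(rng.choice(ok))

costs = [5.0, 5.2, 12.0, 40.0]
idx = choose_patch(costs, tol=0.1)
print(idx)  # one of the near-minimal candidates: 0 or 1
```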


Seam Finding

Next we need to find the min-cost contiguous path from the left to the right side of the patch according to the cost. The cost of a path through each pixel is the squared difference (summed over RGB for color images) between the output image and the newly
sampled patch. Use dynamic programming to find the min-cost path.
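The dynamic program can be sketched as follows: each column’s cumulative cost is the local error plus the cheapest of the three adjacent cells in the previous column, and the seam is recovered by backtracking (a sketch, not the course’s reference implementation):

```python
import numpy as np

def min_cost_seam(errors):
    """Dynamic-programming seam through a 2-D cost array, moving one
    column at a time from left to right and stepping at most one row
    up or down; returns the row index chosen in each column."""
    h, w = errors.shape
    cost = errors.copy()
    for j in range(1, w):
        for i in range(h):
            lo, hi = max(i - 1, 0), min(i + 2, h)
            cost[i, j] += cost[lo:hi, j - 1].min()
    # Backtrack from the cheapest cell in the last column.
    seam = [int(cost[:, -1].argmin())]
    for j in range(w - 1, 0, -1):
        i = seam[-1]
        lo, hi = max(i - 1, 0), min(i + 2, h)
        seam.append(lo + int(cost[lo:hi, j - 1].argmin()))
    return seam[::-1]

errors = np.array([[1., 9., 9.],
                   [9., 1., 9.],
                   [9., 9., 1.]])
print(min_cost_seam(errors))  # → [0, 1, 2]
```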

The following figure describes the algorithm to be implemented for image quilting.


Texture Transfer

The final task is to implement texture transfer, based on the quilt implementation for creating a texture sample that is guided by a pair of sample/target correspondence images (section 3 of the paper). The main difference between this function and
quilt function is that there is an additional cost term based on the difference between
the sampled source patch and the target patch at the location to be filled.
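The additional cost term can be sketched as a weighted sum of the overlap error and the target-correspondence error, following section 3 of the paper (alpha and the variable names here are illustrative, not from the course code):

```python
import numpy as np

def transfer_cost(patch, overlap_template, overlap_mask, target_patch, alpha=0.8):
    """Combined error used when choosing a source patch for texture
    transfer: alpha weights agreement with the already-synthesized
    overlap, (1 - alpha) weights agreement with the correspondence
    map of the target image (section 3 of Efros & Freeman)."""
    overlap_err = (((patch - overlap_template) ** 2) * overlap_mask).sum()
    target_err = ((patch - target_patch) ** 2).sum()
    return alpha * overlap_err + (1 - alpha) * target_err

# Demo: a patch matching the target exactly but not the overlap.
print(transfer_cost(np.ones((2, 2)), np.zeros((2, 2)),
                    np.ones((2, 2)), np.ones((2, 2))))  # → 3.2
```

Decreasing alpha pulls the synthesized output closer to the target image, at the price of rougher texture continuity.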

Image quilting (texture synthesis) results

The following figures and animations show the results of the outputs obtained with the quilting algorithm. The input texture images are mostly taken from the paper.

Input sample Texture

100 sampled blocks of a fixed size (e.g. 50×50) from the input sample

The next animation shows how the large output texture gets created (100 times larger than the input sample texture) with the quilting algorithm.


Output Texture (10×10 times larger than the input) created with texture synthesis (quilting)


Input Texture


Output Texture (25 times larger than the input) created with texture synthesis (quilting) with the minimum cost seams (showed as red lines) computed with dynamic programming


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (12 times larger than the input) created with quilting


Input Texture


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (36 times larger than the input) created with quilting


Input Texture


Output Texture (9 times larger than the input) created with quilting


Input Texture


Output Texture (25 times larger than the input) created with quilting


Input Texture


Output Texture (9 times larger than the input) created with quilting along with the min-cost seams (shown as red lines) computed with dynamic programming 


Output Texture (9 times larger than the input) created with quilting



Texture transfer results

The following figures and animations show how the texture from an input image can be transferred to the target image using the above-mentioned simple modification of the quilting algorithm. Again, some of the images are taken from the paper.

Input Texture (milk)


Target Image


Output Image after Texture Transfer



Input Texture (milk)


Target Image


Output Image after Texture Transfer



The following figures show the output image obtained when a few textures were transferred to my image.

Target Image (me)



Input Texture (fire)


Output Image after Texture Transfer  (with small block size)




Input Texture (cloud)


Output Image after Texture Transfer  (with small block size)



Input Texture (toast)

Output Image after Texture Transfer  (with small block size)

Now, applying some gradient mixing such as Poisson blending on the original image and the one after texture transfer with the target toast image, we get the following two images respectively.



Input Texture

Output Image after Texture Transfer  (with small block size)



Input Texture


Output Image after Texture Transfer  (with small block size)



Drawing with Textures


The following figures show the output image obtained when a few textures were transferred to another image of mine.


Target Image (me)



Some textures from a few famous painting-masterpieces (obtained from here, here, here, here and here)

Input Texture (Van Gogh’s The Starry Night)


Output Images after Texture Transfer  (with 3 different patch sizes)



Input Texture (Van Gogh’s Wheatfield with Cypresses)


Output Image after Texture Transfer


Input Texture (Van Gogh’s The Mulberry Tree)


Output Image after Texture Transfer


Input Texture (Van Gogh’s Wheatfield with Crows)


Output Images after Texture Transfer  (with 2 different patch sizes)


Input Texture (Van Gogh’s Vase with fifteen Sunflowers)


Output Images after Texture Transfer  (with 2 different patch sizes)


Input Texture (Van Gogh’s Sunflowers (F452))


Output Image after Texture Transfer


Input Texture (Van Gogh’s Irises)


Output Image after Texture Transfer


Input Texture (Van Gogh’s Almond Blossom)


Output Image after Texture Transfer


Input Texture (Van Gogh’s Starry Night Over the Rhone)


Output Image after Texture Transfer


Input Texture (Pablo Picasso’s Mediterranean Landscape)


Output Images after Texture Transfer  (with 2 different patch sizes)


Input Texture (Pablo Picasso’s Minotauromachy)


Output Image after Texture Transfer


Input Texture (Pablo Picasso’s Mother and Child 1921)


Output Image after Texture Transfer


Input Texture (Pablo Picasso’s Factory at Horto de Ebro)


Output Images after Texture Transfer  (with 2 different patch sizes)


Input Texture (Pablo Picasso’s Carafe and Three Bowls)

Output Image after Texture Transfer

Input Texture (Pablo Picasso’s Bullfight 3) 

Output Image after Texture Transfer

Input Texture (Pablo Picasso’s Accordionist) 


Output Image after Texture Transfer

Input Texture (Pablo Picasso’s Las Meninas)

Output Image after Texture Transfer

Input Texture (Claude Monet’s The Artist’s Garden at Giverny)


Output Image after Texture Transfer

Input Texture (Claude Monet’s The Poppy Field near Argenteuil)


Output Image after Texture Transfer


Input Texture (Claude Monet’s The Magpie)


Output Image after Texture Transfer


Input Texture (Claude Monet’s The Garden of Monet at Argenteuil)


Output Images after Texture Transfer  (with 2 different patch sizes)

Input Texture (Claude Monet’s Sunset in Venice)


Output Image after Texture Transfer


Input Texture (Claude Monet’s Waterloo Bridge)


Output Image after Texture Transfer

Input Texture (Claude Monet’s Water Lilies)

Output Image after Texture Transfer


Input Texture (Claude Monet’s Impression Sunrise)


Output Image after Texture Transfer


Input Texture (Claude Monet’s Sunflowers)


Output Image after Texture Transfer


Input Texture (Abanindranath Tagore’s Canal in Shahjadpur)


Output Image after Texture Transfer


Input Texture (Abanindranath Tagore’s The Victory of Buddha)


Output Image after Texture Transfer



Input Texture (Abanindranath Tagore’s Rabindranath in the role of  blind singer)


Output Image after Texture Transfer


Input Texture (Abanindranath Tagore’s Slaying the Tornado Demon)


Output Image after Texture Transfer


Input Texture (Nandalal Bose’s Village Huts)


Output Image after Texture Transfer


Some more Textures

Input Texture


Output Images after Texture Transfer  (with 2 different patch sizes)



Input Texture


Output Images after Texture Transfer  (with 2 different patch sizes)



Input Texture


Output Image after Texture Transfer



Input Texture


Output Images after Texture Transfer  (with 2 different patch sizes)



Input Texture


Output Images after Texture Transfer  (with 2 different patch sizes)



Input Texture


Output Images after Texture Transfer (with 2 different patch sizes)





Input Texture


Output Image after Texture Transfer




Input Texture


Output Image after Texture Transfer




Input Texture


Output Image after Texture Transfer



Input Texture


Output Image after Texture Transfer




Input Texture


Output Image after Texture Transfer




Input Texture


Output Images after Texture Transfer  (with 2 different patch sizes)



Input Texture

Output Image after Texture Transfer




Target Image (from The Mummy 1999)

Input Texture (sand)


Output Image after Texture Transfer




The following animation shows how the milk texture is transferred to the target image of mine with the modified quilting algorithm.

Input Texture


Target Image (me)


Output Image after Texture Transfer  (with small block size)



The next figures and animations show the output images obtained after the milk texture gets transferred to the target image of mine, for different block sizes of the samples (shown in red). As can be seen from the following outputs, the target image becomes more and more prominent as the sampling block size gets smaller.




Feature Detection with Harris Corner Detector and Matching images with Feature Descriptors in Python

The following problem appeared in a project in this Computer Vision Course (CS4670/5670, Spring 2015) at Cornell. In this article, a python implementation is going to be described. The description of the problem is taken (with some modifications) from the project description. The same problem appeared in this assignment problem as well. The images used for testing the algorithms implemented are mostly taken from these assignments / projects.

The Problem

In this project, we need to detect discriminating features in an image and find the best matching features in other images.  The features should be reasonably invariant to translation, rotation, and illumination.  The slides presented in class can be used as reference.


The project has three parts:  feature detection, feature description, and feature matching.

1. Feature detection

In this step, we need to identify points of interest in the image using the Harris corner detection method.  The steps are as follows: 

  • For each point in the image, consider a window of pixels around that point.
  • Compute the Harris matrix H for (the window around) that point, defined as

        H = sum_p w(p) * [ Ix(p)^2      Ix(p)Iy(p) ]
                         [ Ix(p)Iy(p)   Iy(p)^2    ]

    where the summation is over all pixels p in the window, Ix is the x derivative of the image at point p, and Iy is the y derivative.
  • The weights w(p) are chosen to be circularly symmetric; a 9×9 Gaussian kernel with sigma 0.5 is chosen to achieve this.
  • Note that H is a 2×2 matrix.  To find interest points, first we need to compute the corner strength function, c(H) = det(H) − κ·trace(H)² (with κ = 0.04, as used in the results below).

  • Once c is computed for every point in the image, we need to choose points where c is above a threshold.
  • We also want c to be a local maximum in a 9×9 neighborhood (with non-maximum suppression).
  • In addition to computing the feature locations, we need to compute a canonical orientation for each feature, and then store this orientation (in degrees) in each feature element.
  • To compute the canonical orientation at each pixel, we need to compute the gradient of the blurred image and use the angle of the gradient as orientation.
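The steps above can be sketched as follows. This is a simplified implementation using scipy; the Sobel derivatives and parameter defaults are assumptions, not the project's exact filters:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, maximum_filter, sobel

def harris_response(img, sigma=0.5, kappa=0.04):
    """Corner strength c = det(H) - kappa * trace(H)**2, with the entries
    of H accumulated under a Gaussian window."""
    Ix = sobel(img, axis=1)                  # x derivative
    Iy = sobel(img, axis=0)                  # y derivative
    Ixx = gaussian_filter(Ix * Ix, sigma)    # windowed products -> H entries
    Iyy = gaussian_filter(Iy * Iy, sigma)
    Ixy = gaussian_filter(Ix * Iy, sigma)
    det = Ixx * Iyy - Ixy ** 2
    trace = Ixx + Iyy
    return det - kappa * trace ** 2

def detect_corners(img, threshold, size=9):
    """Keep points above `threshold` that are also local maxima in a
    size x size neighborhood (non-maximum suppression)."""
    c = harris_response(img)
    local_max = (c == maximum_filter(c, size=size))
    return np.argwhere((c > threshold) & local_max)

# A bright square on a dark background has corners at its 4 vertices.
img = np.zeros((32, 32))
img[8:24, 8:24] = 1.0
corners = detect_corners(img, threshold=0.1)
```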

2. Feature description

  • Now that the points of interest are identified,  the next step is to come up with a descriptor for the feature centered at each interest point.  This descriptor will be the representation to be used to compare features in different images to see if they match.
  • In this article, we shall implement a simple descriptor, an 8×8 square window without orientation. This should be very easy to implement and should work well when the images we’re comparing are related by a translation. We also normalize the window to have zero mean and unit variance, in order to obtain illumination invariance.
  • Rotational invariance can be obtained with the MOPS descriptor, which takes the orientation into account; it is not discussed in this article for the time being.
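The simple descriptor can be sketched as follows (image-border handling is omitted, and the names are illustrative):

```python
import numpy as np

def simple_descriptor(img, r, c, size=8):
    """size x size window centered near (r, c), normalized to zero mean
    and unit variance for illumination invariance."""
    half = size // 2
    win = img[r - half:r + half, c - half:c + half].astype(float).ravel()
    std = win.std()
    if std < 1e-8:                    # flat window: return all zeros
        return np.zeros_like(win)
    return (win - win.mean()) / std

# A vertical intensity ramp: the descriptor is invariant to the
# window's absolute brightness and contrast.
img = np.outer(np.arange(16.0), np.ones(16))
d = simple_descriptor(img, 8, 8)
print(d.shape)  # → (64,)
```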

3. Feature matching

  • Now that the features in the image are detected and described, the next step is to write code to match them, i.e., given a feature in one image, find the best matching feature in one or more other images.
  • The simplest approach is the following:  write a procedure that compares two features and outputs a distance between them.  For example, we can simply sum the absolute values of the differences between the descriptor elements.
  • We then use this distance to compute the best match between a feature in one image and the set of features in another image by finding the one with the smallest distance. The distance used here is the Manhattan distance.
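The Manhattan-distance matching described above can be sketched with a couple of lines of broadcasting (a sketch; the real project code works on feature records rather than raw arrays):

```python
import numpy as np

def match_features(desc1, desc2):
    """For each row (descriptor) of desc1, return the index of the
    closest row of desc2 under the Manhattan (L1) distance."""
    # Pairwise L1 distances via broadcasting: an (n1, n2) matrix.
    dists = np.abs(desc1[:, None, :] - desc2[None, :, :]).sum(axis=2)
    return dists.argmin(axis=1)

d1 = np.array([[0., 0.], [5., 5.]])
d2 = np.array([[5., 4.], [1., 0.]])
print(match_features(d1, d2))  # → [1 0]
```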


The theory and math for Harris Corner Detection used here are taken from this youtube video.


The following figure shows the structure of the Python code implementing the algorithm.


Feature Detection (with Harris Corner Detection): Results on a few images

  • The threshold used for Harris Corner Detection is varied (shown in red in the following animations), with the value of the threshold being 10^x, where x (the common logarithm of the threshold) is displayed.
  • The corner strength function with kappa=0.04 is used instead of the minimum eigenvalue (since it’s faster to compute).
  • As can be seen from the following animations, fewer and fewer corner features are detected as the threshold is increased.
  • The direction and magnitude of each feature are shown by the bounding (green) square’s angle with the horizontal and the square’s size, respectively, computed from the gradient matrices.
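The two corner-strength measures mentioned above, the strength function R with kappa=0.04 and the minimum eigenvalue, can be compared in a short sketch. The function names and the closed-form eigenvalue computation are mine, assuming the usual 2×2 Harris matrix H with smoothed entries Ixx, Ixy, Iyy:

```python
import numpy as np

def corner_strength(Ixx, Ixy, Iyy, kappa=0.04):
    # R = det(H) - kappa * trace(H)^2: a fast surrogate for the
    # minimum eigenvalue (no square root needed)
    return Ixx * Iyy - Ixy ** 2 - kappa * (Ixx + Iyy) ** 2

def min_eigenvalue(Ixx, Ixy, Iyy):
    # closed-form smaller eigenvalue of the symmetric 2x2 matrix H
    tr = Ixx + Iyy
    det = Ixx * Iyy - Ixy ** 2
    return tr / 2 - np.sqrt((tr / 2) ** 2 - det)
```

Both measures are large only when both eigenvalues are large, i.e., at corners; thresholding either one before non-maximum suppression gives the maps shown below.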

Input Image: flower.jpg

Harris Corner Features Detected for different threshold values (log10)

Input Image

The following figure shows the result of thresholding on

  1. the Harris corner strength R values and
  2. the minimum eigenvalue for the Harris matrix respectively,

for each pixel, before applying non-maximum suppression (computing the local maximum).


The next animation shows the features detected after applying non-maximum suppression, with different threshold values.

Harris Corner Features Detected for different threshold values (log10)

Input Image

Harris Corner Features Detected for different threshold values (log10)

Input Image

Harris Corner Features Detected for different threshold values (log10)


Input Image

Harris Corner Features Detected for different threshold values (log10)

Computing Feature descriptors

  • In this article, we shall implement a very simple descriptor: an 8×8 square window without orientation. This is expected to work well when the images being compared are related by a translation.
  • We also normalize the window to have zero mean and unit variance, in order to obtain illumination invariance.
  • A rotation-invariant descriptor such as MOPS, which takes the feature orientation into account, is not discussed in this article for the time being.


Matching Images with Detected Features: Results on a few images

  • First the Harris corner features and the simple descriptors are computed for each of the images to be compared.
  • Next, the distance between each pair of corner feature descriptors is computed, by simply summing the absolute values of the differences between the descriptor elements.
  • This distance is then used to compute the best match between a feature in one image and the set of features in another image by finding the one with the smallest distance.
  • The following examples show how the matching works with simple feature descriptors around the Harris corners, for images related by a translation.

Input images (one is a translation of the other) 

liberty2           liberty1
Harris Corner Features Detected for the images

Matched Features with minimum sum of absolute distance


Input images
me        me3

Harris Corner Features Detected for the images

hc_0.000010me        hc_0.000010me3

Matched Features with minimum sum of absolute distance

The following example shows input images related by a more complex transformation (not just a translation); as expected, the simple feature descriptor does not work well in this case, as shown. We need feature descriptors such as SIFT to obtain robustness against rotation and scaling as well.

Input images
trees_002  trees_003

Harris Corner Features Detected for the images
hc_0.000050trees_002   hc_0.000050trees_003

Matched Features with minimum sum of absolute distance