Deep Learning with TensorFlow in Python

The following problems appeared in the first few assignments in the Udacity course Deep Learning (by Google). The descriptions of the problems are taken from the assignments.

Classifying the letters with notMNIST dataset

Let’s first learn about simple data curation practices, and familiarize ourselves with some of the data that are going to be used for deep learning using tensorflow. The notMNIST dataset to be used with python experiments. This dataset is designed to look like the classic MNIST dataset, while looking a little more like real data: it’s a harder task, and the data is a lot less ‘clean’ than MNIST.


First the dataset needs to be downloaded and extracted to a local machine. The data consists of characters rendered in a variety of fonts on a 28×28 image. The labels are limited to ‘A’ through ‘J’ (10 classes). The training set has about 500k and the testset 19000 labelled examples. Given these sizes, it should be possible to train models quickly on any machine.

Let’s take a peek at some of the data to make sure it looks sensible. Each exemplar should be an image of a character A through J rendered in a different font. Display a sample of the images downloaded.


Now let’s load the data in a more manageable format. Let’s convert the entire dataset into a 3D array (image index, x, y) of floating point values, normalized to have approximately zero mean and standard deviation ~0.5 to make training easier down the road. A few images might not be readable, we’ll just skip them. Also the data is expected to be balanced across classes. Let’s verify that. The following output shows the dimension of the ndarray for each class.

(52909, 28, 28)
(52911, 28, 28)
(52912, 28, 28)
(52911, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52912, 28, 28)
(52911, 28, 28)

Now some more preprocessing is needed:

  1. First the training data (for different classes) needs to be merged and pruned.
  2. The labels will be stored into a separate array of integers from 0 to 9.
  3. Also let’s create a validation dataset for hyperparameter tuning.
  4. Finally the data needs to be randomized. It’s important to have the labels well shuffled for the training and test distributions to match.

After preprocessing, let’s peek a few samples from the training dataset and the next figure shows how it looks.


Here is how the validation dataset looks (a few samples chosen).


Here is how the test dataset looks (a few samples chosen).


Trying a few off-the-shelf classifiers

Let’s get an idea of what an off-the-shelf classifier can give us on this data. It’s always good to check that there is something to learn, and that it’s a problem that is not so trivial that a canned solution solves it. Let’s first train a simple LogisticRegression model from sklearn (using default parameters) on this data using 5000 training samples.

The following figure shows the output from the logistic regression model trained, its accuracy on the test dataset (also the confusion matrix) and a few test instances classified wrongly (predicted labels along with the true labels) by the model.



Deep Learning with TensorFlow

Now let’s progressively train deeper and more accurate models using TensorFlow. Again, we need to do the following preprocessing:

  1. Reformat into a shape that’s more adapted to the models we’re going to train: data as a flat matrix,
  2. labels as 1-hot encodings.

Now let’s train a multinomial logistic regression using simple stochastic gradient descent.

TensorFlow works like this:

First we need to describe the computation that is to be performed: what the inputs, the variables, and the operations look like. These get created as nodes over a computation graph.

Then we can run the operations on this graph as many times as we want by calling, providing it outputs to fetch from the graph that get returned.

  1. Let’s load all the data into TensorFlow and build the computation graph corresponding to our training.
  2. Let’s use stochastic gradient descent training (with ~3k steps), which is much faster. The graph will be similar to batch gradient descent, except that instead of holding all the training data into a constant node, we create a Placeholder node which will be fed actual data at every call of

Logistic Regression with SGD

The following shows the fully connected computation graph and the results obtained.



Neural Network with SGD

Now let’s turn the logistic regression example with SGD into a 1-hidden layer neural network with rectified linear units nn.relu() and 1024 hidden nodes. As can be seen from the below results, this model improves the validation / test accuracy.



Regularization with Tensorflow

Previously we trained a logistic regression and a neural network model with Tensorflow. Let’s now explore the regularization techniques.

Let’s introduce and tune L2regularization for both logistic and neural network models. L2  amounts to adding a penalty on the norm of the weights to the loss. In TensorFlow, we can compute the L2loss for a tensor t using nn.l2_loss(t).

The right amount of regularization improves the validation / test accuracy, as can be seen from the following results.

L2 Regularized Logistic Regression with SGD

The following figure recapitulates the simple network without anyt hidden layer, with softmax outputs.


The next figures visualize the weights learnt for the 10 output neurons at different steps using SGD and L2 regularized loss function (with λ=0.005). As can be seen below, the weights learnt are gradually capturing the different features of the letters at the corresponding output neurons.


As can be seen, the test accuracy also gets improved to 88.3% with L2 regularized loss function (with λ=0.005).


The following results show the accuracy and the weights learnt for couple of different values of λ (0.01 and 0.05 respectively). As can be seen, with higher values of λ, the logistic regression model tends to underfit and test accuracy decreases.



Regularized Neural Network with SGD

The following figure recapitulates the neural network with a single hidden layer with 1024 nodes, with relu intermediate outputs. The L2 regularizations applied on the loss function for the weights learnt at the input and the hidden layers are λ1 and λ2, respectively.



The next figures visualize the weights learnt for 225 randomly selected hidden neurons (out of 1024) at different steps using SGD and L2 regularized loss function (with λλ0.01). As can be seen below, the weights learnt are gradually capturing (as the SGD steps increase) the different features of the letters at the corresponding output neurons.
If the regularization parameters used are λ1=λ2=0.005, the test accuracy increases to 91.1%, as shown below.


Let’s demonstrate an extreme case of overfitting. Restrict your training data to just a few batches. What happens?

Let’s restrict the training data to n=5 batches. The following figures show how it increases the training accuracy to 100% (along with a visualization of a few randomly-picked input weights learnt) and decreases the validation and the test accuracy, since the model is not generalized enough.






Introduce Dropout on the hidden layer of the neural network. Remember: Dropout should only be introduced during training, not evaluation, otherwise our evaluation results would be stochastic as well. TensorFlow provides nn.dropout() for that, but you have to make sure it’s only inserted during training.

What happens to our extreme overfitting case?

As we can see from the below results, introducing a dropout rate of 0.4 increases the validation and test accuracy by reducing the overfitting.




Till this point the highest accuracy on the test dataset using a single hidden-layer neural network is 91.1%. More hidden layers can be used / some other techniques (e.g., exponential decay in learning rate can be used) to improve the accuracy obtained (to be continued…).


Some NLP: Probabilistic Context Free Grammar (PCFG) and CKY Parsing in Python

This problem appeared as an assignment in the coursera course Natural Language Processing (by Stanford) in 2012. The following description of the problem is taken directly from the assignment description.

In this article, a probabilistic parser will be built by implementing the CKY parser. The Manually Annotated Sub-Corpus (MASC) from the American National Corpus (ANC): will be used for this purpose.


First, we need to learn a PCFG from the training trees. Since the training set is handparsed this learning is very easy. We need to simply set:


where C(N_jζ is the count observed for that production in the data set. While we could consider smoothing rule rewrite probabilities, it is sufficient to just work with un-smoothed MLE probabilities for rules. (Doing anything else makes things rather more complex and slow, since every rewrite will have a nonzero probability, so let’s get things working with an un-smoothed grammar before considering adding smoothing!).

At the very beginning, all the train and the test trees are read in. The training trees are going to be used to construct a PCFG parser, by learning the PCFG grammar. The parser is then used to predict trees for the sentences in the test set. For each test sentence, the parse given by the parser is evaluated by comparing the constituents it generates with the constituents in the hand-parsed version. From this, precision, recall and the F1 score are calculated.

There are the following basic data structures:

• ling.Tree: CFG tree structures (ling.Trees.PennTreeRenderer)

• Lexicon: Pre-terminal productions and probabilities

• Grammar, UnaryRule, BinaryRule: CFG rules and accessors

Most parsers require grammars in which the rules are at most binary branching. Hence, first we need to binarize the trees and then construct a Grammar out of them using MLE.

The Task

The first job is to build a CKY parser using this PCFG grammar learnt. Scan through a few of the training trees in the MASC dataset to get a sense of the range of inputs. Something worth noticing is that the grammar has relatively few non-terminal symbols but thousands of rules, many ternary-branching or longer. Currently there are 38 MASC train files and 11 test files.

Once we have a parser working on the treebank, the next task is improve upon the supplied grammar by adding 1st / 2nd-order vertical markovization. This means using parent annotation symbols like NP^S to indicate a subject noun phrase instead of just NP.

The Dynamic Programming Algorithm (CKY) for Parsing

The following figure shows the CKY algorithm to compute the best possible (most probable) parse tree for a sentence using the PCFG grammar learnt from the training dataset.


The following animation (prepared from the lecture slides of the same course) shows how the chart for CKY is constructed using dynamic programming for  a small set of PCFG grammar.



For this assignment we will use your average F1 score to evaluate the correctness of the CKY parser, although in essence we ought to know from the output on the development set (devtest) whether the parser is implemented correctly. The following figure shows an example evaluation:



(1) First let’s use a toy minimal training dataset containing just 3 POS-tagged trees, and a dev/test dataset with a single test sentence (with ground-truth), to start with. The following figure shows all the training trees.  There are just enough productions in the training set for the test sentence to have an ambiguity due to PP-attachment.


The following figure shows the PCFG learnt from these training trees. Now let’s try to parse a single test sentence‘cats scratch walls with claws’ with the CKY parser and using the PCFG grammar learnt. The following figure shows the manual (gold) parse tree and the best (most probable) parse tree using the CKY dynamic programming algorithm.



(2) Now, let’s use a much larger training dataeset MASC  (with a total of 3595
annotated training trees), a few of them are shown in the next figure.


Let’s use all these 3595  POS-tagged training trees to learn the PCFG grammar.

  • There are ~10k of  lexicon rules producing terminals (with non-zero probabilities) are learnt, some of them are shown below:lex
  • There are ~4.5k binary rules (with non-zero probabilitiesproducing terminals are learnt, some of them are shown below:bin.png
  • There are ~3.5k unary rules (with non-zero probabilitiesproducing terminals are learnt, some of them are shown below:uni.png

Then let’s evaluate/compare the best parse trees obtained (guessed) with CKY for a few testsentences from the dev/test dataset using the PCFG learnt, with the manually (gold) parsed test trees (there are 355 of them) using precision, recall and F1 score. A few of the test sentence parses are shown in the following figures.wvm1.png



Vertical Markovization

The independence assumptions of a PCFG are ofen too strong. In order to indicate the dependency on the parent non-terminal in a tree the grammar rules can be re-written depending on past k ancestor nodes, denoting the order-k vertical Markovization, as explained in the following figure, which often increases the accuracy of the parse trees.


There are ~14k of  lexicon rules producing terminals (with non-zero probabilities) are learnt, ~6k binary rules and ~5k  unary rules are learnt, some of them are shown below:


The following figure shows the best parse trees obtained with CKY for a sentence using PCFG learnt with and without vertical Markovization from the minimal dataset.


Similarly, using the MASC dataset, as can be seen for the following particular test sentence, the CKY parser performs much better with the PCFG learnt from the Vertically Markovized of the training trees:vm1.png

A few more parse trees guessed by the CKY using the PCFG  learnt from the vertically Markovized training trees are shown below:


The markdown file can be found here.