Some Deep Learning with Python, TensorFlow and Keras


The following problems are taken from a few assignments from the coursera courses Introduction to Deep Learning (by Higher School of Economics) and Neural Networks and Deep Learning (by Prof Andrew Ng, The problem descriptions are taken straightaway from the assignments.


1. Linear models, Optimization

In this assignment a linear classifier will be implemented and it will be trained using stochastic gradient descent with numpy.

Two-dimensional classification

To make things more intuitive, let’s solve a 2D classification problem with synthetic data.



As we  can notice the data above isn’t linearly separable. Hence we should add features (or use non-linear model). Note that decision line between two classes have form of circle, since that we can add quadratic features to make the problem linearly separable. The idea under this displayed on image below:


Here are some test results for the implemented expand function, that is used for adding quadratic features:

# simple test on random numbers

dummy_X = np.array([

# call expand function
dummy_expanded = expand(dummy_X)

# what it should have returned:   x0       x1       x0^2     x1^2     x0*x1    1
dummy_expanded_ans = np.array([[ 0.    ,  0.    ,  0.    ,  0.    ,  0.    ,  1.    ],
                               [ 1.    ,  0.    ,  1.    ,  0.    ,  0.    ,  1.    ],
                               [ 2.61  , -1.28  ,  6.8121,  1.6384, -3.3408,  1.    ],
                               [-0.59  ,  2.1   ,  0.3481,  4.41  , -1.239 ,  1.    ]])


Logistic regression

To classify objects we will obtain probability of object belongs to class ‘1’. To predict probability we will use output of linear model and logistic function:


def probability(X, w):
    Given input features and weights
    return predicted probabilities of y==1 given x, P(y=1|x), see description above
    :param X: feature matrix X of shape [n_samples,6] (expanded)
    :param w: weight vector w of shape [6] for each of the expanded features
    :returns: an array of predicted probabilities in [0,1] interval.

    return 1. / (1 + np.exp(, w)))

In logistic regression the optimal parameters w are found by cross-entropy minimization:


def compute_loss(X, y, w):
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute scalar loss function using formula above.
    return -np.mean(y*np.log(probability(X, w)) + (1-y)*np.log(1-probability(X, w)))

Since we train our model with gradient descent, we should compute gradients. To be specific, we need the following derivative of loss function over each weight:


Here is the derivation (can be found here too):


def compute_grad(X, y, w):
    Given feature matrix X [n_samples,6], target vector [n_samples] of 1/0,
    and weight vector w [6], compute vector [6] of derivatives of L over each weights.
    return, w) - y), X) / X.shape[0]


In this section we’ll use the functions we wrote to train our classifier using stochastic gradient descent. We shall try to change hyper-parameters like batch size, learning rate and so on to find the best one.

Mini-batch SGD

Stochastic gradient descent just takes a random example on each iteration, calculates a gradient of the loss on it and makes a step:


w = np.array([0, 0, 0, 0, 0, 1]) # initialize

eta = 0.05 # learning rate
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    w = w - eta*dw

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD where batch-size=4.


SGD with momentum

Momentum is a method that helps accelerate SGD in the relevant direction and dampens oscillations as can be seen in image below. It does this by adding a fraction α of the update vector of the past time step to the current update vector.



eta = 0.05 # learning rate
alpha = 0.9 # momentum
nu = np.zeros_like(w)
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
    dw = compute_grad(X_expanded[ind, :], y[ind], w)
    nu = alpha*nu + eta*dw
    w = w - nu

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + momentum  where batch-size=4. As can be seen, the loss function drops much faster, leading to a faster convergence.



We also need to implement RMSPROP algorithm, which use squared gradients to adjust learning rate as follows:


eta = 0.05 # learning rate
alpha = 0.9 # momentum
G = np.zeros_like(w)
eps = 1e-8
n_iter = 100
batch_size = 4
loss = np.zeros(n_iter)

for i in range(n_iter):
    ind = np.random.choice(X_expanded.shape[0], batch_size)
    loss[i] = compute_loss(X_expanded, y, w)
     dw = compute_grad(X_expanded[ind, :], y[ind], w)
     G = alpha*G + (1-alpha)*dw**2
     w = w - eta*dw / np.sqrt(G + eps)

The following animation shows how the decision surface and the cross-entropy loss function changes with different batches with SGD + RMSProp where batch-size=4. As can be seen again, the loss function drops much faster, leading to a faster convergence.


2. Planar data classification with a neural network with one hidden layer, an implementation from scratch

In this assignment a neural net with a single hidden layer will be trained from scratch. We shall see a big difference between this model and the one implemented using logistic regression.

We shall learn how to:

  • Implement a 2-class classification neural network with a single hidden layer
  • Use units with a non-linear activation function, such as tanh
  • Compute the cross entropy loss
  • Implement forward and backward propagation


The following figure visualizes a “flower” 2-class dataset that we shall work on, the colors indicates the class labels.  We have m = 400 training examples.


Simple Logistic Regression

Before building a full neural network, lets first see how logistic regression performs on this problem. We can use sklearn’s built-in functions to do that, by running the code below to train a logistic regression classifier on the dataset.

# Train the logistic regression classifier
clf = sklearn.linear_model.LogisticRegressionCV();, Y.T);

We can now plot the decision boundary of the model and accuracy with the following code.

# Plot the decision boundary for logistic regression
plot_decision_boundary(lambda x: clf.predict(x), X, Y)
plt.title("Logistic Regression")

# Print accuracy
LR_predictions = clf.predict(X.T)
print ('Accuracy of logistic regression: %d ' % float((,LR_predictions) +,1-LR_predictions))/float(Y.size)*100) +
       '% ' + "(percentage of correctly labelled datapoints)")
Accuracy of logistic regression: 47 % (percentage of correctly labelled datapoints)
Accuracy 47%

Interpretation: The dataset is not linearly separable, so logistic regression doesn’t perform well. Hopefully a neural network will do better. Let’s try this now!


Neural Network model

Logistic regression did not work well on the “flower dataset”. We are going to train a Neural Network with a single hidden layer, by implementing the network with python numpy from scratch.

Here is our model:


The general methodology to build a Neural Network is to:

1. Define the neural network structure ( # of input units,  # of hidden units, etc). 
2. Initialize the model's parameters
3. Loop:
    - Implement forward propagation
    - Compute loss
    - Implement backward propagation to get the gradients
    - Update parameters (gradient descent)

Defining the neural network structure

Define three variables and the function layer_sizes:

- n_x: the size of the input layer
- n_h: the size of the hidden layer (set this to 4) 
- n_y: the size of the output layer
def layer_sizes(X, Y):
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer

Initialize the model’s parameters

Implement the function initialize_parameters().


  • Make sure the parameters’ sizes are right. Refer to the neural network figure above if needed.
  • We will initialize the weights matrices with random values.
    • Use: np.random.randn(a,b) * 0.01 to randomly initialize a matrix of shape (a,b).
  • We will initialize the bias vectors as zeros.
    • Use: np.zeros((a,b)) to initialize a matrix of shape (a,b) with zeros.
def initialize_parameters(n_x, n_h, n_y):
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer
    params -- python dictionary containing the parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)

The Loop

Implement forward_propagation().


  • Look above at the mathematical representation of the classifier.
  • We can use the function sigmoid().
  • We can use the function np.tanh(). It is part of the numpy library.
  • The steps we have to implement are:
    1. Retrieve each parameter from the dictionary “parameters” (which is the output of initialize_parameters()) by using parameters[".."].
    2. Implement Forward Propagation. Compute Z[1],A[1],Z[2]Z[1],A[1],Z[2] and A[2]A[2] (the vector of all the predictions on all the examples in the training set).
  • Values needed in the backpropagation are stored in “cache“. The cache will be given as an input to the backpropagation function.
def forward_propagation(X, parameters):
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing the parameters (output of initialization function)
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"

Implement compute_cost() to compute the value of the cost J.  There are many ways to implement the cross-entropy loss.


def compute_cost(A2, Y, parameters):
    Computes the cross-entropy cost given in equation (13)
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing the parameters W1, b1, W2 and b2
    cost -- cross-entropy cost given equation (13)

Using the cache computed during forward propagation, we can now implement backward propagation.

Implement the function backward_propagation().

Instructions: Backpropagation is usually the hardest (most mathematical) part in deep learning. The following figure is taken from is the slide from the lecture on backpropagation. We’ll want to use the six equations on the right of this slide, since we are building a vectorized implementation.


def backward_propagation(parameters, cache, X, Y):
    Implement the backward propagation using the instructions above.
    parameters -- python dictionary containing our parameters 
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    grads -- python dictionary containing the gradients with respect to different parameters

Implement the update rule. Use gradient descent. We have to use (dW1, db1, dW2, db2) in order to update (W1, b1, W2, b2).

General gradient descent rule: θ=θα(J/θ) where α is the learning rate and θ
represents a parameter.

Illustration: The gradient descent algorithm with a good learning rate (converging) and a bad learning rate (diverging). Images courtesy of Adam Harley.


def update_parameters(parameters, grads, learning_rate = 1.2):
    Updates parameters using the gradient descent update rule given above
    parameters -- python dictionary containing our parameters 
    grads -- python dictionary containing our gradients 
    parameters -- python dictionary containing our updated parameters 

Integrate previous parts in nn_model()

Build the neural network model in nn_model().

Instructions: The neural network model has to use the previous functions in the right order.

def nn_model(X, Y, n_h, num_iterations = 10000, print_cost=False):
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations
    parameters -- parameters learnt by the model. They can then be used to predict.


Use the model to predict by building predict(). Use forward propagation to predict results.


def predict(parameters, X):
    Using the learned parameters, predicts a class for each example in X
    parameters -- python dictionary containing our parameters 
    X -- input data of size (n_x, m)
    predictions -- vector of predictions of our model (red: 0 / blue: 1)

It is time to run the model and see how it performs on a planar dataset. Run the following code to test our model with a single hidden layer of nh hidden units.

# Build a model with a n_h-dimensional hidden layer
parameters = nn_model(X, Y, n_h = 4, num_iterations = 10000, print_cost=True)

# Plot the decision boundary
plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
plt.title("Decision Boundary for hidden layer size " + str(4))
Cost after iteration 0: 0.693048
Cost after iteration 1000: 0.288083
Cost after iteration 2000: 0.254385
Cost after iteration 3000: 0.233864
Cost after iteration 4000: 0.226792
Cost after iteration 5000: 0.222644
Cost after iteration 6000: 0.219731
Cost after iteration 7000: 0.217504
Cost after iteration 8000: 0.219471
Cost after iteration 9000: 0.218612


Cost after iteration 9000 0.218607
# Print accuracy
predictions = predict(parameters, X)
print ('Accuracy: %d' % float((,predictions.T) +,1-predictions.T))/float(Y.size)*100) + '%')
 Accuracy: 90%

Accuracy is really high compared to Logistic Regression. The model has learnt the leaf patterns of the flower! Neural networks are able to learn even highly non-linear decision boundaries, unlike logistic regression.

Now, let’s try out several hidden layer sizes. We can observe different behaviors of the model for various hidden layer sizes. The results are shown below.

Tuning hidden layer size

Accuracy for 1 hidden units: 67.5 %
Accuracy for 2 hidden units: 67.25 %
Accuracy for 3 hidden units: 90.75 %
Accuracy for 4 hidden units: 90.5 %
Accuracy for 5 hidden units: 91.25 %
Accuracy for 20 hidden units: 90.0 %
Accuracy for 50 hidden units: 90.25 %



  • The larger models (with more hidden units) are able to fit the training set better, until eventually the largest models overfit the data.
  • The best hidden layer size seems to be around n_h = 5. Indeed, a value around here seems to fits the data well without also incurring noticable overfitting.
  • We shall also learn later about regularization, which lets us use very large models (such as n_h = 50) without much overfitting.


3. Getting deeper with Keras


  • Tensorflow is a powerful and flexible tool, but coding large neural architectures with it is tedious.
  • There are plenty of deep learning toolkits that work on top of it like Slim, TFLearn, Sonnet, Keras.
  • Choice is matter of taste and particular task
  • We’ll be using Keras to predict handwritten digits with the mnist dataset.
  • The following figure shows 225 sample images from the dataset.




The pretty keras

Using only the following few lines of code we can learn a simple deep neural net with 3 dense hidden layers and with Relu activation, with dropout 0.5 after each dense layer.

import keras
from keras.models import Sequential
import keras.layers as ll

model = Sequential(name="mlp")
model.add(ll.InputLayer([28, 28]))

# network body



# output layer: 10 neurons for each class with softmax
model.add(ll.Dense(10, activation='softmax'))

# categorical_crossentropy is our good old crossentropy
# but applied for one-hot-encoded vectors
model.compile("adam", "categorical_crossentropy", metrics=["accuracy"])

The following shows the summary of the model:

Layer (type)                 Output Shape              Param #   
input_12 (InputLayer)        (None, 28, 28)            0         
flatten_12 (Flatten)         (None, 784)               0         
dense_35 (Dense)             (None, 128)               100480    
activation_25 (Activation)   (None, 128)               0         
dropout_22 (Dropout)         (None, 128)               0         
dense_36 (Dense)             (None, 128)               16512     
activation_26 (Activation)   (None, 128)               0         
dropout_23 (Dropout)         (None, 128)               0         
dense_37 (Dense)             (None, 128)               16512     
activation_27 (Activation)   (None, 128)               0         
dropout_24 (Dropout)         (None, 128)               0         
dense_38 (Dense)             (None, 10)                1290      
Total params: 134,794
Trainable params: 134,794
Non-trainable params: 0

Model interface

Keras models follow Scikit-learn‘s interface of fit/predict with some notable extensions. Let’s take a tour.

# fit(X,y) ships with a neat automatic logging.
#          Highly customizable under the hood., y_train,
          validation_data=(X_val, y_val), epochs=13);
 Train on 50000 samples, validate on 10000 samples
Epoch 1/13
50000/50000 [==============================] - 14s - loss: 0.1489 - acc: 0.9587 - val_loss: 0.0950 - val_acc: 0.9758
Epoch 2/13
50000/50000 [==============================] - 12s - loss: 0.1543 - acc: 0.9566 - val_loss: 0.0957 - val_acc: 0.9735
Epoch 3/13
50000/50000 [==============================] - 11s - loss: 0.1509 - acc: 0.9586 - val_loss: 0.0985 - val_acc: 0.9752
Epoch 4/13
50000/50000 [==============================] - 11s - loss: 0.1515 - acc: 0.9577 - val_loss: 0.0967 - val_acc: 0.9752
Epoch 5/13
50000/50000 [==============================] - 11s - loss: 0.1471 - acc: 0.9596 - val_loss: 0.1008 - val_acc: 0.9737
Epoch 6/13
50000/50000 [==============================] - 11s - loss: 0.1488 - acc: 0.9598 - val_loss: 0.0989 - val_acc: 0.9749
Epoch 7/13
50000/50000 [==============================] - 11s - loss: 0.1495 - acc: 0.9592 - val_loss: 0.1011 - val_acc: 0.9748
Epoch 8/13
50000/50000 [==============================] - 11s - loss: 0.1434 - acc: 0.9604 - val_loss: 0.1005 - val_acc: 0.9761
Epoch 9/13
50000/50000 [==============================] - 11s - loss: 0.1514 - acc: 0.9590 - val_loss: 0.0951 - val_acc: 0.9759
Epoch 10/13
50000/50000 [==============================] - 11s - loss: 0.1424 - acc: 0.9613 - val_loss: 0.0995 - val_acc: 0.9739
Epoch 11/13
50000/50000 [==============================] - 11s - loss: 0.1408 - acc: 0.9625 - val_loss: 0.0977 - val_acc: 0.9751
Epoch 12/13
50000/50000 [==============================] - 11s - loss: 0.1413 - acc: 0.9601 - val_loss: 0.0938 - val_acc: 0.9753
Epoch 13/13
50000/50000 [==============================] - 11s - loss: 0.1430 - acc: 0.9619 - val_loss: 0.0981 - val_acc: 0.9761

As we could see, with a simple model without any convolution layers we could obtain more than 97.5% accuracy on the validation dataset.

The following figures show the weights learnt at different layers.


Some Tips & tricks to improve accuracy

Here are some tips on what we can do to improve accuracy:

  • Network size
    • More neurons,
    • More layers, (docs)
    • Nonlinearities in the hidden layers
      • tanh, relu, leaky relu, etc
    • Larger networks may take more epochs to train, so don’t discard the net just because it could didn’t beat the baseline in 5 epochs.
  • Early Stopping
    • Training for 100 epochs regardless of anything is probably a bad idea.
    • Some networks converge over 5 epochs, others – over 500.
    • Way to go: stop when validation score is 10 iterations past maximum
  • Faster optimization
    • rmsprop, nesterov_momentum, adam, adagrad and so on.
      • Converge faster and sometimes reach better optima
      • It might make sense to tweak learning rate/momentum, other learning parameters, batch size and number of epochs
  • Regularize to prevent overfitting
  • Data augmemntation – getting 5x as large dataset for free is a great deal
    • Zoom-in+slice = move
    • Rotate+zoom(to remove black stripes)
    • any other perturbations
    • Simple way to do that (if we have PIL/Image):
      • from scipy.misc import imrotate,imresize
      • and a few slicing
    • Stay realistic. There’s usually no point in flipping dogs upside down as that is not the way we usually see them.

Some Data Processing and Analysis with Python

The following problems appeared as assignments in the edX course Analytics for Computing (by Gatech). The descriptions are taken from the assignments.

1. Association rule mining

First we shall implement the basic pairwise association rule mining algorithm.

Problem definition

Let’s say we have a fragment of text in some language. We wish to know whether there are association rules among the letters that appear in a word. In this problem:

  • Words are “receipts”
  • Letters within a word are “items”

We want to know whether there are association rules of the form, a⟹ b , where a and b are letters, for a given language (by calculating for each rule its confidencecon(⟹ b), which is an estimate of the conditional probability of b given a, or Pr[b|a].

Sample text input

Let’s carry out this analysis on a “dummy” text fragment, which graphic designers refer to as the lorem ipsum:

latin_text= """
Sed ut perspiciatis, unde omnis iste natus error sit voluptatem accusantium doloremque laudantium, totam
rem aperiam eaque ipsa, quae ab illo inventore veritatis et quasi architecto beatae vitae dicta
sunt, explicabo. Nemo enim ipsam voluptatem, quia voluptas sit, aspernatur aut odit aut fugit, sed
quia consequuntur magni dolores eos, qui ratione voluptatem sequi nesciunt, neque porro quisquam est,
qui dolorem ipsum, quia dolor sit amet consectetur adipisci[ng] velit, sed quia non numquam [do] eius
modi tempora inci[di]dunt, ut labore et dolore magnam aliquam quaerat voluptatem. Ut enim ad minima
veniam, quis nostrum exercitationem ullam corporis suscipit laboriosam, nisi ut aliquid ex ea commodi
consequatur? Quis autem vel eum iure reprehenderit, qui in ea voluptate velit esse, quam nihil molestiae
consequatur, vel illum, qui dolorem eum fugiat, quo voluptas nulla pariatur?

At vero eos et accusamus et iusto odio dignissimos ducimus, qui blanditiis praesentium voluptatum
deleniti atque corrupti, quos dolores et quas molestias excepturi sint, obcaecati cupiditate non
provident, similique sunt in culpa, qui officia deserunt mollitia animi, id est laborum et dolorum
fuga. Et harum quidem rerum facilis est et expedita distinctio. Nam libero tempore, cum soluta nobis est
eligendi optio, cumque nihil impedit, quo minus id, quod maxime placeat, facere possimus, omnis voluptas
assumenda est, omnis dolor repellendus. Temporibus autem quibusdam et aut officiis debitis aut rerum
necessitatibus saepe eveniet, ut et voluptates repudiandae sint et molestiae non recusandae. Itaque
earum rerum hic tenetur a sapiente delectus, ut aut reiciendis voluptatibus maiores alias consequatur
aut perferendis doloribus asperiores repellat.

Data cleaning

Like most data in the real world, this dataset is noisy. It has both uppercase and lowercase letters, words have repeated letters, and there are all sorts of non-alphabetic characters. For our analysis, we should keep all the letters and spaces (so we can identify distinct words), but we should ignore case and ignore repetition within a word.

For example, the eighth word of this text is “error.” As an itemset, it consists of the three unique letters{e,o,r}. That is, to treat the word as a set, meaning we only keep the unique letters. This itemset has six possible itempairs{e,o}{e,r}, and {o,r}.

  • We need to start by “cleaning up” (normalizing) the input, with all characters converted to lowercase and all non-alphabetic, non-space characters removed.
  • Next, we need to convert each word into an itemset like the following examples:

consequatur –> {‘a’, ‘e’, ‘c’, ‘s’, ‘n’, ‘o’, ‘u’, ‘t’, ‘q’, ‘r’}
voluptatem –> {‘l’, ‘a’, ‘e’, ‘o’, ‘p’, ‘u’, ‘m’, ‘t’, ‘v’}

Implementing the basic algorithm

The followed algorithm is implemented:



First all item-pairs within an itemset are enumerated and a table that tracks the counts of those item-pairs is updated in-place.

  • Now, given tables of item-paircounts and individual item counts, as well as a confidence threshold, the rules that meet the threshold are returned.
  • The returned rules should be in the form of a dictionary whose key is the tuple(a,bcorresponding to the rule a⇒ b, and whose value is the confidence of the rule, conf(⇒ b).
  • The following functions were implemented to compute the association rules.
    from collections import defaultdict
    from itertools import combinations 
    def update_pair_counts (pair_counts, itemset):
        Updates a dictionary of pair counts for all pairs of items 
        in a given itemset.
        assert type (pair_counts) is defaultdict
        for item in list(combinations(itemset, 2)):
            pair_counts[item] += 1
            pair_counts[item[::-1]] += 1
        return pair_counts
    def update_item_counts(item_counts, itemset):
        for item in itemset:
            item_counts[item] += 1
        return item_counts
    def filter_rules_by_conf (pair_counts, item_counts, threshold):
        rules = {} # (item_a, item_b) -> conf (item_a => item_b)
        for (item_a, item_b) in pair_counts:
            conf = pair_counts[(item_a, item_b)] / float(item_counts[item_a])
            if conf >= threshold:
                rules[(item_a, item_b)] = conf
        return rules
    def find_assoc_rules(receipts, threshold):
        pair_counts = defaultdict(int)
        item_counts = defaultdict(int)
        for receipt in receipts:
            update_pair_counts(pair_counts, receipt)
            update_item_counts(item_counts, receipt)
        return filter_rules_by_conf(pair_counts, item_counts, threshold)


  • For the Latin string, latin_text, the function find_assoc_rules() was used to compute the rules whose confidence is at least 0.75, with the following rules obtained as result.

    conf(q => u) = 1.000
    conf(x => e) = 1.000
    conf(x => i) = 0.833
    conf(h => i) = 0.833
    conf(v => t) = 0.818
    conf(r => e) = 0.800
    conf(v => e) = 0.773
    conf(f => i) = 0.750
    conf(b => i) = 0.750
    conf(g => i) = 0.750


  • Now, let’s look at rules common to the above Latin textandEnglish text obtained by a translation of the lorem ipsum text, as shown below:
    english_text = """
    But I must explain to you how all this mistaken idea of denouncing of a pleasure and praising pain was
    born and I will give you a complete account of the system, and expound the actual teachings of the great
    explorer of the truth, the master-builder of human happiness. No one rejects, dislikes, or avoids
    pleasure itself, because it is pleasure, but because those who do not know how to pursue pleasure
    rationally encounter consequences that are extremely painful. Nor again is there anyone who loves or
    pursues or desires to obtain pain of itself, because it is pain, but occasionally circumstances occur in
    which toil and pain can procure him some great pleasure. To take a trivial example, which of us
    ever undertakes laborious physical exercise, except to obtain some advantage from it? But who has any
    right to find fault with a man who chooses to enjoy a pleasure that has no annoying consequences, or
    one who avoids a pain that produces no resultant pleasure?
    On the other hand, we denounce with righteous indignation and dislike men who are so beguiled and
    demoralized by the charms of pleasure of the moment, so blinded by desire, that they cannot foresee the
    pain and trouble that are bound to ensue; and equal blame belongs to those who fail in their duty
    through weakness of will, which is the same as saying through shrinking from toil and pain. These
    cases are perfectly simple and easy to distinguish. In a free hour, when our power of choice is
    untrammeled and when nothing prevents our being able to do what we like best, every pleasure is to
    be welcomed and every pain avoided. But in certain circumstances and owing to the claims of duty or
    the obligations of business it will frequently occur that pleasures have to be repudiated and
    annoyances accepted. The wise man therefore always holds in these matters to this principle of
    selection: he rejects pleasures to secure other greater pleasures, or else he endures pains to
    avoid worse pains.


  • Again, for the English string, english_text, the function find_assoc_rules()  was used to compute the rules whose confidence is at least 0.75, with the following rules obtained as result.

    conf(z => a) = 1.000
    conf(j => e) = 1.000
    conf(z => o) = 1.000
    conf(x => e) = 1.000
    conf(q => e) = 1.000
    conf(q => u) = 1.000
    conf(z => m) = 1.000
    conf(z => r) = 1.000
    conf(z => l) = 1.000
    conf(z => e) = 1.000
    conf(z => d) = 1.000
    conf(z => i) = 1.000
    conf(k => e) = 0.778
    conf(q => n) = 0.750


  • Let’s consider any rules with a confidence of at least 0.75 to be a “high-confidence rule“.  The common_high_conf_rules are all the high-confidence rules appearing in both the Latin text and the English text. The rules shown below are all such rules:High-confidence rules common to _lorem ipsum_ in Latin and English:

    q => u
    x => e


  • The following table and the figure show the high confidence rules for  the latin and the english texts.
    index rule confidence
    0 z=> o 1.000000 English
    1 z=> l 1.000000 English
    2 z=> m 1.000000 English
    3 q=> u 1.000000 English
    4 q=> e 1.000000 English
    5 x=> e 1.000000 English
    6 z=> e 1.000000 English
    7 j=> e 1.000000 English
    8 z=> a 1.000000 English
    9 z=> d 1.000000 English
    10 q=> u 1.000000 Latin
    11 z=> i 1.000000 English
    12 x=> e 1.000000 Latin
    13 z=> r 1.000000 English
    14 x=> i 0.833333 Latin
    15 h=> i 0.833333 Latin
    16 v=> t 0.818182 Latin
    17 r=> e 0.800000 Latin
    18 k=> e 0.777778 English
    19 v=> e 0.772727 Latin
    20 g=> i 0.750000 Latin
    21 q=> n 0.750000 English
    22 f=> i 0.750000 Latin
    23 b=> i 0.750000 Latin


Putting it all together: Actual baskets!

Let’s take a look at some real data  from this link. First few lines of the transaction data is shown below:

citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product
whole milk,butter,yogurt,rice,abrasive cleaner
other vegetables,UHT-milk,rolls/buns,bottled beer,liquor (appetizer)
pot plants
whole milk,cereals
tropical fruit,other vegetables,white bread,bottled water,chocolate
citrus fruit,tropical fruit,whole milk,butter,curd,yogurt,flour,bottled water,dishes
chicken,tropical fruit
butter,sugar,fruit/vegetable juice,newspapers
fruit/vegetable juice
packaged fruit/vegetables
specialty bar
other vegetables
butter milk,pastry
whole milk
tropical fruit,cream cheese ,processed cheese,detergent,newspapers
tropical fruit,root vegetables,other vegetables,frozen dessert,rolls/buns,flour,sweet spreads,salty snack,waffles,candy,bathroom cleaner
bottled water,canned beer
other vegetables
brown bread,soda,fruit/vegetable juice,canned beer,newspapers,shopping bags
yogurt,beverages,bottled water,specialty bar

  • Our task is to mine this dataset for pairwise association rules to produce a final dictionary, basket_rules, that meet these conditions:
  1. The keys are pairs (a,b), where a and b are item names.
  2. The values are the corresponding confidence scores, conf(⇒ b).
  3. Only include rules ⇒ b where item a occurs at least MIN_COUNT times and conf(⇒ b) is at least THRESHOLD.

The result is shown below:

Found 19 rules whose confidence exceeds 0.5.
Here they are:

conf(honey => whole milk) = 0.733
conf(frozen fruits => other vegetables) = 0.667
conf(cereals => whole milk) = 0.643
conf(rice => whole milk) = 0.613
conf(rubbing alcohol => whole milk) = 0.600
conf(cocoa drinks => whole milk) = 0.591
conf(pudding powder => whole milk) = 0.565
conf(jam => whole milk) = 0.547
conf(cream => sausage) = 0.538
conf(cream => other vegetables) = 0.538
conf(baking powder => whole milk) = 0.523
conf(tidbits => rolls/buns) = 0.522
conf(rice => other vegetables) = 0.520
conf(cooking chocolate => whole milk) = 0.520
conf(frozen fruits => whipped/sour cream) = 0.500
conf(specialty cheese => other vegetables) = 0.500
conf(ready soups => rolls/buns) = 0.500
conf(rubbing alcohol => butter) = 0.500
conf(rubbing alcohol => citrus fruit) = 0.500

2. Simple string processing with Regex

Phone numbers

  • Write a function to parse US phone numbers written in the canonical “(404) 555-1212” format, i.e., a three-digit area code enclosed in parentheses followed by a seven-digit local number in three-hyphen-four digit format.
  • It should also ignore all leading and trailing spaces, as well as any spaces that appear between the area code and local numbers.
  • However, it should not accept any spaces in the area code (e.g., in ‘(404)’) nor should it in the local number.
  • It should return a triple of strings, (area_code, first_three, last_four).
  • If the input is not a valid phone number, it should raise a ValueError.
The following function implements the regex parser.
import re 
def parse_phone(s):
    pattern = re.compile("\s*\((\d{3})\)\s*(\d{3})-(\d{4})\s*")
    m = pattern.match(s)
    if not m:
        raise ValueError('not a valid phone number!')
    return m.groups()

#print(parse_phone1('(404) 201-2121'))    
  • Implement an enhanced phone number parser that can handle any of these patterns.
    • (404) 555-1212
    • (404) 5551212
    • 404-555-1212
    • 404-5551212
    • 404555-1212
    • 4045551212
  • As before, it should not be sensitive to leading or trailing spaces. Also, for the patterns in which the area code is enclosed in parentheses, it should not be sensitive to the number of spaces separating the area code from the remainder of the number.

The following function implements the enhanced regex parser.

import re 
def parse_phone2 (s):
    pattern = re.compile("\s*\((\d{3})\)\s*(\d{3})-?(\d{4})\s*")
    m = pattern.match(s)
    if not m:
        pattern2 = re.compile("\s*(\d{3})-?(\d{3})-?(\d{4})\s*")
        m = pattern2.match(s)
        if not m:
            raise ValueError('not a valid phone number!')
    return m.groups()

3. Tidy data and the Pandas

“Tidying data,” is all about cleaning up tabular data for analysis purposes.

Definition: Tidy datasets. More specifically, Wickham defines a tidy data set as one that can be organized into a 2-D table such that

  1. each column represents a variable;
  2. each row represents an observation;
  3. each entry of the table represents a single value, which may come from either categorical (discrete) or continuous spaces.

Definition: Tibbles. if a table is tidy, we will call it a tidy table, or tibble, for short.

Apply functions to data frames

Given the following pandas DataFrame (first few rows are shown in the next table),compute the prevalence, which is the ratio of cases to the population, using the apply() function, without modifying the original DataFrame.

country year cases population
0 Afghanistan ’99 745 19987071
1 Brazil ’99 37737 172006362
2 China ’99 212258 1272915272
3 Afghanistan ’00 2666 20595360
4 Brazil ’00 80488 174504898
5 China ’00 213766 1280428583

The next function does exactly the same thing.

def calc_prevalence(G):
    assert 'cases' in G.columns and 'population' in G.columns
    H = G.copy()
    H['prevalence'] = H.apply(lambda row: row['cases'] / row['population'], axis=1)
    return H

Tibbles and Bits

Now let’s start creating and manipulating tibbles.

Write a function, canonicalize_tibble(X), that, given a tibble X, returns a new copy Y of X in canonical order. We say Y is in canonical order if it has the following properties.

  1. The variables appear in sorted order by name, ascending from left to right.
  2. The rows appear in lexicographically sorted order by variable, ascending from top to bottom.
  3. The row labels (Y.index) go from 0 to n-1, where n is the number of observations.

The following code exactly does the same:

def canonicalize_tibble(X):
    # Enforce Property 1:
    var_names = sorted(X.columns)
    Y = X[var_names].copy()
    Y = Y.sort_values(by=var_names, ascending=True)
    Y.reset_index(drop=True, inplace=True)
    return Y

Basic tidying transformations: Implementing Melting and Casting

Given a data set and a target set of variables, there are at least two common issues that require tidying.


First, values often appear as columns. Table 4a is an example. To tidy up, we want to turn columns into rows:


Because this operation takes columns into rows, making a “fat” table more tall and skinny,  it is sometimes called melting.

To melt the table, we need to do the following.

  1. Extract the column values into a new variable. In this case, columns "1999"  and "2000" of table4 need to become the values of the variable, "year".
  2. Convert the values associated with the column values into a new variable as well. In this case, the values formerly in columns "1999" and "2000"become the values of the "cases" variable.

In the context of a melt, let’s also refer to "year" as the new key variable and
"cases" as the new value variable.

Implement the melt operation as a function,

def melt(df, col_vals, key, value):

It should take the following arguments:

  • df: the input data frame, e.g., table4 in the example above;
  • col_vals: a list of the column names that will serve as values;
  • key: name of the new variable, e.g., year in the example above;
  • value: name of the column to hold the values.

The next function implements the melt operation:

def melt(df, col_vals, key, value):
    assert type(df) is pd.DataFrame
    df2 = pd.DataFrame()
    for col in col_vals:
        df1 = pd.DataFrame(df[col].tolist(), columns=[value]) #,
        df1[key] = col
        other_cols = list(set(df.columns.tolist()) - set(col_vals))
        for col1 in other_cols:
            df1[col1] = df[col1]
        df2 = df2.append(df1, ignore_index=True)
    df2 = df2[other_cols + [key, value]]    
    return df2

with the following output

=== table4a ===
country 1999 2000
0 Afghanistan 745 2666
1 Brazil 37737 80488
2 China 212258 213766
=== melt(table4a) ===
country year cases
0 Afghanistan 1999 745
1 Brazil 1999 37737
2 China 1999 212258
3 Afghanistan 2000 2666
4 Brazil 2000 80488
5 China 2000 213766
=== table4b ===
country 1999 2000
0 Afghanistan 19987071 20595360
1 Brazil 172006362 174504898
2 China 1272915272 1280428583
=== melt(table4b) ===
country year population
0 Afghanistan 1999 19987071
1 Brazil 1999 172006362
2 China 1999 1272915272
3 Afghanistan 2000 20595360
4 Brazil 2000 174504898
5 China 2000 1280428583


The second most common issue is that an observation might be split across multiple rows. Table 2 is an example. To tidy up, we want to merge rows:


Because this operation is the moral opposite of melting, and “rebuilds” observations from parts, it is sometimes called casting.

Melting and casting are Wickham’s terms from his original paper on tidying data. In his more recent writing, on which this tutorial is based, he refers to the same operation as gathering. Again, this term comes from Wickham’s original paper, whereas his more recent summaries use the term spreading.

The signature of a cast is similar to that of melt. However, we only need to know the key, which is column of the input table containing new variable names, and the value, which is the column containing corresponding values.

Implement a function to cast a data frame into a tibble, given a key column containing new variable names and a value column containing the corresponding cells.

Observe that we are asking your cast() to accept an optional parameter, join_how, that may take the values 'outer' or 'inner' (with 'outer' as the default).

The following function implements the casting operation:

def cast(df, key, value, join_how='outer'):
    """Casts the input data frame into a tibble,
    given the key column and value column.
    assert type(df) is pd.DataFrame
    assert key in df.columns and value in df.columns
    assert join_how in ['outer', 'inner']
    fixed_vars = df.columns.difference([key, value])
    tibble = pd.DataFrame(columns=fixed_vars) # empty frame
    fixed_vars = fixed_vars.tolist()
    #tibble[fixed_vars] = df[fixed_vars]
    cols = []
    for k,df1 in df.groupby(df[key]):
        #tibble = pd.concat([tibble.reset_index(drop=True), df1[value]], axis=1)
        tibble = tibble.merge(df1[fixed_vars+[value]], on=fixed_vars, how=join_how)
        cols.append(str(k)) #list(set(df1[key]))[0])
    tibble.columns = fixed_vars + cols
    return tibble

with the following output:

=== table2 ===
country year type count
0 Afghanistan 1999 cases 745
1 Afghanistan 1999 population 19987071
2 Afghanistan 2000 cases 2666
3 Afghanistan 2000 population 20595360
4 Brazil 1999 cases 37737
5 Brazil 1999 population 172006362
6 Brazil 2000 cases 80488
7 Brazil 2000 population 174504898
8 China 1999 cases 212258
9 China 1999 population 1272915272
10 China 2000 cases 213766
11 China 2000 population 1280428583
=== tibble2 = cast (table2, "type", "count") ===
country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583

Separating variables

Consider the following table.


country year rate
0 Afghanistan 1999 745/19987071
1 Afghanistan 2000 2666/20595360
2 Brazil 1999 37737/172006362
3 Brazil 2000 80488/174504898
4 China 1999 212258/1272915272
5 China 2000 213766/1280428583

In this table, the rate variable combines what had previously been the cases
andpopulation data. This example is an instance in which we might want to separate a column into two variables.

Write a function that takes a data frame (df) and separates an existing column (key) into new variables (given by the list of new variable names, into).  How will the separation happen? The caller should provide a function, splitter(x), that given a value returns a list containing the components.


The following code implements the function:

import re

def default_splitter(text):
    """Searches the given spring for all integer and floating-point
    values, returning them as a list _of strings_.
    E.g., the call
      default_splitter('Give me $10.52 in exchange for 91 kitten stickers.')
    will return ['10.52', '91'].
    #fields = re.findall('(\d+\.?\d+)', text)
    fields = list(re.match('(\d+)/(\d+)', text).groups())
    return fields

def separate(df, key, into, splitter=default_splitter):
    """Given a data frame, separates one of its columns, the key,
    into new variables.
    assert type(df) is pd.DataFrame
    assert key in df.columns    
    return (df.merge(df[key].apply(lambda s: pd.Series({into[i]:splitter(s)[i] for i in range(len(into))})), 
    left_index=True, right_index=True)).drop(key, axis=1)

with the following output:

=== table3 ===
country year rate
0 Afghanistan 1999 745/19987071
1 Afghanistan 2000 2666/20595360
2 Brazil 1999 37737/172006362
3 Brazil 2000 80488/174504898
4 China 1999 212258/1272915272
5 China 2000 213766/1280428583
=== tibble3 = separate (table3, ...) ===
country year cases population
0 Afghanistan 1999 745 19987071
1 Afghanistan 2000 2666 20595360
2 Brazil 1999 37737 172006362
3 Brazil 2000 80488 174504898
4 China 1999 212258 1272915272
5 China 2000 213766 1280428583