Custom Object Detection with transfer learning with pre-trained YOLO-V4 model

In this blog we shall demonstrate how to start with a pre-trained Yolo (You only look once) V4 end-to-end one-stage object detection model (trained on MS COCO dataset) and train it to detect a custom object (Raccoon).

Dataset Description / Exploration

  • Roboflow allows to download the annotated images (with bounding boxes for the object Raccoon to be detected) in different formats, here we shall use darknet text format for the bounding box annotations, which can be used for both YOLO V3 and V4, as shown in the next figure.
  • The following figure shows an image and the corresponding annotation text, denoting the position of the bounding box for the Raccoon object in the image. It will be used to firther train the YOLO-V4 model, to make it able to detect the custom object Raccoon.
  • In the above annotation, the first two coordinates represent the center of the bounding box and the next two represent the width and height of the bounding box, respectively.
  • From the above representation, the bounding box left, top and right bottom
    coordinates can be computed as follows:

(x1, y1) = (416 × 0.3790 − 416×0.4904 / 2, 416 × 0.4796 − 416×0.7115 / 2) ≈ (56, 52)
(x2, y2) = (416 × 0.3790 + 416×0.4904 / 2, 416 × 0.4796 + 416×0.7115 / 2) ≈ (260, 348)

the corresponding bounding box can be drawn as shown in the next figure:

Objective & Outline

  • The original YOLO-v4 deep learning model being trained on MS COCO dataset, it can detect objects
    belonging to 80 different classes. Unfortunately, those 80 classes don’t include Raccoon, hence, without explicit training the pre-trained model will not be able to identify the Raccoons from the image dataset.
  • Also, we have only 196 images of Raccoon, which is a pretty small number, so it’s not feasible to train the YOLO-V4 model from scratch.
  • However, this is an ideal scenario to apply transfer learning. Since the task is same, i.e., object
    detection, we can always start with the pretrained weights on COCO dataset and then train the model on our images, starting from those initial weights.
  • Instead of training a model from scratch, let’s use pre-trained YOLOv4 weights which have been trained up to 137 convolutional layers. Since the original model was trained on COCO dataset with 80 classes and we are interested in detection of an object of a single class (namely Raccoon), we need to modify the corresponding layers (in the conig file).

Data cleaning / feature engineering

  • Images are normalized to have value in between 0-1. Histogram equalization / contrast stretching can be used for image enhancement. Since this task involves object localization, data augmentation was not used, since it would then require the re-computation of the bounding box.
  • No other feature engineering technique was used, since the deep neural net contains so many
    convolution layers that automatically generates many different features, the earlier layers with simpler features and later layers more complicated features.

Training with transfer learning – Configuration and hyperparameter settings

  • Google colab is to be used to train the model on GPU.
  • To start with we need to first clone the darknet source from the following git repository using the following command:
!git clone
  • We need to change the Makefile to enable GPU and opencv and run make to create the darknet
  • Next we need to download the pre-trained model yolov4.conv.137 and copy it to the right folder, using the following code.
!wget -P build/darknet/x64/
  • We need to copy the input image / annotation files to the right folders and provide the information for training data to the model, e.g., create a file build/darknet/x64/data/, that looks like the following:
  • Here the training and validation text files list the names of the training and validation set images, whereas backup represents the location for saving the model checkpoints while training.
  • We need to create a configuration file (yolov4_train.cfg, e.g.,) for training the model on our images. A
    relevant portion of the config file (with few of the hyperparameters) to be used for training the YOLO-V4 model is shown below:
  • Total number of images we have is 196, out of which 153 of them are used for training and the remaining are used for validation.
  • Since the number of training images is small, we keep the batch size hyperparamater (used 3 different values, namely, 16, 8 and 4) for training small too.
  • Notice that we need to chaneg the number of classes to (since we are interested to detect a single
    object here), as opposed to 80 in the original config file and the number of features as (1+5) x 3 = 18, as shown in the next figure, a part of the config file, again.
  • Number of batches for which the model is trained is 2000 (since it is recommended to be at least
    200 x num_classes), the model checkpoints stored at batches 500, 1000 and 2000 respectively.
    Now we can start training the model on our images, initializing it with the pretrained weights, using the following line of code.
!./darknet detector train build/darknet/x64/data/ cfg/yolov4_train.cfg build/darknet/x64/
yolov4.conv.137 -dont_show
  • A few iterations of training are shown in the below figure:
  • It takes around ~2 hrs to finish 2000 batches and the final model weights are stored in a file
    (yolov4_train_final.weights) on the backup folder provide.

Model Selection and Testing / Prediction

  • Since the batch size 8 and subdivision size 2 resulted in higher accuracy (in terms of IOU), the
    corresponding model is selected as the best fit model.
  • The final model checkpoint saved can be used for prediction (with an unseen image test.jpg) with the following line of code:
! ./darknet detector test build/darknet/x64/data/ cfg/yolov4_train.cfg build/darknet/x64/backup/
yolov4_train_latest.weights -dont_show test.jpg
  • Around ~500 test images with raccoons were used for custom object detection with the model trained. The following figures show the custom objects (Raccoons) detected with the model on a few unseen images.

The next animation shows how the racoons are detected with the model:


  • With a relatively few number of iterations and a small number of training images we could do a descent job for detecting custom objects using transfer learning.
  • The YOLO model’s advantage being its speed (since a one-stage object detection model), starting with weights pretrained on MS-COCO for object detection followed by transfer learning one can detect custom objects with a few hours of training.

Next steps

We obtained a few false positives and false negatives with the model trained. To improve the performance of the model,

  • We can train the model for more batches (~10k)
  • Increase the input images with data augmentation + re-annotation
  • Tune many of the hyperparameters (momentum, decay etc.) of the model



Environmental (Spatiotemporal) Data Analysis with Gaussian Processes

In this blog, we shall first discuss about how to use simulation techniques and then Gaussian Processes to analyze spatiotemporal data and make predictions. This problem appeared as homework / project in the edX course MITx 6.419x Data Analysis: Statistical Modeling and Computation in Applications, everything about the problem description, statement is taken from the course itself.

  • The Philippine Archipelago is a fascinating multiscale ocean region. Its geometry is complex, with multiple straits, islands, steep shelf-breaks, and coastal features, leading to partially interconnected seas and basins. In this part, we will be studying, understanding, and navigating through the ocean current flows.
  • The data set may be found in OceanFlow. It consists of the ocean flow vectors for time  from 1 to 100. The flow in the data set is an averaged flow from the surface to either near the bottom or 400m of depth, whichever is shallower. It is thus a 2D vector field.
  • There are two sets of csv data files:  the first set contains 100 files (corresponding to the information flow at that time index) for the horizontal components of the vectors and the second set contains another 100 files with the vertical components at different (x,y) coordinates. Here zero-indexed 
    arrays are used for python. The following code shows how the horizontal component of the velocity vectors corresponding to time index 1 looks like:
import pandas as pd
df = pd.read_csv('Module5\OceanFlow/1u.csv', header=None)
  • There is yet another mask file, which one, if needed, contains a 0-1 matrix identifying land and water. The following code shows how a mask file looks like.
dfm = pd.read_csv('Module5\OceanFlow/mask.csv', header=None)
  • Additional info and units: The data were collected in January 2009. Flows are gien in kilometers per hour (km/h) units. The time interval between the data snapshots is 3hrs. The first time index ( for zero-indexed,  for one-indexed) will correspond in these problems to the time coordinate of  hrs. Thus, for example, 1u.csv gives data at a time coordinate of  hours.
  • The grid spacing used is 3 km. The matrix index  will correspond in these problems to the coordinate (km,km), or the bottom, left of the plot. For simplicity, we will not be using longitudes and latitudes in this problem.
  • The columns of the .csv files correspond to the horizontal direction (-axis), the rows of the .csv files correspond to the vertical direction (-axis).
  • The data has been provided by the MSEAS research group at MIT ( The flow field is from a data-assimilative multiresolution simulation obtained using their MSEAS primitive-equation ocean modeling system. It simulates tidal flows to larger-scale dynamics in the region, assimilating a varied set of gappy observations.


We need to start by preprocessing the spatiotemporal dataset. We can create a 3D numpy array for each of horizontal and vertical flows, where the dimensions represent x, y coordinates and time, respectively. The following code snippet shows, how the preprocessing is done for the vertical flow velocity vectors.

import numpy as np
v = np.zeros((100, 504, 555))
for i in range(1, 101):
    df = pd.read_csv('Module5\OceanFlow/{}v.csv'.format(i), header=None)
    v[i-1] = df.values * mask #df.iloc[::-1].values * mask

Problem 1: Identifying long-range correlations

In this problem, we will try to identify areas in the Philippine Archipelago with long-range correlations. Your task is to identify two places on the map that are not immediately next to each other but still have some high correlation in their flows. Your response should be the map of the Archipelago with the two areas marked (e.g., circled). You claim that those two areas have correlated flows. Explain how did you found those two areas have correlated flows. Below you will find some hints.

A point is a particular location in the provided data set. Moreover, we have 100 measurements (at 100 different times) of the flow speeds in the -axis and -axis for each point. To compute the correlation between two points: select two points and compute the correlation coefficient along the -direction, or the -direction, or both. That is a correlation between two vectors of dimension 100.

  • The provided data set is quite large, with a map of  555×504 points. Thus, computing the correlation between every possible pair of points might be computationally expensive. We shall need to compute (555×504)2 correlations! That is very large. Instead, we can randomly sample pairs of points and compute their correlations.
  • Since this is a relatively small area, there might be many correlations between the points, so set the threshold to define “high correlations” sufficiently large. If it is too small, we shall find that most of the points are correlated. If it is too high, no pairs of points will be correlated.
  • An area might be a single point correlated with other points. However, maybe there are clusters of points that are correlated with other clusters of points.
  • Remember that correlation can be positive or negative. We may want to find positively correlated areas or negatively correlated areas.
  • Let’s find a correlation for each direction separately. We can combine these two values using some aggregate measure. Maybe set the correlation between the two points as the maximum directional correlation. Average between the two directional correlations also work. Of course, the minimum as well.


The next figure shows 2 locations that are far apart but still have high positive correlation > 0.7 (threshold). In the next figure the blue lines correspond to low correlation, where the red ones correspond to high correlations (> threshold).

The following code snippet shows how the long correlation was found.

# np.random.seed(7)
n = 100
cor_thresh = 0.7
dist_thresh = 100
i1s, j1s = np.random.choice(mask.shape[1], n, replace=True), np.random.choice(mask.shape[0], n, replace=True)
i2s, j2s = np.random.choice(mask.shape[1], n, replace=True), np.random.choice(mask.shape[0], n, replace=True)
for k in range(n):
    cor_u = np.corrcoef(u[:,j1s[k],i1s[k]], u[:,j2s[k],i2s[k]])[0,1]
    cor_v = np.corrcoef(v[:,j1s[k],i1s[k]], v[:,j2s[k],i2s[k]])[0,1]
    cor = min(cor_u, cor_v)
    cor_high = cor > cor_thresh
    far_enough = np.sqrt((i1s[k]-i2s[k])**2+(j1s[k]-j2s[k])**2) > dist_thresh
    if cor_high and far_enough:
        print('point1: {}, point2: {}, correlation = {}'.format((i1s[k],j1s[k]), (i2s[k],j2s[k]), cor))  

How the correlation was computed:

  • First 100 random pairs were chosen from all possible pair of 555×504 points.
  • Correlations between each of the pair of points, for both flow u and v were computed (between 100 dim vectors across the time axis).
  • The minimum of correlations of u and v flows were chosen to compute the final correlation.

Why the two marked locations could be correlated:

As can be seen from the below figure, comparing the u-flows and v-flows of the two locations, they have similar trends over time and that’s why they are highly correlated (e.g., for v flows, the blue and green lines decrease till 75*3 hrs and then increase).

The following figure shows how the time series data for the horizontal and vertical flows are highly correlated for the pair of points found above.

plt.plot(range(100), u[:,233, 75], label='u(75, 233)')
plt.plot(range(100), v[:,233, 75], label='u(75, 233)')
plt.plot(range(100), u[:,103 ,548], label='v(548, 103)')
plt.plot(range(100), v[:,103 ,548], label='v(548, 103)')

Problem 2: Simulating particle movement in flows

In this problem, we shall build a simulator that can track a particle’s movement on a time-varying flow.

  • We assume that the velocity of a particle in the ocean, with certain coordinates, will be determined by the corresponding water flow velocity at those coordinates. Let’s implement a procedure to track the position and movement of multiple particles as caused by the time-varying flow given in the data set. Also, let’s explain the procedure, and show that it works by providing examples and plots.
  • The above figure shows a simple flow system with four points, shown as white circles. Each point corresponds to a physical location also shown in kilometers. Attached to each point is a flow data point, which is shown as a blue arrow. Recall that flow data is given in the  and  direction. It is assumed that a particle moving in one of the zones or boxes acquires the velocity given by the flow data. This example shows that at time , a particle is located at a position x(t), shown as a green diamond.
  • After some time ε has passed, at time t+ε, the particle is at a new location x(t+ε) shown as a blue diamond. The new location can be computed as the previous location x(t) plus a displacement computed as the time-step  multiplied by the velocity at that point v(t) . Note that the velocity is given in kilometers per hour. Thus, the location x(t)  must be given in the same distance units, e.g., kilometers, and the time-step  should be given in hours.
  • The data provides a discretization of the ocean flow. The particles will, however, be moving on a continuous surface. For simplicity, let us assume that the surface is the plane R2. The data can be seen to provide flow information at integer points, namely at  (m, n) for m and n integers. Let’s divide the continuous surface into squares in such a way that each square contains a unique data point. One way to achieve this is to assign to every point in the surface the closest data point. For instance, given (x, y) ∈ R2, this consist of rounding both x and y to the closest integer. We may then suppose that each square has the same flow information as the data point it contains.
  • Now take a particle at (x, y) in a particular square. The flow in the square will displace it at the corresponding velocity. Once the particle moves out of this square, it is governed by the flow information of the next square that it enters.
  • The below figure shows a grid that we should have in mind when building the simulator. For each point, which corresponds to a physical location, there is a flow vector given, for each time step (recall there are 100 time instances).
Draw particle locations uniformly at random across the entire map, let’s not worry if some of them are placed on land. Let’s simulate the particle trajectories for 300 hours and provide a plot of the initial state, a plot of the final state, and two plots at intermediate states of the simulation. We may wish to draw colors at random for our particles in order to help distinguish them.



The following code shows the function that implements the simulation for the evolution of flow velocities. At a point (x,y), the velocity vector at time t is given by the tuple (u(t,y,x), v(t,y,x)), since in u and v matrices, x axis represents the columns and y axis represents the rows (x, y being integers, so that velocities are discretized). We can draw quivers with the position of a location and the directions from the flow matrices and evolve over time t.

def plot_flows_over_time(u, v, mask):
    xx, yy = np.meshgrid(range(0, mask.shape[1], 10), range(0, mask.shape[0], 10))
    for t in range(100):
        sns.heatmap(mask, cmap=pal, alpha=0.4, cbar=None)
        plt.quiver(xx, yy, u[t,yy,xx],v[t,yy,xx], scale=0.1, units='xy', angles='xy', \
                   color=[(np.random.random(),np.random.random(),np.random.random()) \ 
                   for _ in range([t,yy,xx].shape))]) #color=(uv, uv, uv))
        plt.title('T = {} hrs'.format(3*t), size=20)

The next figures show the initial, final and the intermediate flows.

The following animation shows the simulation of the flows at different point in time

Problem 3: Guess the location of the debris for a toy plane crash with simulation

A (toy) plane has crashed north of Palawan at T=0. The exact location is unknown, but data suggests that the location of the crash follows a Gaussian distribution with mean (100, 350) (namely (350 km, 1050 km) ) with variance σ2. The debris from the plane has been carried away by the ocean flow.

Let’s say we are about to lead a search expedition for the debris. Where would we expect the parts to be at 48 hrs, 72 hrs, 120 hrs? Study the problem by varying the variance of the Gaussian distribution. Either pick a few variance samples or sweep through the variances if desired. Let’s sample particles and track their evolution.


First let’s visualize how the mean location drifts with time, Let’s first create one single plot instead of three to show the location of the plane at different three times (shown as legends), since it’s more insightful and easier to find the trajectory this way.

Now, let’s sample 100 particles around the mean for 3 different variance values from the Gaussian distribution and visualize where the plane could have been at different time points.

The following animation shows how the debris of the toy plane can get carried away by the flows over time for σ=5:

  • It’s quite natural to assume a normal distribution of locations around which debris could have been placed. Since this approach provides us a controlled way to increase the search space (just increase σ), it can be efficient, since we have to explore less locations till finding the debris.
  • However, because we only have data for 100-time instances, separated 3 hours each, technically, we can only run such simulation for a total of 300 hours. Hence, simulation can’t extend beyond the time for which we don’t have any (flow) data (i.e., larger time scales).
  • Also, we can see from the previous exercise that in the debris simulation based on this time scale, the points did not move too much or too far.

Problem 4: Creating a Gaussian process model for the flow

In this problem, we shall create a Gaussian process model for the flow given the information we have. We shall perform a toy experiment where we can simulate larger time scales. Furthermore, we are going to model flows as a Gaussian process to estimate some unobserved variables.

  • Here onward, we are going to assume that the flow data available, instead of being given for measurements every 3 hours, is given for measurements every three days!. This implies that we can potentially simulate the movement of debris for almost one year.
  • However, to achieve a sufficiently smooth simulation, having flow data every three days seems quite a stretch. Whereas in the previous problems we assumed that the flow in the ocean remains approximately the same on a period of 3 hours, to assume that this is also the case for three days does not seem correct. For example, we would like to have time steps of 24 hours, 12 hours or possibly even smaller. 
  • NOTE: Of course we are not trying to make a perfect simulation of the ocean, nor an exact fluid dynamics simulations, so for the sake of simplicity and the toy experiment, we shall this simplistic scenario.
  • Let’s start by picking a location of from the map for which we are given flow data (ideally from a location on the ocean not in the land). Moreover, consider the two vectors containing the flow speed for each direction: we shall end up with two vectors of dimension 100.
  • Let’s find the parameters of the kernel function that best describes the data independently for each direction. That means we shall do the following for each direction, and obtain a model for each direction at that particular location, as per the following instructions.

The following figure shows the general ideas for a Gaussian Process model

Let’s recall the steps to estimate the parameters of the kernel function (for a GP).

  • Pick a kernel function: choose the proper kernel function.
  • Identify the parameters of the kernel function.
  • Find a suitable search space for each of the parameters.

We shall find a good set of parameters via cross-validation. Pick a number k and split the data k-wise defining some training and some testing points. Each vector is of dimension 100, so for k=10 we shall have ten different partitions, where there are 90 points for training and 10 points for testing.

For each possible set of parameters and each data partition.

  • Let’s build an estimate for the mean at each possible time instance. For example, we can estimate the mean as all zeros, or take averages – moving or otherwise – over our training data.
  • Construct the covariance matrix for the selected kernel functions and the selected set of parameters.
  • Compute the conditional mean and variance of the testing data points, given the mean and variance of the training data points and the data itself. Let’s recall the following for GP:
  • In this case,  X1 are the velocities for the testing data-points, X2 are the velocities for the training data-points. μ1 are the mean velocities for the testing data-points.  μ2 are the mean velocities for the training data-points.  Σ22 is the covariance of the training data-points.  Σ11 is the covariance of the testing data-points.  Σ12 is the cross-covariance. x2 are the observed velocities for the training data-points. Finally, τ is the parameter indicating the variance of the noise in the observations. We can pick τ=0.001.
  • Compute the log-likelihood performance for the selected parameters. We can select to compute it only on the testing data, or on the complete vector.
  • For each possible set of parameters, we shall then have the performance for each of the  partitions of our data. Find the parameters that maximize performance. Save the computed cost/performance metric for each choice of parameters, then create a plot over the search space that shows this metric.


Here using the same point (100, 350) where the toy plane crashed. The following figure shows the plot of the time series for u-flows and v-flows at the location.

Choice and the parameters of kernel function

Squared exponential (RBF) kernel was chosen as kernel function:

It makes sense to assume that the similarity between the flow velocities decrease exponentially with the square of the time difference they were recorded at.

Search space for each kernel parameter

The search space is the Cartesian product of the following values of the hyper-parameters l and σ (each of size 10, chosen at uniform gaps in between 0.1 and 5), as shown below, with 100 possible combinations of hyper-parameter values in the search space.

The number of folds (k) for the cross-validation

10-fold cross-validation was used (k=10). The following code shows how the average marginal log-likelihood over k folds was computed on the held-out test dataset. The prior mean (μ1) is set to the mean of all the 100 data points in the time series. The following code shows how the hyperparameter tuning is done with 10-fold cross validation and then optimal parameters are chosen by maximizing the marginal log-likelihood on the test dataset.

def compute_log_marginal_likelihood(y, μ, K):
    return -(y-μ).T@np.linalg.solve(K, (y-μ))/2 - log(round(np.linalg.det(K), 3))/2 - len(y)*np.log(2*np.pi)/2

def get_train_test_fold(z, i, k):
    sz = len(z) // k
    return np.concatenate([np.arange((i-1)*sz), np.arange(i*sz, k*sz)]), \
          np.concatenate([z[np.arange((i-1)*sz)], z[np.arange(i*sz, k*sz)]]), \
          np.arange((i-1)*sz, i*sz), z[(i-1)*sz:i*sz]

k = 10
z = u[:,x,y]
l = np.linspace(0.1, 5, 10)
σ = np.linspace(0.1, 5, 10)    
ll_mean_k_folds = np.zeros((len(l), len(σ)))
for i1 in range(1, k+1):
    x2, y2, x1, y1 = get_train_test_fold(z, i1, k)
    ll = np.zeros((len(l), len(σ)))
    for i in range(len(l)):
        for j in range(len(σ)):
            Σ22 = compute_sigma(x2, x2, σ[j], l[i])
            Σ12 = compute_sigma(x1, x2, σ[j], l[i])
            Σ21 = compute_sigma(x2, x1, σ[j], l[i])
            Σ11 = compute_sigma(x1, x1, σ[j], l[i])
            Σ1_2 = Σ11 - Σ12@(np.linalg.solve(Σ22 + τ**2*np.eye(len(y2)), Σ21)) 
            ll[i,j] = compute_log_marginal_likelihood(y1, Σ1_2)
    ll_mean_k_folds += ll
ll_mean_k_folds /= k

Optimal kernel parameters from the search

The following figures show the plots of the computed cost/performance metric over the search space for the kernel parameters.

For the GP model for the u-flow direction:

l = 1.733, σ=1.733, with corresponding maximum marginal-log-likelihood = -7.89, as shown in the next heatmap.

For the GP model for the v-flow direction: l = 1.733, σ=1.733, with corresponding maximum marginal-log-likelihood = -7.67, as shown in the next heatmap.

Let’s compute the optimal kernel values for three new locations that are different from the last location, as shown below:

  • (407,350): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction
  • (100,33): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction
  • (8,450): optimal GP hyper-parameters obtained with 10-fold CV: l = 1.733, σ = 1.733 for u-flow and l = 1.733, σ = 1.733 for v-flow direction

For both the hyperparameters, we get maximum average log-likelihood on held-out folds with the same values of l (length-scale) and σ parameters, typically at l = 1.733, σ = 1.733.

Impact of τ value on the optimal hyperparameter selection

We have suggested one particular value for τ. Let’s consider other possible values and observe the effects such parameter has on the estimated parameters and the estimation process’s performance.

  • (407,350), τ=0.1: optimal GP hyper-parameters obtained with 10-fold CV: l = 2.278 , σ = 3.367 for u-flow and l = 2.278, σ = 3.367 for v-flow direction.

Heatmap for u flow

Heatmap for v flow

  • (407,350), τ=1: optimal GP hyper-parameters obtained with 10-fold CV: l = 2.822 , σ = 5.0 for u-flow and l = 2.278, σ = 2.822 for v-flow direction.

Heatmap for u flow

Heatmap for v flow

As can be seen from above, the optimal values for τ=0.1 and τ=1 differ from those found in with τ=0.001, and also they differ from each other too.

Problem 5: Using with Gpy package in python and comparing with above results

Currently, most of the commonly used languages like Python, R, Matlab, etc., have pre-installed libraries for Gaussian processes. Let’s use one library of your choice, maybe the language or environment, and compare the obtained results.


  • Optimal kernel parameters as found through the software library (length scale) l = 3.465, σ = 0.367 (as shown in the next figure)
  • Details on the library used: Gpy – A Gaussian Process (GP) framework in Python
  • The following code snippet shows how to perform the hyperparameter tuning and prediction with GP using the Gpy modules’ functions:
import GPy
kernel = GPy.kern.RBF(input_dim=1, variance=1., lengthscale=1.)
model = GPy.models.GPRegression(np.arange(100).reshape(-1,1), z.reshape(-1,1), kernel)
model.noise_variance = .001
model.optimize(messages=True, max_iters=1e5)
  • The optimal hyperparameters seem to be little different from the ones obtained without using the library. This may be due to the grid search values used, different hyperparameter tuning method is used, and / or different initialization values used for the prior mean and difference in the noise variances used. The next figures show the comparison of the predictions generated (the predictions for the u-flow direction is shown).

Problem 6: Estimating unobserved flow data.

In the previous problem, we have found a good set of parameters to model the sequence of speeds at one location as a Gaussian process. Recall that we have assumed our 100 observations came at a rate of one every three days. We are going to assume that when we advance to our simulations, we will choose a smaller time step. Thus, we need to interpolate how the flow would look like at some unobserved points.

Let’s say we are given flow information every three days. Let’s pick some time stamps in-between each observation for which to estimate the flow. For example, if we want flows every day, there will be two unknown points between two observations. We could pick only one, or more than two.

  • Let’s compute the conditional distribution (mean and covariance) at the time locations selected in the previous solution.
  • Use the kernel parameters that you obtained earlier, and use the same location as we did in the previous problem.
  • For the initial estimate of the mean at the unknown time locations, we can use zero, use the average of all the observations, or take the average of the two closest observed points.
  • Let’s plot the predictions, clearly showing:
    • The predicted means.
    • The predicted standard deviation as a 3σ band (three standard deviations above and below the mean).
    • The observed data points.


  • The predicted means are shown in as blue continuous line and the observed data points are shown as red stars in the next figure.
  • Choice of time-stamps at which to create predictions: Wanted flows every day, so two unknown points were chosen between every two observations, so in total we have 300 datapoints to predict, corresponding to 300 days.
  • The prior means were chosen as the mean of the entire (observed) dataset (i.e., u and v velocities corresponding to the 100 points, respectively).
  • Predictions for the horizontal velocity components at the chosen location (100, 350) at different points in time is shown in the next figure.
  • Predictions for the vertical velocity components at the chosen location at different points in time is shown in the next figure.

The following code snippet used to compute the prediction with GP using our own implementation of GP (also find the code here):

z = u[:,x,y] #v[:,x,y]
μ1 = np.mean(z)
x = np.arange(0, 100)
x_ = np.linspace(0, 99, 300)
#μ1 = np.zeros(len(x_))
Σ22 = compute_sigma(x, x, σ, l)
Σ12 = compute_sigma(x_, x, σ, l)
Σ21 = compute_sigma(x, x_, σ, l)
Σ11 = compute_sigma(x_, x_, σ, l)
μ2 = np.mean(z)
μ1_2 = μ1 + Σ12@(np.linalg.solve(Σ22 + τ**2*np.eye(len(z)), z - μ2)) 
Σ1_2 = Σ11 - Σ12@(np.linalg.solve(Σ22 + τ**2*np.eye(len(z)), Σ21)) 

The next animation shows the effect of the hyperparameter (length-scale and variance for the RBF kernel) values on the fitted GP and their impact on prediction:

Problem 7: A longer time-scale simulation with Gaussian Process

In the previous problems, we learned to model the flow at one location as a Gaussian process. Thus, we can extend this to estimate the flow at any point in time at that particular location using the kernel function parameters. At a certain point in time, the flow can be computed as the realization of a multivariate Gaussian random variable with parameters given by the conditional distributions given the flow data. At this point, you are asked to simulate a particle moving according to the flow data and using the estimates for times between the original timestamps of the problem.

  • Ideally, one would have to estimate the parameters of the flow at every point in the map. However, having to run  parameter selection models seems like much computational work. So, here we take a more straightforward approach: use the results from the previous problem to choose a value of our kernel parameters that is generally representative of the points that we tested.
  • Let’s modify the simulator that we built earlier to use this new flow estimated flow information. Note that with this new change, we should be able to simulate the flow of particles for 300 days! Regarding data, originally, we have 100 measurements per point, now with this approach, let us say we use the estimates for two extra points to get one flow data per day, so in total, we should have at your disposal 300 descriptions of flow per location.
  • Let’s now repeat the simulation problem with GP. This time we shall be simulating flows for 300 days. This allows some debris to arrive on land. Let’s investigate where are some possible places along the coast where one could find debris? Again, to do this, let’s choose a σ, and simulate the movement of particles with initial location sampled from the bivariate Gaussian. Evolve the location of the particles. Some times particles trajectories will terminate on the shore. Continue to keep track of such particles. These points are likely where we could find the debris.
  • Provide a plot that includes your initial, final, and at least one intermediate state of your simulation. For the final state, clearly mark one location on land where we would search for debris. Also mark one location over the ocean where we would search for debris.
  • Try at least one other value for σ, and create the same three plots.


  • Plot with the initial / intermediate / final state of the simulation with GP (used 10 sample points from the Gaussian distribution with mean location (100,350) and s.d. 5) is shown in the next figure:
  • Marking a location on the coast and another one over the ocean of the final state of the simulation where one should search for debris: as can be seen from next figure, the debris is most likely to land at (112, 380), keeping in view of the smoothness introduced by GP while getting carried away.
  • Plots (initial, intermediate, final) for one other choice of σ.

As can be seen the conclusion changes with higher standard deviation, the debris are more dispersed and can potentially land at any of the above two locations.

To be continued…

Implementing a few algorithms with python from scratch

In this blog, we shall focus on implementing a few famous algorithms with python – the algorithms will be from various topics from computer science, such as graph theory, compiler construction, theory of computation, numerical analysis, data structures, digital logic, networking, operating systems, DBMS, cryptography, optimization, quantum computation etc. and all of the implementations will be from scratch.

Let’s start with a problem called Segmented Least Squares, which will be solved using Dynamic Programming.

Segmented Least Squares

  • Least squares is a foundational problem in statistics and numerical analysis. Given n points in the plane: (x1, y1), (x2, y2) , . . . , (xn, yn), the objective is to fit a line y = ax + b that minimizes the sum of the squared error.
  • The Segmented least squares is a more general problem, where
    • The data points lie roughly on a sequence of several line segments.
    • Given n points in the plane (x1, y1), (x2, y2) , . . . , (xn, yn), with x1 < x2 < … < xn the objective is to find a sequence of lines that minimizes f(x), with a reasonable choice for f(x) to balance accuracy (goodness of fit) and parsimony (number of lines).
    • The goal is to minimize the sum of
      • the sums of the squared errors E in each segment
      • the number of lines L
      • the Tradeoff function: E + c L, for some constant c > 0 (cost of a line).

The following figure shows how the dynamic programming for the segmented least squares problem is formulated:

Here, M[j] represents the minimum error (regression) line fitted on the points i to j, we can track the starting point i with minimum error (MSE) by keeping a back pointer array along with the dynamic programming array. Also, c denotes the cost to draw a line (acts as a penalty on number of lines fit). The optimal substructure property is by Bellman equation.

Here is my python implementation for the above DP algorithm, on the following 2D dataset with points (xs, ys) (scatter plotted below):

def ls_fit(xs, ys, m):
    a = (m*sum(xs*ys)-sum(xs)*sum(ys)) / (m*sum(xs**2)-sum(xs)**2)
    b = (sum(ys)-a*sum(xs)) / m
    return a, b

def compute_errors(xs, ys):
    n = len(xs)
    e = np.zeros((n,n))
    for j in range(n):
        for i in range(j+1):
            m = j-i+1
            if m > 1:
                a, b = ls_fit(xs[i:i+m], ys[i:i+m], m)
                e[i,j] = sum((ys[i:i+m] - a*xs[i:i+m] - b)**2)
    return e

def build_DP_table(e, n):
    M = np.zeros(n)
    p = np.zeros(n, dtype=int) # backpointers
    for j in range(1, n):
        cost = [e[i,j] + c + M[i-1] for i in range(j)]
        M[j] = np.min(cost)
        p[j] = np.argmin(cost)
    return M, p

Now plot the least square line segments obtained with the dynamic programming formulation:

c = 10
tol = 2
starts = np.unique(p)
drawn = set([])
plt.plot(xs, ys, 'g.')
for start in starts:
    indices = np.where(abs(p-start) < tol)[0]
    a, b = ls_fit(xs[indices], ys[indices], len(indices))
    if not (a, b) in drawn:
        plt.plot([xs[min(indices)],xs[max(indices)]], [a*xs[min(indices)]+b, a*xs[max(indices)]+b], linewidth=3, 
                 label='line: ({:.2f}, {:.2f})'.format(a,b))

As expected, the DP found the 3 optimal least-square lines fitted on the data. The following figure shows how the cost of a line c can impact the number of lines created with the dynamic programming method:

The Havel-Hakimi Algorithm

Now, let’s concentrate on a problem from graph theory. Given degree sequence of an undirected graph, let’s say we have to determine whether the sequence is graphic or not. If graphic, we have to draw a graph with the same degree sequence as input. We shall use the Havel-Hakimi algorithm to accomplish this.

By the Havel-Hakimi theorem from, we have the following:

from which we have the following algorithm:

Iteratively execute the following steps

  • Sort the degree sequence list in non-increasing order.
  • Extract (remove) the max-degree vertex of degree d from the sorted degree-sequence list.
  • Decrement the degrees of next d vertices in the list
  • If there are not enough vertices in the list, or degree of some vertex becomes negative, return False
  • Stop and return True if list of all zeros remains.

Here is how we can implement the above Havel-Hakimi algorithm to draw a simple graph, given its degree sequence, provided such a graph exists:

import networkx as nx
import numpy as np

def get_adj_list(deg_seq):        
    deg_seq = np.array(deg_seq)
    labels = np.arange(len(deg_seq))
    adj_list = {}
    while True:
        indices = np.argsort(-deg_seq)
        deg_seq = deg_seq[indices]
        labels = labels[indices]
        if all(deg_seq == 0):
            return adj_list     
        v = deg_seq[0]            
        if v > len(deg_seq[1:]):
            return None             
        for i in range(1,v+1):
            deg_seq[i] -= 1
            if deg_seq[i] < 0:
                return None
            adj_list[labels[0]] = adj_list.get(labels[0], []) + [labels[i]]            
        deg_seq[0] = 0            
    return None     
deg_seqs = [[5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3], 
            [6, 3, 3, 3, 3, 2, 2, 2, 2, 2,1,1],
            [3, 3, 3, 3]]

for deg_seq in deg_seqs:
    adj_list = get_adj_list(deg_seq)
    if adj_list:
        print('The graph with {} can be drawn'.format(deg_seq))
        G = nx.from_dict_of_lists(adj_list)#.to_undirected()
        nx.draw(G, pos=nx.spring_layout(G), with_labels=True)
        print('The graph with {} can\'t be drawn'.format(deg_seq))    

The following animations show how the graphs are drawn with the algorithm, given the corresponding degree-sequence lists.

Graph with degree sequence [5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3]:
Graph with degree sequence [6, 3, 3, 3, 3, 2, 2, 2, 2, 2,1,1]:

Graph with degree sequence [3, 3, 3, 3]:

Encoding / Decoding with Prüfer sequences

Again we shall concentrate on implementation another graph theory algorithm that uniquely encodes a labelled tree into a sequence and decodes it back. Let’s say we want to constructing all possible spanning trees of the complete graph with K vertices, i.e. Kn. By Cayley’s theorem, we can count number of different possible spanning trees of Kn – there are exactly n^(n-2) of them. Using Prüfer sequences, we can show that there are exactly as many Prufer seuqences as we have number of spanning trees, encode a spanning tree into a unique sequence and decode it back.

Prüfer Encoding

The following algorithm can be used to generate the Prüfer code (of length n-2) given a tree on n vertices – there is a one-to-one mapping between the Prüfer sequence and a spanning tree of Kn (complete graph with n vertices).

The following python code implements the above algorithm:
def get_degree_sequence(tree):
    ds, nbrs = {}, {}
    for node in tree:
        for nbr in tree[node]:
            ds[node] = ds.get(node, 0) + 1
            ds[nbr] = ds.get(nbr, 0) + 1
            nbrs[node] = nbrs.get(node, []) + [nbr]
            nbrs[nbr] = nbrs.get(nbr, []) + [node]
    return ds, nbrs

def get_prufer_seq(tree):
    ds, nbrs = get_degree_sequence(tree)
    seq = []
    while len(ds) > 2:
        min_leaf = min(list(filter(lambda x: ds[x] == 1, ds)))
        parent = nbrs[min_leaf][0]
        ds[parent] -= 1
        del ds[min_leaf]
        del nbrs[min_leaf]
    return seq

Invoke the function with the input tree to get the Prüfer code corresponding to the tree, as shown below:

T = {1:[2, 4, 5], 2:[7], 7:[3], 3:[6]}
# [1, 1, 2, 7, 3]

The following animation shows the steps in Prüfer code generation:

Prüfer Decoding

We can use the following algorithm (from here) to generate a spanning tree (on nn vertices) of Kn, given the corresponding Prüfer sequence (of length n−2), since there exists a 1-1 mapping always, in between set of all spanning trees of labeled graph Kn (there are n^(n−2) spanning trees of Kn by Cayley’s theorem) and set of all Prüfer sequences of length n−2.

The following python code implements the above algorithm:
def get_tree(S):
    n = len(S)
    L = set(range(1, n+2+1))
    tree_edges = []
    for i in range(n):
        u, v = S[0], min(L - set(S))
    tree_edges.append((L.pop(), L.pop()))
    return tree_edges

Invoking the above function on a Prüfer sequenceof length 9, we can obtain the corresponding spanning tree of K11, as shown below:

S = [6,2,2,6,2,5,10,9,9]
T_E = get_tree(S)

The next figure shows the final tree:

The following animation shows the tree-building steps:

Generate all Spanning Trees of Kn

We can use Prüfer sequences (of length n−2) to find the labeled spanning trees for Kn, using the above decoding algorithm, by Cayley’s theorem the number of spanning trees are n^(n−2).

For n=5, there are 5^3=125 such spanning trees on 5 labeled vertices, as can be computed using the above algorithm and seen from the following animation:

Generate Random Trees

Using Prüfer sequences again, here is how we can generate a 20-node labeled random tree (in python):

  1. start from a randomly generated length-18 sequence with each element chosen from the set {1,2,…,20}.
  2. use the generated string as Prufer sequence for the spanning tree for the complete graph K20 on 20 vertices and generate the corresponding labeled tree (since there is a 1-1 correspondence always) using the above decoding algorithm (from here).

Now, we can always generate a prufer sequence (of length n-2) randomly and subsequently generate the corresponding spanning tree (on n vertices) with the function get_tree(), which can serve as our random tree (can be thought of to be randomly sampled from the set of n^(n-2) spanning trees of Kn).

n = 20 # Kn with n vertices
N = 25 # generate 25 random trees with 20 vertices (as spanning trees of K20)
for i in range(N):
    S = np.random.choice(range(1,n+1), n-2, replace=True).tolist()
    T_E = get_tree(S) # the spanning tree corresponding to S
    # plot the tree generated (with `networkx`, e.g.,)

The next animation shows a few such randomly generated labeled trees with 20 nodes.

Construction of an LR(0) Parser

In this problem we shall try to implement an algorithm with compiler-construction with python. The LR-parsing is Non-Backtracking Shift-Reduce Bottom-Up Parser. It scans input string from left-to-right and use left most derivation in reverse.

  • The general idea is to shift some symbols of input to the stack until a reduction can be applied
  • At each reduction step, if a specific substring is matched, then the body of a production is replaced by the Non Terminal at the head of the production
  • The LR parser consists of 1) Input 2)Output 3)Stack 4) Driver Program 5) Parsing Table, as shown in the next figure
  • The Driver Program is same for all LR Parsers.
  • Only the Parsing Table changes from one parser to the other.
  • The difference in between the individual parsers in the class of bottom-up LR parsers is whether they result in shift/reduce or reduce/reduce conflicts when generating the parsing tables. The less it will have the conflicts, the more powerful will be the grammar (LR(0) < SLR(1) < LALR(1) < CLR(1)).
  • The LR Shift-Reduce Parsers can be efficiently implemented by computing an LR(0) automaton and a parsing table to guide the processing.
  • The Parsing Table consists of two parts:
    • A Parsing Action Function and
    • A GOTO function.
  • We shall now focus on the implementation of the basic LR(0) parser for a few simple grammars, which does not use any lookahead and then see how it can be improved to a more powerful SLR(1) parser, with a single lookahead symbol.

For example, consider the following expression grammar:

E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id

It’s not LR(0) but SLR(1). Using the following code, we can construct the LR0 automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, compute the action and goto sets):

from copy import deepcopy
import pandas as pd

def update_items(I, C):
    if len(I) == 0:
         return C
    for nt in C:
         Int = I.get(nt, [])
         for r in C.get(nt, []):
              if not r in Int:
          I[nt] = Int
     return I

def compute_action_goto(I, I0, sym, NTs): 
    #I0 = deepcopy(I0)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            #if ix == len(r)-1: # reduce step
            if ix >= len(r)-1 or r[ix+1] != sym:
            r[ix:ix+2] = r[ix:ix+2][::-1]    # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)):0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states[id]
        # compute goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs) 
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in statess:
                    cur += 1
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...
    return states2items, items2states, parse_table
states, statess, parse_table = construct_LR0_automaton(G, NTs, Ts)

where the grammar G, non-terminal and terminal symbols are defined as below

G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]

Here are few more useful function I implemented along with the above ones for LR(0) parsing table generation:

def augment(G, S): # start symbol S
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
    return S

The following figure (expand it to view) shows the LR0 DFA constructed for the grammar using the above code:

The following table shows the LR(0) parsing table generated as a pandas dataframe, notice that there are couple of shift/reduce conflicts, indicating that the grammar is not LR(0).

SLR(1) parser avoids the above shift / reduce conflicts by reducing only if the next input token is a member of the Follow Set of the nonterminal being reduced. The following parse table is generated by SLR:

  • The LR driver Program determines Sm, the state on top of the stack and ai , the Current Input symbol.
  • It then consults Action[ Sm, ai] which can take one of four values:
    • Shift
    • Reduce
    • Accept
    • Error
  • The next code snippet implements a driver program.
def parse(input, parse_table, rules):
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    df = df.append({'stack':0, 'input':''.join(input), 'action':''}, ignore_index = True)
    i, k = 0, 0
    accepted = False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == ''.join(r):
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(int(goto)) #[-1]))
        elif action == 'acc':
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': ''.join(input[i:]), 
               'action': 'shift' if action[0] == 's' else 'accept' if action == 'acc' else 
                         'reduce by rule {}'.format(rules[int(action[-1])]['l'] + '-->' + ''.join(rules[int(action[-1])]['r']))}
        df = df.append(df2, ignore_index = True)
        k += 1
        if accepted:
    return df

input = ['id', '*', 'id', '+', 'id', '*', 'id', '+', 'id', '$'] #'aaacbbb$' '(', , ')' #list('aaacbbb$') #
parse(input, parse_table, rules)
  • The following animation shows how an input expression is parsed by the driver of the above SLR(1) grammar:

But, the following grammar which accepts the strings of the form a^ncb^n, n >= 1 is LR(0):

S → A
A → a A b
A → c

# S --> A 
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]

As can be seen from the following figure, there is no conflict in the parsing table generated.

Here is how the input string a^2cb^2 can be parsed using the above LR(0) parse table, using the following code:
def parse(input, parse_table, rules):
    input = 'aaacbbb$'
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
        elif action == 'acc':  # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = df.append(df2, ignore_index = True)
        if accepted:
    return df

parse(input, parse_table, rules)

The next animation shows how the input string aacbb is parsed with LR(0) parser using the above code:

Deterministic Pushdown Automata

Let’s now focus on simulating an automaton from the theory of computation. The PDA is an automaton equivalent to the CFG in language-defining power. Only the nondeterministic PDA defines all
the CFL’s. But the deterministic version models parsers. Most programming languages have
deterministic PDA’s.

A PDA is described by:

  1. A finite set of states (Q).
  2. An input alphabet (Σ).
  3. A stack alphabet (Γ).
  4. A transition function (δ).
  5. A start state (q0, in Q).
  6. A start symbol (Z0, in Γ).
  7. A set of final states (F ⊆ Q).

If δ(q, a, Z) contains (p,α) among its actions, then one thing the PDA can do in state q, with
a at the front of the input, and Z on top of the stack is:

  1. Change the state to p.
  2. Remove a from the front of the input (but a may be ε).
  3. Replace Z on the top of the stack by α

More specifically, let’s implement a Deterministic Pushdown Automata (DPDA) for the DCFL L = a^nb^n | n >=1 with python.

Here is how we can implement the class DPDA for the CFL a^nb^n, using the following states, stack symbols and transition function from here:

The states shown in the above figure are:
* q = start state. We are in state q if we have seen only 0’s so far.
* p = we’ve seen at least one 1 and may now proceed only if the inputs are 1’s.
* acc = final state; accept.

class DPDA:
    def __init__(self, trf, input, state):
        self.head = 0
        self.trf = {}
        self.state = str(state)
        self.input = input
        self.trf = trf
        self.stack = ['Z']
    def step(self):
        a = self.input[self.head]
        s = self.stack.pop()
        state, ss = self.trf.get((self.state, a, s))
        if ss != 'ε':
            for s in ss[::-1]:
        self.state = state
        print('{:20s} [{:10s}] {:5s}'.format(self.input[self.head:], 
                       ''.join(self.stack), self.state))        
        self.head += 1
    def run(self):
        print('{:20s} [{:10s}] {:5s}'.format(self.input[self.head:], 
                              ''.join(self.stack), self.state))
        while self.head  < len(self.input):

        s = self.stack.pop()        
        if self.trf.get((self.state, 'ε', s)):
            state, ss = self.trf.get((self.state, 'ε', s))
            self.state = state        
            print('{:20s} [{:10s}] {:5s}'.format('ε', 
                 ''.join(self.stack), self.state))
# run DPDA to accept the input string a^9b^9
DPDA({('q', 'a', 'Z'): ('q', 'XZ'),
     ('q', 'a', 'X'): ('q', 'XX'),
     ('q', 'b', 'X'): ('p', 'ε'),
     ('p', 'b', 'X'): ('p', 'ε'),
     ('p', 'ε', 'Z'): ('acc', 'Z'),
    'aaaaaaaaabbbbbbbbb', 'q').run()

#input                #stack       #state
#aaaaaaaaabbbbbbbbb   [Z         ] q    
#aaaaaaaaabbbbbbbbb   [ZX        ] q    
#aaaaaaaabbbbbbbbb    [ZXX       ] q    
#aaaaaaabbbbbbbbb     [ZXXX      ] q    
#aaaaaabbbbbbbbb      [ZXXXX     ] q    
#aaaaabbbbbbbbb       [ZXXXXX    ] q    
#aaaabbbbbbbbb        [ZXXXXXX   ] q    
#aaabbbbbbbbb         [ZXXXXXXX  ] q    
#aabbbbbbbbb          [ZXXXXXXXX ] q    
#abbbbbbbbb           [ZXXXXXXXXX] q    
#bbbbbbbbb            [ZXXXXXXXX ] p    
#bbbbbbbb             [ZXXXXXXX  ] p    
#bbbbbbb              [ZXXXXXX   ] p    
#bbbbbb               [ZXXXXX    ] p    
#bbbbb                [ZXXXX     ] p    
#bbbb                 [ZXXX      ] p    
#bbb                  [ZXX       ] p    
#bb                   [ZX        ] p    
#b                    [Z         ] p    
#ε                    [Z         ] acc  

The next animation shows how the string is accepted by the above DPDA:

Simplification of a Boolean function with K-map

In this problem, we shall aim at implementing a greedy version of the K-map algorithm to represent a Boolean function in SOP with minimum number of terms for 4 variables. for 4 variables. The function accepts the Boolean function in SOP (sum of products) form and the names of the variables and returns a simplified reduced representation. Basically you need to create rectangular groups containing total terms in power of two like 8, 4, 2 and try to cover as many elements as you can in one group (we need to cover all the ones).

For example, the function


can be represented as


As can be seen from the output of the next code snippet, the program outputs the simplified form BD’+A’BC’+AB’C’+ABC+A’B’DBD’+A’BC’+AB’C’+ABC+A’B’D, where negation of a boolean variable A is represented A’ and equivalently as ¬A in the code.

from collections import defaultdict
from itertools import permutations, product
def kv_map(sop, vars):
    sop = set(sop)
    not_covered = sop.copy()
    sop_covered = set([])
    mts = [] # minterms
    # check for minterms with 1 variable
    all_3 = [''.join(x) for x in product('01', repeat=3)]
    for i in range(4):
        for v_i in [0,1]:
                if len(not_covered) == 0: continue
                mt = ('' if v_i else '¬') + vars[i]
                s = [x[:i]+str(v_i)+x[i:] for x in all_3]
                sop1 = set(map(lambda x: int(x,2), s))
                if len(sop1 & sop) == 8 and len(sop_covered & sop1) < 8: # if not already covered
                    sop_covered |= sop1
                    not_covered = not_covered - sop1
        if len(not_covered) == 0:
           return mts
    # check for minterms with 2 variables
    all_2 = [''.join(x) for x in product('01', repeat=2)]
    for i in range(4):
        for j in range(i+1, 4):
            for v_i in [0,1]:
                for v_j in [0,1]:
                    if len(not_covered) == 0: continue
                    mt = ('' if v_i else '¬') + vars[i] + ('' if v_j else '¬') + vars[j]
                    s = [x[:i]+str(v_i)+x[i:] for x in all_2]
                    s = [x[:j]+str(v_j)+x[j:] for x in s]
                    sop1 = set(map(lambda x: int(x,2), s))
                    if len(sop1 & sop) == 4 and len(sop_covered & sop1) < 4: # if not already covered
                        sop_covered |= sop1
                        not_covered = not_covered - sop1
    if len(not_covered) == 0:
        return mts

    # check for minterms with 3 variables similarly (code omitted)
    # ... ... ...
    return mts
kv_map([1,3,4,5,6,8,9,12,14,15], ['A', 'B', 'C', 'D'])
# ['B¬D', '¬AB¬C', 'A¬B¬C', 'ABC', '¬A¬BD']

The following animation shows how the above code (greedily) simplifies the Boolean function given in SOP form (the basic goal is to cover all the 11s with minimum number of power-2 blocks).

Since the algorithm is greedy it may get stuck to some local minimum, that we need to be careful about (can be improved!).


Now, let’s implement a very popular algorithm for numerical analysis, namely the Newton-Raphson algorithm, with which roots of functions can be found. The following figure shows the iterative update steps for the algorithm to compute a root of a function f(x). We start with an initial guess and then go by the tangent at the current location to obtain the next guess and finally converge to the root.

Now let’s say we want to approximate 1/a, for an integer a. In order to compute 1/a, for a given a, we can try to find the roots of f(x)=a−1/x=0, s. t., f′(x)=1/x^2 and we have the following iterative update equation till convergence:

Use the following python code to implement the above algorithm:

def f(x):
    return a - 1/x

def df(x):
    return 1/x**2

def newton_raphson(f, df, x, ϵ): 
    x1 = x
    while True:
        x1 -= f(x) / df(x)
        if abs(x1 - x) < ϵ: # converged?
        x = x1
    return x

a = 3
ϵ= 1e-6 # accuracy
x = 0.6 # initial guess
newton_raphson(f, df, x, ϵ)
# 0.3333331240966088

The following animations show how the algorithm converges with the output 0.33333:

Double Hashing

Now, let’s try to implement a hashing algorithm from data structures. In general, here is how we resolve collision with double-hashing: use the second hash function if there is a collision as follows, to find the next location in hash table T, as shown below:

If there is collision even using the composite hash function, the probing continues for |T| times and if still the collision is not resolved, the element can’t be inserted into the hash table using double hashing (use separate-chaining etc. in this case).

Now, let’s implement double hashing with the following code:

def h1(k, m):
    return (2*k+3)%m

def h2(k, m):
    return (3*k+1)%m

def resolve_collision_with_double_hashing(hastable, keys, m, h1, h2):
    for k in keys:
        index = h1(k, m)
        if not hastable[index]: # no collision
            hastable[index] = k
        else:         # use double-hashing
            v = h2(k, m)
            inserted = False
            i = 1 # no need to check for i = 0, since collision already occurred
            while i < m: 
                index1 =  (index +  v * i) % m
                i += 1
                print('inserting {}, number of probings: {}'.format(k, i))
                if not hastable[index1]:
                    hastable[index1], inserted = k, True
            if not inserted:
                print('could not insert {}'.format(k))

print('hash table: ' + ' '.join(map(lambda x: str(x) if x else '', hastable)))

m = 11
hashtable = [None]*m
keys = [3,2,9,6,11,13,7,1,12,22]
resolve_collision_with_double_hashing(hashtable, keys, m, h1, h2)

# trying to insert 13, number of probings: 2
# trying to insert 13, number of probings: 3
# trying to insert 13, number of probings: 4
# inserted 13
# trying to insert 7, number of probings: 2
# trying to insert 7, number of probings: 3
# trying to insert 7, number of probings: 4
# trying to insert 7, number of probings: 5
# trying to insert 7, number of probings: 6
# trying to insert 7, number of probings: 7
# trying to insert 7, number of probings: 8
# trying to insert 7, number of probings: 9
# trying to insert 7, number of probings: 10
# trying to insert 7, number of probings: 11
# could not insert 7
# trying to insert 12, number of probings: 2
# trying to insert 12, number of probings: 3
# inserted 12
# trying to insert 22, number of probings: 2
# trying to insert 22, number of probings: 3
# trying to insert 22, number of probings: 4
# trying to insert 22, number of probings: 5
# trying to insert 22, number of probings: 6
# inserted 22
# hash table: _ _ 12 11 6 1 13 2 22 3 9

The following animation shows how the keys are inserted in the hash table and collision resolution was attempted using double hashing:

The CYK Algorithm

Let’s focus again on another famous problem from compilers, which is known as the membership problem:

• Given a context-free grammar G and a string w
– G = (V, ∑ ,P , S) where
» V finite set of variables
» ∑ (the alphabet) finite set of terminal symbols
» P finite set of rules
» S start symbol (distinguished element of V)
» V and ∑ are assumed to be disjoint
– G is used to generate the string of a language

– The question we try to answer is the following:
• Is w in L(G)?

The CYK algorithm solves the membership problem for a CFG, with the following assumption:

  • The Structure of the rules in a Chomsky Normal Form (CNF) grammar
    • Context-free grammar is in CNF if each rule has one of the following forms:
      • A –> BC at most 2 symbols (variables / non-terminals) on right side
      • A –> a, or terminal symbol
      • S –> λ, null string
      • here B, C Є V – {S}
  • It uses a “dynamic programming” or “table-filling algorithm”

The following section describes the algorithm steps:

1. Given an input string w of length n, construct a table DP for size n × n.
2. If w = e (empty string) and S -> e is a rule in G then we accept the string else we reject.
3. For i = 1 to n:
     For each variable A:
       We check if A -> b is a rule and b = wi for some i:
        If so, we place A in cell (i, i) of our table. 
4. For l = 2 to n:
     For i = 1 to n-l+1:
       j = i+l-1
       For k = i to j-1:
          For each rule A -> BC: 
       We check if (i, k) cell contains B and (k + 1, j) cell contains C:
           If so, we put A in cell (i, j) of our table. 
5. We check if S is in (1, n):
   If so, we accept the string
   Else, we reject.

The following `python` code shows how to implement the algorithm for a given CFG and how it solves the membership problem given an input string:

Let’s use the above implementation of algorithm for the following simple CFG G (already in CNF):

S -> AB | BC

A -> BA | a

B -> CC | b

C -> AB | a

and the input string w = baaba to test membership of w in L(G).

# first use a data structure to store the given CFG

def get_grammar(rules):
    G = {}
    for rule in rules:
        rule = rule.replace(' ', '')
        lhs, rhs = rule.split('->')
        for r in rhs.split('|'):
            G[lhs] = G.get(lhs, []) + [r]
    return G

NTs = ['S', 'A', 'B', 'C', 'D']
Ts = ['a', 'b']
rules = ['S -> AB | BC', 'A -> BA | a', 'B -> CC | b', 'C -> AB | a'] #, 'D -> ϵ']
G = get_grammar(rules)
# {'S': ['AB', 'BC'], 'A': ['BA', 'a'], 'B': ['CC', 'b'], 'C': ['AB', 'a']}

# now check if the grammar is in the chomosky normal form (CNF):
def is_in_CNF(G, NTs, Ts):
    for lhs in G.keys():
        if lhs in NTs:
            for rhs in G[lhs]:
                if len(rhs) == 2:   # of the form A -> BC
                    if not rhs[0] in NTs or not rhs[1] in NTs:
                        return False
                elif len(rhs) == 1: # of the form S -> a
                    if not rhs in Ts + ['ϵ']:
                        return False
                    return False
    return True

is_in_CNF(G, NTs, Ts)
# True
import numpy as np
import pandas as pd

def is_in_cartesian_prod(x, y, r):
    return r in [i+j for i in x.split(',') for j in y.split(',')]

def accept_CYK(w, G, S):
    if w == 'ϵ':
        return 'ϵ' in G[S]
    n = len(w)
    DP_table = [['']*n for _ in range(n)]
    for i in range(n):
        for lhs in G.keys():
            for rhs in G[lhs]:
                 if w[i] == rhs: # rules of the form A -> a
                    DP_table[i][i] = lhs if not DP_table[i][i] else DP_table[i][i] + ',' + lhs
    for l in range(2, n+1):       # span
        for i in range(n-l+1):    # start
            j = i+l-1                    # right
            for k in range(i, j):     # partition
                for lhs in G.keys():
                    for rhs in G[lhs]:
                        if len(rhs) == 2: #rules of form A -> BC
                            if is_in_cartesian_prod(DP_table[i][k], DP_table[k+1][j], rhs):
                                if not lhs in DP_table[i][j]:
                                    DP_table[i][j] = lhs if not DP_table[i][j] else DP_table[i][j] + ',' + lhs

    return S in DP_table[0][n-1]  

accept_CYK('baaba', G, 'S')
# True

The following animation shows how the algorithm constructs the dynamic programming table for the following simple grammar:

CRC Polynomial Division

Let’s now concentrate on how to encode data with an error-checking code, namely, the cyclic redundancy code (CRC), in order to detect any error at the time of communication. We need to follow the following algorithm (given by the steps listed below) to compute the bits to be appended to the data to encode the data and send it from the transmitter side, given the CRC generating polynomial and the data polynomial, and then at the receiver end use the appended bits to check if there is a possible corruption in any data bit in the communication channel:

  • Convert CRC / data polynomials to corresponding binary equivalents.
  • if the CRC key (binary representation obtained from the polynomial) has k bits, we need to pad an additional k-1 bits with the data to check for errors. In the example given, the bits 011 should be appended to the data, not 0011, since k=4.
  • At the transmitter end
    • The binary data is to be augmented first by adding k-1 zeros in the end of the data.
    • Use modulo-2 binary division to divide binary data by the CRC key and store remainder of division.
    • Append the remainder at the end of the data to form the encoded data and send the same
  • At the receiver end
    • Check if there are errors introduced in transmission
    • Perform modulo-2 division again on the sent data with the CRC key and if the remainder is 0, then there is no error.

Now let’s implement the above:

def CRC_polynomial_to_bin_code(pol):
    return bin(eval(pol.replace('^', '**').replace('x','2')))[2:]

def get_remainder(data_bin, gen_bin):
    ng = len(gen_bin)
    data_bin += '0'*(ng-1)
    nd = len(data_bin)
    divisor = gen_bin
    i = 0
    remainder = ''
    print('\nmod 2 division steps:')
    print('divisor dividend remainder')
    while i < nd:
        j = i + ng - len(remainder)
        if j > nd: 
            remainder += data_bin[i:]
        dividend = remainder + data_bin[i:j]
        remainder = ''.join(['1' if dividend[k] != gen_bin[k] else '0' for k in range(ng)])
        print('{:8s} {:8s} {:8s}'.format(divisor, dividend, remainder[1:]))
        remainder = remainder.lstrip('0')
        i = j
    return remainder.zfill(ng-1)

gen_bin = CRC_polynomial_to_bin_code('x^3+x')
data_bin = CRC_polynomial_to_bin_code('x^11 + x^8 + x^7 + x^2 + x + 1') 
print('transmitter end:\n\nCRC key: {}, data: {}'.format(gen_bin, data_bin))
r = get_remainder(data_bin, gen_bin)
data_crc = data_bin + r
print('\nencoded data: {}'.format(data_crc))
print('\nreceiver end:')
r = get_remainder(data_crc, gen_bin)
print('\nremainder {}'.format(r))

if eval(r) == 0:
    print('data received at the receiver end has no errors')

# ---------------------------------
# transmitter end:
# CRC key: 1010, data: 100110000111
# mod 2 division steps:
# divisor dividend remainder
# 1010     1001     011     
# 1010     1110     100     
# 1010     1000     010     
# 1010     1000     010     
# 1010     1011     001     
# 1010     1100     110     
# 1010     1100     110     
# encoded data: 100110000111110
# ---------------------------------
# receiver end:
# mod 2 division steps:
# divisor dividend remainder
# 1010     1001     011     
# 1010     1110     100     
# 1010     1000     010     
# 1010     1000     010     
# 1010     1011     001     
# 1010     1111     101     
# 1010     1010     000     
# remainder 000
# data received at the receiver end has no errors
# ---------------------------------

Shortest Remaining Time Process Scheduling Algorithm

Let’s now try to implement a preemptive process scheduling algorithm, namely, shortest remaining time next (SRTN) scheduling algorithm for assigning a process to CPU (this one is from the operating systems). This is the preemptive version of Shortest Job First algorithm. At any point in time, the process with smallest amount of time remaining until completion is selected to execute. A process running on CPU is preempted by a new process iff the latter one has smaller execution time than the current one.

When a new process arrives that has execution (burst) time that is less as the remaining completion time of the currently executing process, a context witching will happen and the current process will be removed from the CPU and the newly arrived process will start its execution on CPU.

The Gantt chart is used to show the scheduling of the processes to the CPU.

We can implement the algorithm for preemptive shortest remaining time next scheduling using the following python function and simulate the execution of the processes on CPU, given the process arrival and burst times as a data frame.

import pandas as pd

def SRTN(df): # df is the data frame with arrival / burst time of processes

    queue = []
    cpu, cur_pdf = None, None
    alloc, dalloc = {}, {}

    time = 0

    while True: # simulate the CPU scheduling algorithm

        # check if all processes finished execution
        if df['RemainingTime'].max() == 0:

        # get current process assigned to cpu, if any
        if cpu:
            cur_pdf =  df[df.Process == cpu]    

        # check if a process arrived at this time instance and put it into wait queue
        pdf = df[df.ArrivalTime == time]

        if len(pdf) > 0:
            for p in pdf['Process'].values:

        if len(queue) > 0:
            pdf = df[df['Process'].isin(queue)]

            # find the process with shortest remaining time
            if len(pdf) > 0:
                pdf = pdf[pdf['RemainingTime']==pdf['RemainingTime'].min()]

            # allocate a process to CPU, pre-empt the running one if required
            if (cpu is None) or (len(pdf) > 0 and pdf['RemainingTime'].values[0] < cur_pdf['RemainingTime'].values[0]):
                if cpu:
                    # prempt the current process
                    dalloc[cpu] = dalloc.get(cpu, []) + [time]
                    print('Process {} deallocated from CPU at time {}'.format(cpu, time))
                cur_pdf = pdf
                cpu = cur_pdf['Process'].values[0]
                print('Process {} allocated to CPU at time {}'.format(cpu, time))
                alloc[cpu] = alloc.get(cpu, []) + [time]

        df.loc[df['Process']==cpu,'RemainingTime'] -= 1

        time += 1 # increment timer

        # deallocate process
        if df[df['Process']==cpu]['RemainingTime'].values[0] == 0:
            print('Process {} deallocated from CPU at time {}'.format(cpu, time))
            dalloc[cpu] = dalloc.get(cpu, []) + [time]
            cpu = cur_pdf = None
    return alloc, dalloc

Now, run SRTN on the following data (process arrival / burst times):

df = pd.DataFrame({'Process':['A','B','C','D'], 'BurstTime':[3,5,3,2], 'ArrivalTime':[0,2,5,6]})
df.sort_values('ArrivalTime', inplace=True)
df['RemainingTime'] = df.BurstTime

alloc, dalloc = SRTN(df)
# Process A allocated to CPU at time 0
# Process A deallocated from CPU at time 3
# Process B allocated to CPU at time 3
# Process B deallocated from CPU at time 8
# Process D allocated to CPU at time 8
# Process D deallocated from CPU at time 10
# Process C allocated to CPU at time 10
# Process C deallocated from CPU at time 13
# alloc
# {'A': [0], 'B': [3], 'D': [8], 'C': [10]}
# dalloc
# {'A': [3], 'B': [8], 'D': [10], 'C': [13]}

The following animation shows how the Gantt chart for the preemptive SRTN scheduling algorithm:

Let’s consider the following input table for the arrival of the following 3 processes:

alloc, dalloc, events = SRTN(df)
# Process A allocated to CPU at time 0
# Process A deallocated from CPU at time 1
# Process B allocated to CPU at time 1
# Process B deallocated from CPU at time 5
# Process A allocated to CPU at time 5
# Process A deallocated from CPU at time 11
# Process C allocated to CPU at time 11
# Process C deallocated from CPU at time 19

The Gantt chart corresponding to the above table is shown in the following animation:

Determine if a Relation is in BCNF

Let’s now focus on the following problem from DBMS: given a relation and set of functional dependencies whether the relation is in Boyce-Codd Normal form or not. Using the algorithm to compute the closure of a given set of attributes and the definition of BCNF as shown in the following figure, we can determine whether or not the given relation (with associated set of FDs) is in BCNF.

We can implement the above algorithm in python 

  • to compute closure of a given set of attributes and
  • then determine whether they form a superkey (if the closure yields in all the attributes in the relation) or not, as shown in the following code snippet:
def closure(s, fds):
    c = s
    for f in fds:
        l, r = f[0], f[1]
        if l.issubset(c):
            c = c.union(r)
    if s != c:
        c = closure(c, fds)
    return c

def is_superkey(s, rel, fds):
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return c == rel

Now check if for each given Functional Dependency A -> B from relation RA is a superkey or not, to determine whether R is in BCNF or not:

def is_in_BCNF(rel, fds):
        for fd in fds:
            l, r = fd[0], fd[1]
            isk = is_superkey(l, rel, fds)
            print(f'For the Functional Dependency {"".join(sorted(l))} -> {"".join(sorted(r))}, ' +\
                  f'{"".join(sorted(l))} {"is" if isk else "is not"} a superkey')
            if not isk:
                print('=> R not in BCNF!')
                return False
        print('=> R in BCNF!')   
        return True

To process the given FDs in standard form, to convert to suitable data structure, we can use the following function:

import re

def process_fds(fds):
    pfds = []
    for fd in fds:
        fd = re.sub('\s+', '', fd)
        l, r = fd.split('->')
        pfds.append([set(list(l)), set(list(r))])
    return pfds

Now, let’s test with a few relations (and associated FDs):

relation = {'U','V','W','X','Y','Z'}
fds = process_fds(['UVW->X', 'VW->YU', 'VWY->Z'])
is_in_BCNF(relation, fds)

# For the Functional Dependency UVW -> X, UVW is a superkey
# (VW)+ = UVWXYZ
# For the Functional Dependency VW -> UY, VW is a superkey
# For the Functional Dependency VWY -> Z, VWY is a superkey
# => R in BCNF!

relation = {'A','B','C'}
fds = process_fds(['A -> BC', 'B -> A'])
is_in_BCNF(relation, fds)

# (A)+ = ABC
# For the Functional Dependency A -> BC, A is a superkey
# (B)+ = ABC
# For the Functional Dependency B -> A, B is a superkey
# => R in BCNF!

relation = {'A','B','C', 'D'}
fds = process_fds(['AC -> D', 'D -> A', 'D -> C', 'D -> B'])
is_in_BCNF(relation, fds)

# (AC)+ = ABCD
# For the Functional Dependency AC -> D, AC is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> A, D is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> C, D is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> B, D is a superkey
# => R in BCNF!

relation = {'A','B','C', 'D', 'E'}
fds = process_fds(['BCD -> E', 'BDE -> C', 'BE -> D', 'BE -> A'])
is_in_BCNF(relation, fds)

# (BCD)+ = ABCDE
# For the Functional Dependency BCD -> E, BCD is a superkey
# (BDE)+ = ABCDE
# For the Functional Dependency BDE -> C, BDE is a superkey
# (BE)+ = ABCDE
# For the Functional Dependency BE -> D, BE is a superkey
# (BE)+ = ABCDE
# For the Functional Dependency BE -> A, BE is a superkey
# => R in BCNF!

relation = {'A','B','C','D','E'}
fds = process_fds(['BC->D', 'AC->BE', 'B->E'])
is_in_BCNF(relation, fds)

# (BC)+ = BCDE
# For the Functional Dependency BC -> D, BC is not a superkey
 #=> R not in BCNF!

Check if a Relation decomposition is lossless

Let’s now focus on another problem from DBMS: given a relation (along with FDs) and the decompositions, check if the decomposition is lossless or not.

As described here, decomposition of R into R1 and R2 is lossless if

  1. Attributes(R1) U Attributes(R2) = Attributes(R)
  2. Attributes(R1) ∩ Attributes(R2) ≠ Φ
  3. Common attribute must be a key for at least one relation (R1 or R2)

with the assumption that we are not considering the trivial cases where all tuples of R1 / R2 are unique, hence any decomposition will be lossless, since spurious tuples cant be created upon joining (so that (2) holds under this non-trivial assumption).

We can check the above condition with the following python code snippet:

def is_supekey(s, rel, fds):
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return rel.issubset(c) #c == rel

def is_lossless_decomp(r1, r2, r, fds):
    c = r1.intersection(r2)
    if r1.union(r2) != r:
        print('not lossless: R1 U R2 ≠ R!')
        return False
    if r1.union(r2) != r or len(c) == 0:
        print('not lossless: no common attribute in between R1 and R2!')
        return False
    if not is_supekey(c, r1, fds) and not is_supekey(c, r2, fds):
        print(f'not lossless: common attribute {"".join(c)} not a key in R1 or R2!')
        return False
    print('lossless decomposition!')
    return True     

Now let’s test with the above decompositions given:

r = {'A','B','C','D','E'}
fds = process_fds(['AB->E', 'C->AD', 'D->B', 'E->C'])

r1, r2 = {'A', 'C', 'D'}, {'B', 'C', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (C)+ = ACD
# lossless decomposition!

r1, r2 = {'A', 'C', 'D'}, {'A', 'B', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (A)+ = A
# not lossless: common attribute A not a key in R1 or R2!

r = {'A','B','C','D','E', 'G'}
fds = process_fds(['AB->C', 'AC->B', 'AD->E', 'B->D', 'BC->A', 'E->G'])
r1, r2 = {'A', 'D', 'G'}, {'A', 'B', 'C', 'D', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (AD)+ = ADEG
# lossless decomposition!

The RSA Encryption Algorithm

Let’s now focus on a conceptual implementation of the RSA algorithm for (asymmetric) encryption of a message. Let’s consider sending some message over a communication channel from a sender A to receiver B. The RSA algorithm can be used to encrypt the message (plain-text) to be sent into a cipher-text in such a way that it will be computationally hard for an eavesdropper to extract the plain-text from the cipher-text. For each of the users, the algorithm generates two keys, one being public and shared across all the users, the second one being private, which must be kept secret. The algorithm relies on the fact that the prime factorization is hard and in order to break the encryption, one needs to factorize a number into two large primes, no polynomial time algorithm for which exists, hence providing the security.

The RSA algorithm can be described as follows:

  1. Choose two different large primes (here for the purpose of demonstration let’s choose smaller primes p=89q=97)
  2. Compute n = p*q
  3. Compute Euler Totient φ(n) ≡ (p-1)*(q-1)
  4. Choose the public key e as coprime with φ(n), for simplicity, let’s choose e=257, which is a prime
  5. Compute the private key d, s.t. d*e ≡ 1 (mod φ(n)), using the multiplicative inverse algorithm (extended Euclidean) from here:

# Compute multiplicative inverse of a modulo n
# solution t to a*t ≡ 1 (mod n) 

def multiplicative_inverse(a, n):

    t, newt = 0, 1
    r, newr = n, a

    while newr != 0:
        #print(t, newt, r, newr)
        q = r // newr
        t, newt = newt, t - q * newt
        r, newr = newr, r - q * newr

    if t < 0:
        t = t + n

    return t

Python code for steps 1-5 are shown below:

p, q = 89, 97 # choose large primes here
n = p*q
φ = (p-1)*(q-1)
e = 257 # choose public key e as a Fermat's prime, s.t., gcd(φ, e) = 1
d = multiplicative_inverse(e, φ) # private key
# 4865
  1. Encrypt the message with the receiver’s public key (e) at sender’s end
  2. Decrypt the ciphertext received at the receiver end with his private key (d)

The following code shows how encryption / decryption can be done:

def rsa_encrypt(plain_text, e, n):
    # ideally we should convert the plain text to byte array and 
    # then to a big integer which should be encrypted, but here for the sake of 
    # simplicity character-by-character encryption is done, which will be slow in practice
    cipher_text = [ord(x)**e % n for x in plain_text]        
    return cipher_text

def rsa_decrypt(cipher_text, d, n):
    decoded_text = ''.join([chr(x**d % n) for x in cipher_text])
    return decoded_text 

Now, let’s use the above functions for encryption / decryption:

plain_text = 'Hello world'
cipher_text = rsa_encrypt(plain_text, e, n)
# [5527, 7221, 8242, 8242, 5243, 2381, 3611, 5243, 886, 8242, 1735]
decoded_text = rsa_decrypt(cipher_text, d, n)
# Hello world

Bernstein-Vazirani Quantum Algorithm to find a hidden bit string

Now let’s focus on the following problem: Given an oracle access to f : {0, 1}n → {0, 1} and a promise that the function f(x) ≡ s·x (mod 2), where s is a secret string, the algorithm will learn s just with a single query to the oracle.

Using a quantum computer, we can solve this problem with 100% confidence after only one call to the function f(x). The quantum Bernstein-Vazirani algorithm to find the hidden bit string is described as follows:

  1. Initialize the inputs qubits to the |0⟩⊗n state, and an auxiliary qubit to |−⟩.
  2. Apply Hadamard gates to the input register
  3. Query the oracle
  4. Apply Hadamard gates to the input register
  5. Measure

With qasm simulator with qiskit and the inner-product quantum oracle (parameterized by the secret bits and leveraging the phase-kickback using the auxiliary qubit at state |−⟩), the BV algorithm can be implemented as follows:

import numpy as np
from qiskit.circuit import QuantumCircuit
from qiskit import Aer, execute 
from qiskit.visualization import plot_histogram

def oracle(qc, s):
    n = len(s)
    for i in range(n):
        if s[i] == '1': # phase-kickback
  , n)
def BV_circuit(s): 
    s = s[::-1]
    n = len(s)

    # n-qubit quantum register + 1 auxiliary qubit with n classical registers
    qc = QuantumCircuit(n+1, n) 
    for i in range(n):
    oracle(qc, s)

    for i in range(n):
    qc.measure(list(range(n)), list(range(n)))
    return qc, execute(qc,
        Aer.get_backend('qasm_simulator'), shots=1024

Running with secret bits 1010, the following figure shows the Hadamard sandwich and the Oracle circuit

Finally, after measurement, it always outputs the secret bits, just 1 run is required as opposed to the classical algorithm requiring n runs to determine the secret bits.



Probabilistic Deep Learning with Tensorflow

In this blog, we shall discuss on how to implement probabilistic deep learning models using Tensorflow. The problems to be discussed in this blog appeared in the exercises / projects in the coursera course “Probabilistic Deep Learning“, by Imperial College, London, as a part of TensorFlow 2 for Deep Learning Specialization. The problem statements / descriptions are taken from the course itself.

Naive Bayes and logistic regression with Tensorflow Probability

In this problem, we shall develop a Naive Bayes classifier model to the Iris dataset using Distribution objects from TensorFlow Probability. We shall also explore the connection between the Naive Bayes classifier and logistic regression.

The Iris dataset

In this problem, we shall use the Iris dataset. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For a reference, see the following papers:

  • R. A. Fisher. “The use of multiple measurements in taxonomic problems”. Annals of Eugenics. 7 (2): 179–188, 1936.

Our goal will be to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features. Under certain assumptions about this classifier model, we shall explore the relation to logistic regression.

The following figures show the 3 categories of flowers from the Iris dataset, namely, setosa, versicolor and verginica, respectively.

Let’s start by importing the required libraries.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
%matplotlib inline

Load and prepare the data

We will first read in the Iris dataset, and split the dataset into training and test sets, using the following code snippet.

iris = datasets.load_iris()
# Use only the first two features: sepal length and width

data =[:, :2]
targets =

x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)

labels = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}
label_colours = ['blue', 'orange', 'green']

def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)

Naive Bayes classifier

We will briefly review the Naive Bayes classifier model. The fundamental equation for this classifier is Bayes’ rule:

In the above, dd is the number of features or dimensions in the inputs X (in our case d=2), and K is the number of classes (in our case K=3). The distribution P(Y) is the class prior distribution, which is a discrete distribution over K classes. The distribution P(X|Y) is the class-conditional distribution over inputs.

The Naive Bayes classifier makes the assumption that the data features Xi are conditionally independent give the class Y (the ‘naive’ assumption). In this case, the class-conditional distribution decomposes as follows:

This simplifying assumption means that we typically need to estimate far fewer parameters for each of the distributions P(Xi|Y=yk) instead of the full joint distribution P(X|Y=yk).

Once the class prior distribution and class-conditional densities are estimated, the Naive Bayes classifier model can then make a class prediction for a new data input according to

Define the class prior distribution

We will begin by defining the class prior distribution. To do this we will simply take the maximum likelihood estimate, given by

where the superscript (n) indicates the n-th dataset example, and N is the total number of examples in the dataset. The above is simply the proportion of data examples belonging to class k.

Let’s now define a function that builds the prior distribution from the training data, and returns it as a Categorical Distribution object.

  • The input to your function y will be a numpy array of shape (num_samples,)
  • The entries in y will be integer labels k=0,1,…,K−1
  • The function should build and return the prior distribution as a Categorical distribution object
    • The probabilities for this distribution will be a length-KK vector, with entries corresponding to P(Y=yk) for k=0,1,…,K−1
    • Your function should work for any value of K≥1
    • This Distribution will have an empty batch shape and empty event shape
def get_prior(y):
    return tfd.Categorical(probs=[np.mean(y == i) for i in range(3)]) 

prior = get_prior(y_train)

# Plot the prior distribution

labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica'][0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)

Define the class-conditional densities

Let’s now turn to the definition of the class-conditional distributions P(Xi|Y=yk) for i=0,1 and k=0,1,2. In our model, we will assume these distributions to be univariate Gaussian:

with mean parameters μik and standard deviation parameters σik, twelve parameters in all. We will again estimate these parameters using maximum likelihood. In this case, the estimates are given by

Note that the above are just the means and variances of the sample data points for each class.

Let’s now implement a function the computes the class-conditional Gaussian densities, using the maximum likelihood parameter estimates given above, and returns them in a single, batched MultivariateNormalDiag Distribution object.

  • The inputs to the function are
    • a numpy array x of shape (num_samples, num_features) for the data inputs
    • a numpy array y of shape (num_samples,) for the target labels
  • The function should work for any number of classes K≥1 and any number of features d≥1
def get_class_conditionals(x, y):
    mu = [np.mean(x[y == k], axis=0) for k in range(3)]
    sigma = [np.sqrt(np.mean((x[y == k]-mu[k])**2, axis=0)) for k in range(3)]
    return tfd.MultivariateNormalDiag(loc=mu, scale_diag=sigma) 

class_conditionals = get_class_conditionals(x_train, y_train)

We can visualise the class-conditional densities with contour plots by running the cell below. Notice how the contours of each distribution correspond to a Gaussian distribution with diagonal covariance matrix, since the model assumes that each feature is independent given the class.

def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    #print(Z.shape, batch_shape, 'x', *X0.shape)
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")

Make predictions from the model

Now the prior and class-conditional distributions are defined, you can use them to compute the model’s class probability predictions for an unknown test input, according to

The class prediction can then be taken as the class with the maximum probability:

Let’s now implement a function to return the model’s class probabilities for a given batch of test inputs of shape (batch_shape, 2), where the batch_shape has rank at least one.

  • The inputs to the function are the prior and class_conditionals distributions, and the inputs x
  • The function should use these distributions to compute the probabilities for each class k as above
    • As before, the function should work for any number of classes K≥1
  • It should then compute the prediction by taking the class with the highest probability
  • The predictions should be returned in a numpy array of shape (batch_shape)
def predict_class(prior, class_conditionals, x):
    x = x[:, np.newaxis, :]
    return tf.argmax(tf.cast(class_conditionals.prob(x),tf.float32)*tf.cast(prior.probs,tf.float32),axis=1).numpy()

predictions = predict_class(prior, class_conditionals, x_test)

# Evaluate the model accuracy on the test set
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
# Test accuracy: 0.8000

# Plot the model's decision regions

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior, class_conditionals, x), 
             3, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
plt.title("Training set with decision regions")

Binary classifier

We will now draw a connection between the Naive Bayes classifier and logistic regression.

First, we will update our model to be a binary classifier. In particular, the model will output the probability that a given input data sample belongs to the ‘Iris-Setosa’ class: P(Y=y0|X1,…,Xd). The remaining two classes will be pooled together with the label y1.

# Redefine the dataset to have binary labels

y_train_binary = np.array(y_train)
y_train_binary[np.where(y_train_binary == 2)] = 1

y_test_binary = np.array(y_test)
y_test_binary[np.where(y_test_binary == 2)] = 1

# Plot the training data

labels_binary = {0: 'Iris-Setosa', 1: 'Iris-Versicolour / Iris-Virginica'}
label_colours_binary = ['blue', 'red']

plt.figure(figsize=(8, 5))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)

We will also make an extra modelling assumption that for each class kk, the class-conditional distribution P(Xi|Y=yk) for each feature i=0,1, has standard deviation σi, which is the same for each class k.

This means there are now six parameters in total: four for the means μik and two for the standard deviations σiσi (i,k=0,1).

We will again use maximum likelihood to estimate these parameters. The prior distribution will be as before, with the class prior probabilities given by

We will use our previous function get_prior to redefine the prior distribution.

prior_binary = get_prior(y_train_binary)
# Plot the prior distribution[0, 1], prior_binary.probs.numpy()[:-1], color=label_colours_binary)
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1], labels_binary)

For the class-conditional densities, the maximum likelihood estimate for the means are again given by

However, the estimate for the standard deviations σi is updated. There is also a closed-form solution for the shared standard deviations, but we will instead learn these from the data.

Let’s now implement a function that takes the training inputs and target labels as input, as well as an optimizer object, number of epochs and a TensorFlow Variable. This function should be written according to the following spec:

  • The inputs to the function are:
    • a numpy array x of shape (num_samples, num_features) for the data inputs
    • a numpy array y of shape (num_samples,) for the target labels
    • tf.Variable object scales of length 2 for the standard deviations σiσi
    • optimiser: an optimiser object
    • epochs: the number of epochs to run the training for
  • The function should first compute the means μik of the class-conditional Gaussians according to the above equation
  • Then create a batched multivariate Gaussian distribution object using MultivariateNormalDiag with the means set to μik and the scales set to scales
  • Run a custom training loop for epochs number of epochs, in which:
    • the average per-example negative log likelihood for the whole dataset is computed as the loss
    • the gradient of the loss with respect to the scales variables is computed
    • the scales variables are updated by the optimiser object
  • At each iteration, save the values of the scales variable and the loss
  • The function should return a tuple of three objects:
    • a numpy array of shape (epochs,) of loss values
    • a numpy array of shape (epochs, 2) of values for the scales variable at each iteration
    • the final learned batched MultivariateNormalDiag distribution object
mu = [np.mean(x_train[y_train_binary == k], axis=0) for k in range(2)]

def learn_stdevs(x, y, scales, optimiser, epochs):
    def nll(x, y, distribution):
        predictions = - distribution.log_prob(x)
        return tf.reduce_sum(predictions[y==0][:,0]) + tf.reduce_sum(predictions[y==1][:,1])

    def get_loss_and_grads(x, y, distribution):
        with tf.GradientTape() as tape:
            loss = nll(x, y, distribution)
            grads = tape.gradient(loss, distribution.trainable_variables)
        return loss, grads

    shape = (len(set(y)), x.shape[-1])
    loc = np.zeros(shape, dtype=np.float32)

    for feature in range(shape[0]):
        for category in range(shape[-1]):
            data_point = x[y == category][:, feature]
            loc[category, feature] = np.mean(data_point)

    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scales) #b(2), e(2)
    x = np.expand_dims(x , 1).astype('float32')

    train_loss_results = []
    train_scale_results = []

    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(x, y, distribution)
        optimiser.apply_gradients(zip(grads, distribution.trainable_variables))
        scales = distribution.parameters['scale_diag'].numpy()
        if epoch % 10 == 0:
            print(f'epoch: {epoch}, loss: {loss}')
    return np.array(train_loss_results), np.array(train_scale_results), distribution

scales = tf.Variable([1., 1.])
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
epochs = 500
# run the training loop
nlls, scales_arr, class_conditionals_binary = learn_stdevs(x_train, y_train_binary, scales, opt, epochs)

epoch: 0, loss: 246.33450317382812
epoch: 10, loss: 227.07168579101562
epoch: 20, loss: 207.1158905029297
epoch: 30, loss: 187.12120056152344
epoch: 40, loss: 168.60015869140625
epoch: 50, loss: 153.5633087158203
epoch: 60, loss: 143.8475341796875
epoch: 70, loss: 142.80393981933594
epoch: 80, loss: 142.56259155273438
epoch: 90, loss: 142.23074340820312
epoch: 100, loss: 142.25711059570312
epoch: 110, loss: 142.18955993652344
epoch: 120, loss: 142.1979217529297
epoch: 130, loss: 142.18882751464844
epoch: 140, loss: 142.18991088867188
epoch: 150, loss: 142.1887664794922
epoch: 160, loss: 142.1888885498047
epoch: 170, loss: 142.18875122070312
epoch: 180, loss: 142.1887664794922
epoch: 190, loss: 142.1887664794922
epoch: 200, loss: 142.1887664794922
epoch: 210, loss: 142.18875122070312
epoch: 220, loss: 142.1887664794922
epoch: 230, loss: 142.18873596191406
epoch: 240, loss: 142.18878173828125
epoch: 250, loss: 142.18875122070312
epoch: 260, loss: 142.18875122070312
epoch: 270, loss: 142.18875122070312
epoch: 280, loss: 142.18875122070312
epoch: 290, loss: 142.18875122070312
epoch: 300, loss: 142.18878173828125
epoch: 310, loss: 142.18875122070312
epoch: 320, loss: 142.18875122070312
epoch: 330, loss: 142.18875122070312
epoch: 340, loss: 142.18875122070312
epoch: 350, loss: 142.1887664794922
epoch: 360, loss: 142.1887664794922
epoch: 370, loss: 142.1887664794922
epoch: 380, loss: 142.1887664794922
epoch: 390, loss: 142.1887664794922
epoch: 400, loss: 142.1887664794922
epoch: 410, loss: 142.1887664794922
epoch: 420, loss: 142.1887664794922
epoch: 430, loss: 142.1887664794922
epoch: 440, loss: 142.1887664794922
epoch: 450, loss: 142.1887664794922
epoch: 460, loss: 142.1887664794922
epoch: 470, loss: 142.1887664794922
epoch: 480, loss: 142.1887664794922
epoch: 490, loss: 142.1887664794922
print("Class conditional means:")
print("\nClass conditional standard deviations:")
Class conditional means:
[[4.9692307 3.3820512]
 [6.2172837 2.8814814]]

Class conditional standard deviations:
[[0.5590086  0.34253535]
 [0.5590086  0.34253535]]
# Plot the loss and convergence of the standard deviation parameters

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].set_title("Loss vs epoch")
ax[0].set_ylabel("Average negative log-likelihood")
for k in [0, 1]:
    ax[1].plot(scales_arr[:, k], color=label_colours_binary[k], label=labels_binary[k])
ax[1].set_title("Standard deviation ML estimates vs epoch")
ax[1].set_ylabel("Standard deviation")

We can also plot the contours of the class-conditional Gaussian distributions as before, this time with just binary labelled data. Notice the contours are the same for each class, just with a different centre location.

We can also plot the decision regions for this binary classifier model, notice that the decision boundary is now linear.

The following animation shows how we can learn the standard deviation parameters for the class-conditional distributions for Naive Bayes using tensorflow, with the above code snippet.

In fact, we can see that our predictive distribution P(Y=y0|X) can be written as follows:

With our additional modelling assumption of a shared covariance matrix Σ, it can be shown (using the Gaussian pdf) that a is in fact a linear function of X:

The model therefore takes the form P(Y=y0|X)=σ(wTX+w0), with weights w∈R2 and bias w0∈R. This is the form used by logistic regression, and explains why the decision boundary above is linear.

In the above we have outlined the derivation of the generative logistic regression model. The parameters are typically estimated with maximum likelihood, as we have done.

Finally, we will use the above equations to directly parameterize the output Bernoulli distribution of the generative logistic regression model.

Let’s now write the following function, according to the following specification:

  • The inputs to the function are:
    • the prior distribution prior over the two classes
    • the (batched) class-conditional distribution class_conditionals
  • The function should use the parameters of the above distributions to compute the weights and bias terms w and w0 as above
  • The function should then return a tuple of two numpy arrays for w and w0
def get_logistic_regression_params(prior, class_conditionals):    
    Sigma = class_conditionals.covariance().numpy()
    SI = np.linalg.inv(Sigma)
    p = prior.probs.numpy()
    mu = class_conditionals.parameters['loc'] #.numpy()
    w = SI @ (mu[0] - mu[1])
    w0 = -0.5*mu[0].T@SI@mu[0] + 0.5*mu[1].T@SI@mu[1] + np.log(p[0]/p[1])
    return w, w0

w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)

We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.

Probabilistic generative models

Let’s start with generative models, using normalizing flow networks and the variational autoencoder algorithm. We shall create a synthetic dataset with a normalizing flow with randomised parameters. This dataset will then be used to train a variational autoencoder, and the trained model will be used to interpolate between the generated images. The concepts to be used will be

  • Distribution objects
  • Probabilistic layers
  • Bijectors
  • ELBO optimization
  • KL divergence Regularizers.

The next figure represents the theory required for the implementation:

Let’s start by running importing the required libraries first. 

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tfpl = tfp.layers

import numpy as np
import matplotlib.pyplot as plt

We shall create our own image dataset from contour plots of a transformed distribution using a random normalizing flow network and then use the variational autoencoder algorithm to train generative and inference networks, and synthesise new images by interpolating in the latent space.

The normalising flow

  • To construct the image dataset, you will build a normalizing flow to transform the 2-D Gaussian random variable z = (z1, z2), which has mean 0 and covariance matrix Σ=σ^2.I2, with σ=0.3.
  • This normalizing flow uses bijectors that are parameterized by the following random variables:
    • θ ∼ U[0,2π)
    • a ∼ N(3,1)

The complete normalising flow is given by the following chain of transformations:

  • f1(z)=(z1,z2−2)
  • f2(z)=(z1,z2/2),
  • f3(z)=(z1,z2+a.z1^2),
  • f4(z)=R.z, where R is a rotation matrix with angle θ,
  • f5(z)=tanh(z), where the tanh function is applied elementwise.

The transformed random variable x is given by x=f5(f4(f3(f2(f1(z))))).

  • We need to use or construct bijectors for each of the transformations fi, i=1,…,5 and use tfb.Chain and tfb.TransformedDistribution to construct the final transformed distribution.
  • Ensure to implement the log_det_jacobian methods for any subclassed bijectors that you write.
  • Display a scatter plot of samples from the base distribution.
  • Display 4 scatter plot images of the transformed distribution from your random normalizing flow, using samples of θ and a. Fix the axes of these 4 plots to the range [−1,1].

The following code block shows how to implement the above steps:

def plot_distribution(samples, ax, title, col='red'):
    ax.scatter(samples[:, 0], samples[:, 1], marker='.', c=col, alpha=0.5) 
    ax.set_title(title, size=15)
# f3(𝑧)=(𝑧1,𝑧2+𝑎𝑧1^2) 
class Degree2Polynomial(tfb.Bijector):

    def __init__(self, a):
        self.a = a
        super(Degree2Polynomial, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)
    def _forward(self, x):
        return tf.concat([x[..., :1], x[..., 1:] + self.a * tf.square(x[..., :1])], axis=-1)
    def _inverse(self, y):
        return tf.concat([y[..., :1], y[..., 1:] - self.a * tf.square(y[..., :1])], axis=-1)
    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., dtype=x.dtype)

# f4(𝑧)=Rz
class Rotation(tfb.Bijector):

    def __init__(self, theta):
        self.R = tf.constant([[np.cos(theta), -np.sin(theta)], 
                             [np.sin(theta), np.cos(theta)]], dtype=tf.float32)
        super(Rotation, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)
    def _forward(self, x):
        return tf.linalg.matvec(self.R, x)
    def _inverse(self, y):
        return tf.linalg.matvec(tf.transpose(self.R), y)
    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., x.dtype)
def get_normalizing_flow_dist(a, theta):
    bijectors = [
                    tfb.Shift([0.,-2]),   # f1
                    tfb.Scale([1,1/2]),   # f2
                    Degree2Polynomial(a), # f3
                    Rotation(theta),      # f4
                    tfb.Tanh()            # f5
    flow_bijector = tfb.Chain(list(reversed(bijectors)))
    return tfd.TransformedDistribution(distribution=base_distribution,
nsamples= 10000
sigma = 0.3
base_distribution = tfd.MultivariateNormalDiag(loc=tf.zeros(2), scale_diag=sigma*tf.ones(2))
samples = base_distribution.sample(nsamples)
fig, ax = plt.subplots(figsize=(8,8))
plot_distribution(samples, ax, 'Base distribution', 'blue')
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15,15))
axes = axes.flatten()
plt.subplots_adjust(0, 0, 1, 0.925, 0.05, 0.05)
colors = ['red', 'green', 'orange', 'magenta']
for i in range(4):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    samples = transformed_distribution.sample(nsamples)
    plot_distribution(samples, axes[i], r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), colors[i])
plt.suptitle('Transformed Distribution with Normalizing Flow', size=20)

Create the image dataset

  • Let’s now use your random normalizing flow to generate an image dataset of contour plots from our random normalising flow network.
  • First, let’s display a sample of 4 contour plot images from our normalizing flow network using 4 independently sampled sets of parameters, using the following get_densities function: this function calculates density values for a (batched) Distribution for use in a contour plot.
  • The dataset should consist of at least 1000 images, stored in a numpy array of shape (N, 36, 36, 3). Each image in the dataset should correspond to a contour plot of a transformed distribution from a normalizing flow with an independently sampled set of parameters. It will take a few minutes to create the dataset.
  • As well as the get_densities function, the following get_image_array_from_density_values function will help to generate the dataset. This function creates a numpy array for an image of the contour plot for a given set of density values Z. Feel free to choose your own options for the contour plots.
  • Let’s display a sample of 20 images from your generated dataset in a figure.
X, Y = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
inputs = np.transpose(np.stack((X, Y)), [1, 2, 0])

def get_densities(transformed_distribution):
    batch_shape = transformed_distribution.batch_shape
    Z = transformed_distribution.prob(np.expand_dims(inputs, 2))
    Z = np.transpose(Z, list(range(2, 2+len(batch_shape))) + [0, 1])
    return Z
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

def get_image_array_from_density_values(Z):
    assert Z.shape == (100, 100)
    fig = Figure(figsize=(0.5, 0.5))
    canvas = FigureCanvas(fig)
    ax = fig.gca()
    ax.contourf(X, Y, Z, cmap='hot', levels=100)

    image_from_plot = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
    image_from_plot = image_from_plot.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    return image_from_plot
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
for i in range(4):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    transformed_distribution = tfd.BatchReshape(transformed_distribution, [1])
    Z = get_densities(transformed_distribution)
    image = get_image_array_from_density_values(Z.squeeze())
    plt.subplot(2,2,i+1), plt.imshow(image), plt.axis('off')
    plt.title(r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), size=10)
N = 1000
image_dataset = np.zeros((N, 36, 36, 3))
for i in range(N):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = tfd.BatchReshape(get_normalizing_flow_dist(a, theta), [1])
    image_dataset[i,...] = get_image_array_from_density_values(get_densities(transformed_distribution).squeeze())
image_dataset = tf.convert_to_tensor(image_dataset, dtype=tf.float32)
# TensorShape([1000, 36, 36, 3])

plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(N, 20)
for i in range(20):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(2,10,i+1), plt.imshow(image), plt.axis('off')

Create objects

  • Let’s now split your dataset to create objects for training and validation data.
  • Using the map method, ;et’s normalize the pixel values so that they lie between 0 and 1.
  • These Datasets will be used to train a variational autoencoder (VAE). Use the map method to return a tuple of input and output Tensors where the image is duplicated as both input and output.
  • Randomly shuffle the training Dataset.
  • Batch both datasets with a batch size of 20, setting drop_remainder=True.
  • Print the element_spec property for one of the Dataset objects.
n = len(image_dataset)
tf_image_dataset =
tf_image_dataset = tf_image_dataset.shuffle(3)
tf_image_dataset = x : x / tf.reduce_max(x))
tf_image_dataset = x: (x, x))
train_sz = int(0.8*n)
training = tf_image_dataset.take(train_sz)
validation = tf_image_dataset.skip(train_sz)
training = training.batch(batch_size=20, drop_remainder=True)
validation = validation.batch(batch_size=20, drop_remainder=True)
#(TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None),
# TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None))

Build the encoder and decoder networks

  • Let’s now create the encoder and decoder for the variational autoencoder algorithm.
  • Let’s design these networks, subject to the following constraints:
    • The encoder and decoder networks should be built using the Sequential class.
    • The encoder and decoder networks should use probabilistic layers where necessary to represent distributions.
    • The prior distribution should be a zero-mean, isotropic Gaussian (identity covariance matrix).
    • The encoder network should add the KL divergence loss to the model.
  • Print the model summary for the encoder and decoder networks.
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Dense, Flatten, Reshape, Concatenate, Conv2D, UpSampling2D, BatchNormalization)
latent_dim = 2 #50
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

def get_kl_regularizer(prior_distribution):
    return tfpl.KLDivergenceRegularizer(prior_distribution,
                                        test_points_fn=lambda q: q.sample(3),

kl_regularizer = get_kl_regularizer(prior)

def get_encoder(latent_dim, kl_regularizer):
    return Sequential([
            Conv2D(filters=32, kernel_size=3, activation='relu', strides=2, padding='same', input_shape=(36,36,3)),
            Conv2D(filters=64, kernel_size=3, activation='relu', strides=2, padding='same'),
            Conv2D(filters=128, kernel_size=3, activation='relu', strides=3, padding='same'),
            tfpl.MultivariateNormalTriL(latent_dim, activity_regularizer=kl_regularizer)
        ], name='encoder')      

def get_decoder(latent_dim):
    return Sequential([
        Dense(1152, activation='relu', input_shape=(latent_dim,)), 
        Conv2D(filters=64, kernel_size=3, activation='relu', padding='same'),
        Conv2D(filters=32, kernel_size=2, activation='relu', padding='same'),
        Conv2D(filters=128, kernel_size=2, activation='relu', padding='same'),
        Conv2D(filters=3, kernel_size=2, activation=None, padding='same'),
    ], name='decoder')    

encoder = get_encoder(latent_dim=2, kl_regularizer=kl_regularizer)

Model: "encoder"
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 18, 18, 32)        896       
batch_normalization (BatchNo (None, 18, 18, 32)        128       
conv2d_1 (Conv2D)            (None, 9, 9, 64)          18496     
batch_normalization_1 (Batch (None, 9, 9, 64)          256       
conv2d_2 (Conv2D)            (None, 3, 3, 128)         73856     
batch_normalization_2 (Batch (None, 3, 3, 128)         512       
flatten (Flatten)            (None, 1152)              0         
dense (Dense)                (None, 5)                 5765      
multivariate_normal_tri_l (M multiple                  0         
Total params: 99,909
Trainable params: 99,461
Non-trainable params: 448

decoder = get_decoder(latent_dim=2)

Model: "decoder"
Layer (type)                 Output Shape              Param #   
dense_1 (Dense)              (None, 1152)              3456      
reshape (Reshape)            (None, 3, 3, 128)         0         
up_sampling2d (UpSampling2D) (None, 9, 9, 128)         0         
conv2d_3 (Conv2D)            (None, 9, 9, 64)          73792     
up_sampling2d_1 (UpSampling2 (None, 18, 18, 64)        0         
conv2d_4 (Conv2D)            (None, 18, 18, 32)        8224      
up_sampling2d_2 (UpSampling2 (None, 36, 36, 32)        0         
conv2d_5 (Conv2D)            (None, 36, 36, 128)       16512     
conv2d_6 (Conv2D)            (None, 36, 36, 3)         1539      
flatten_1 (Flatten)          (None, 3888)              0         
independent_bernoulli (Indep multiple                  0         
Total params: 103,523
Trainable params: 103,523
Non-trainable params: 0
def reconstruction_loss(batch_of_images, decoding_dist):
    return -tf.reduce_mean(decoding_dist.log_prob(batch_of_images))

Train the variational autoencoder

  • Let’s now train the variational autoencoder. Build the VAE using the Model class and the encoder and decoder models. Print the model summary.
  • Compile the VAE with the negative log likelihood loss and train with the fit method, using the training and validation Datasets.
  • Plot the learning curves for loss vs epoch for both training and validation sets.
vae = Model(inputs=encoder.inputs, outputs=decoder(encoder.outputs))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
vae.compile(optimizer=optimizer, loss=reconstruction_loss)
history =, validation_data=validation, epochs=20)

Epoch 1/20
40/40 [==============================] - 34s 777ms/step - loss: 1250.2296 - val_loss: 1858.7103
Epoch 2/20
40/40 [==============================] - 29s 731ms/step - loss: 661.8687 - val_loss: 1605.1261
Epoch 3/20
40/40 [==============================] - 29s 720ms/step - loss: 545.2802 - val_loss: 1245.0518
Epoch 4/20
40/40 [==============================] - 28s 713ms/step - loss: 489.1101 - val_loss: 1024.5863
Epoch 5/20
40/40 [==============================] - 29s 718ms/step - loss: 453.3464 - val_loss: 841.4725
Epoch 6/20
40/40 [==============================] - 29s 733ms/step - loss: 438.8413 - val_loss: 742.0212
Epoch 7/20
40/40 [==============================] - 30s 751ms/step - loss: 433.2563 - val_loss: 657.4024
Epoch 8/20
40/40 [==============================] - 30s 751ms/step - loss: 417.5353 - val_loss: 602.7039
Epoch 9/20
40/40 [==============================] - 29s 726ms/step - loss: 409.8351 - val_loss: 545.5004
Epoch 10/20
40/40 [==============================] - 30s 741ms/step - loss: 406.3284 - val_loss: 507.9868
Epoch 11/20
40/40 [==============================] - 30s 741ms/step - loss: 402.9056 - val_loss: 462.0777
Epoch 12/20
40/40 [==============================] - 29s 733ms/step - loss: 397.4801 - val_loss: 444.4444
Epoch 13/20
40/40 [==============================] - 30s 741ms/step - loss: 398.2078 - val_loss: 423.1287
Epoch 14/20
40/40 [==============================] - 29s 723ms/step - loss: 395.5187 - val_loss: 411.3030
Epoch 15/20
40/40 [==============================] - 30s 739ms/step - loss: 397.3987 - val_loss: 407.5134
Epoch 16/20
40/40 [==============================] - 29s 721ms/step - loss: 399.3271 - val_loss: 402.7288
Epoch 17/20
40/40 [==============================] - 29s 736ms/step - loss: 393.4259 - val_loss: 401.4711
Epoch 18/20
40/40 [==============================] - 29s 726ms/step - loss: 390.5508 - val_loss: 399.1924
Epoch 19/20
40/40 [==============================] - 29s 736ms/step - loss: 389.3187 - val_loss: 401.1656
Epoch 20/20
40/40 [==============================] - 29s 728ms/step - loss: 389.4718 - val_loss: 393.5178
nepochs = 20
plt.plot(range(nepochs), history.history['loss'], label='train-loss')
plt.plot(range(nepochs), history.history['val_loss'], label='valid-loss')

Use the encoder and decoder networks

  • Let’s now put your encoder and decoder networks into practice!
  • Randomly sample 1000 images from the dataset, and pass them through the encoder. Display the embeddings in a scatter plot (project to 2 dimensions if the latent space has dimension higher than two).
  • Randomly sample 4 images from the dataset and for each image, display the original and reconstructed image from the VAE in a figure.
    • Use the mean of the output distribution to display the images.
  • Randomly sample 6 latent variable realisations from the prior distribution, and display the images in a figure.
    • Again use the mean of the output distribution to display the images.
def reconstruct(encoder, decoder, batch_of_images):
    approx_distribution = encoder(batch_of_images)
    decoding_dist = decoder(approx_distribution.mean())
    return decoding_dist.mean()

embedding = encoder(image_dataset / 255).mean()
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(embedding[:,0], embedding[:,1], c='red', s=50, edgecolor='k')
plt.title('Embedding', size=20)
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(len(image_dataset), 4)
for i in range(4):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(4,2,2*i+1), plt.imshow(image), plt.axis('off')
    reconstructions = reconstruct(encoder, decoder, np.expand_dims(image, axis=0))
    plt.subplot(4,2,2*i+2), plt.imshow(reconstructions[0].numpy()), plt.axis('off')
plt.suptitle('original (left column) vs. VAE-reconstructed (right column)', size=15)
nsample = 6
samples = np.random.uniform(-10, 10, (nsample, latent_dim)) #prior.sample(6)
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(samples[:,0], samples[:,1], color='blue')
for i in range(nsample):
    plt.text(samples[i,0] + 0.05, samples[i,1] + 0.05, 'embedding {}'.format(i), fontsize=15)
plt.title('Embeddings', size=20)
reconstructions = decoder(samples).mean()
#print(samples.shape, reconstructions.shape)
plt.subplots_adjust(0, 0, 1, 0.9, 0.05, 0.08)
indices = np.random.choice(len(image_dataset), 4)
for i in range(nsample):
    plt.subplot(2,3,i+1), plt.imshow(reconstructions[i]), plt.title('image {}'.format(i)), plt.axis('off')
plt.suptitle('VAE-reconstructions', size=20)

The following animation of latent space interpolation shows the decoder’s generations, depending on the latent space.

To be continued…

Machine learning with H2O in R / Python

In this blog, we shall discuss about how to use H2O to build a few supervised Machine learning models. H2O is a Java-based software for data modeling and general computing, with the primary purpose of it being a distributed, parallel, in memory processing engine. It needs to be installed first (instructions) and by default an H2O instance will run on localhost:54321. Additionally, one needs to install R/python clients to to communicate with the H2O instance. Every new R / python session first needs to initialize a connection between the python client and the H2O cluster.

The problems to be described in this blog appeared in the exercises / projects in the coursera course “Practical Machine Learning on H2O“, by H2O. The problem statements / descriptions / steps are taken from the course itself. We shall use the concepts from the course, in order to, e.g.,

  • to build a few machine learning / deep learning models using different algorithms (such as Gradient Boosting, Random Forest, Neural Net, Elastic Net GLM etc.),
  • to review the classic bias-variance tradeoff (overfitting)
  • for hyper-parameter tuning using Grid Search
  • to use AutoML to automatically find a bunch of good performing models
  • to use Stacked Ensembles of models to improve performance.

Problem 1

In this problem we shall create an artificial data set, then run random forest / GBM on it with H2O, to create two supervised models for classification, one that is reasonable and another one that shows clear over-fitting. We shall use R client (package) for H2O for this problem.

  1. Let’s first create a data set to predict an employee’s job satisfaction in an organization. Let’s say an employee’s job satisfaction depends on the following factors (there are several other factors in general, but we shall limit us to the following few ones):
    • work environment
    • pay
    • flexibility
    • relationship with manager
    • age

# Let's say an employee's job satisfaction depends on the work environment, pay, flexibility, relationship with manager and age.

N <- 1000                                         # number of samples
d <- data.frame(id = 1:N)
d$workEnvironment <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
v <- round(rnorm(N, mean=60000, sd=20000))        # 68% are 40-80k
v <- pmax(v, 20000)
v <- pmin(v, 100000) #table(v)
d$pay <- v
d$flexibility <- sample(1:5, N, replace=TRUE)     # on a scale of 1-5, 1 being bad and 5 being good
d$managerRel <- sample(1:5, N, replace=TRUE)      # on a scale of 1-5, 1 being bad and 5 being good
d$age <- round(runif(N, min=20, max=60))

#  id workEnvironment   pay flexibility managerRel age
#1  1               2 20000           2          2  21
#2  2               5 75817           1          2  31
#3  3               5 45649           5          3  25
#4  4               1 47157           1          5  55
#5  5               2 69729           2          4  33
#6  6               1 75101           2          2  39

v <- 125 * (d$pay/1000)^2 # e.g., job satisfaction score is proportional to square of pay (hypothetically)
v <- v + 250 / log(d$age) # e.g., inversely proportional to log of age
v <- v + 5 * d$flexibility
v <- v + 200 * d$workEnvironment
v <- v + 1000 * d$managerRel^3
v <- v + runif(N, 0, 5000)
v <- 100 * (v - 0) / (max(v) - min(v)) # min-max normalization to bring the score in 0-100
d$jobSatScore <- round(v) # Round to nearest integer (percentage)

2. Let’s start h2o, and import the data.

as.h2o(d, destination_frame = "jobsatisfaction")
jobsat <- h2o.getFrame("jobsatisfaction")

#  |===========================================================================================================| 100%
#  id workEnvironment   pay flexibility managerRel age jobSatScore
#1  1               2 20000           2          2  21           5
#2  2               5 75817           1          2  31          55
#3  3               5 45649           5          3  25          22
#4  4               1 47157           1          5  55          30
#5  5               2 69729           2          4  33          51
#6  6               1 75101           2          2  39          54

3. Let’s split the data. Here we plan to use cross-validation.

parts <- h2o.splitFrame(
  ratios = 0.8,
  destination_frames=c("jobsat_train", "jobsat_test"),
  seed = 321)
train <- h2o.getFrame("jobsat_train")
test <- h2o.getFrame("jobsat_test")   
# 794
# 206 rows

y <- "jobSatScore"
x <- setdiff(names(train), c("id", y))

4. Let’s choose the gradient boosting model (gbm), and create a model. It’s a regression model since the output variable is treated to be continuous.

# the reasonable model with 10-fold cross-validation
m_res <- h2o.gbm(x, y, train,
              model_id = "model10foldsreasonable",
              ntrees = 20,
              nfolds = 10,
              seed = 123)
> h2o.performance(m_res, train = TRUE) # RMSE 2.973807
#H2ORegressionMetrics: gbm
#** Reported on training data. **

#MSE:  8.069509
#RMSE:  2.840688
#MAE:  2.266134
#RMSLE:  0.1357181
#Mean Residual Deviance :  8.069509

> h2o.performance(m_res, xval = TRUE)  # RMSE 3.299601
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

#MSE:  8.84353
#RMSE:  2.973807
#MAE:  2.320899
#RMSLE:  0.1384746
#Mean Residual Deviance :  8.84353

> h2o.performance(m_res, test)         # RMSE 0.6476077
#H2ORegressionMetrics: gbm

#MSE:  10.88737
#RMSE:  3.299601
#MAE:  2.524492
#RMSLE:  0.1409274
#Mean Residual Deviance :  10.88737

5. Let’s try some alternative parameters, to build a different model, and show how the results differ.

# overfitting model with 10-fold cross-validation
m_ovf <- h2o.gbm(x, y, train,
              model_id = "model10foldsoverfitting",
              ntrees = 2000,
              max_depth = 20,
              nfolds = 10,
              seed = 123)

> h2o.performance(m_ovf, train = TRUE) # RMSE 0.004474786
#H2ORegressionMetrics: gbm
#** Reported on training data. **

#MSE:  2.002371e-05
#RMSE:  0.004474786
#MAE:  0.0007455944
#RMSLE:  5.032019e-05
#Mean Residual Deviance :  2.002371e-05

> h2o.performance(m_ovf, xval = TRUE)  # RMSE 0.6801615
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

#MSE:  0.4626197
#RMSE:  0.6801615
#MAE:  0.4820542
#RMSLE:  0.02323415
#Mean Residual Deviance :  0.4626197

> h2o.performance(m_ovf, test)         # RMSE 0.4969761
#H2ORegressionMetrics: gbm

#MSE:  0.2469853
#RMSE:  0.4969761
#MAE:  0.3749822
#RMSLE:  0.01698435
#Mean Residual Deviance :  0.2469853

Problem 2

Predict Chocolate Makers Location with Deep Learning Model with H2O

The data is available here:

This is a classification problem. We need to predict “Maker Location”. In other words, using the rating, and the other fields, how accurately we can identify if it is Belgian chocolate, French chocolate, and so on. We shall use python client (library) for H2O for this problem.

  1. Let’s start h2o, load the data set, and split it. By the end of this stage we should have
    three variables, pointing to three data frames on h2o: train, valid, test. However, if you are choosing to use
    cross-validation, you will only have two: train and test.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('')
# (1795, 9)

MakerOriginREFReview DateCocoa PercentMaker LocationRatingBean TypeBean Origin
0A. MorinAgua Grande1876201663%France3.75Sao Tome
1A. MorinKpime1676201570%France2.75Togo
2A. MorinAtsane1676201570%France3.00Togo
3A. MorinAkata1680201570%France3.50Togo
4A. MorinQuilla1704201570%France3.50Peru
print(df['Maker Location'].unique())

# ['France' 'U.S.A.' 'Fiji' 'Ecuador' 'Mexico' 'Switzerland' 'Netherlands'
# 'Spain' 'Peru' 'Canada' 'Italy' 'Brazil' 'U.K.' 'Australia' 'Wales'
# 'Belgium' 'Germany' 'Russia' 'Puerto Rico' 'Venezuela' 'Colombia' 'Japan'
# 'New Zealand' 'Costa Rica' 'South Korea' 'Amsterdam' 'Scotland'
# 'Martinique' 'Sao Tome' 'Argentina' 'Guatemala' 'South Africa' 'Bolivia'
# 'St. Lucia' 'Portugal' 'Singapore' 'Denmark' 'Vietnam' 'Grenada' 'Israel'
# 'India' 'Czech Republic' 'Domincan Republic' 'Finland' 'Madagascar'
# 'Philippines' 'Sweden' 'Poland' 'Austria' 'Honduras' 'Nicaragua'
# 'Lithuania' 'Niacragua' 'Chile' 'Ghana' 'Iceland' 'Eucador' 'Hungary'
# 'Suriname' 'Ireland']
print(len(df['Maker Location'].unique()))
# 60

loc_table = df['Maker Location'].value_counts()
#U.S.A.               764
#France               156
#Canada               125
#U.K.                  96
#Italy                 63
#Ecuador               54
#Australia             49
#Belgium               40
#Switzerland           38
#Germany               35
#Austria               26
#Spain                 25
#Colombia              23
#Hungary               22
#Venezuela             20
#Madagascar            17
#Japan                 17
#New Zealand           17
#Brazil                17
#Peru                  17
#Denmark               15
#Vietnam               11
#Scotland              10
#Guatemala             10
#Costa Rica             9
#Israel                 9
#Argentina              9
#Poland                 8
#Honduras               6
#Lithuania              6
#Sweden                 5
#Nicaragua              5
#Domincan Republic      5
#South Korea            5
#Netherlands            4
#Amsterdam              4
#Puerto Rico            4
#Fiji                   4
#Sao Tome               4
#Mexico                 4
#Ireland                4
#Portugal               3
#Singapore              3
#Iceland                3
#South Africa           3
#Grenada                3
#Chile                  2
#St. Lucia              2
#Bolivia                2
#Finland                2
#Martinique             1
#Eucador                1
#Wales                  1
#Czech Republic         1
#Suriname               1
#Ghana                  1
#India                  1
#Niacragua              1
#Philippines            1
#Russia                 1
#Name: Maker Location, dtype: int64


As can be seen from the above table, some of the locations have too few records, which will result in poor accuracy of the model to be learnt on after splitting the dataset into train, validation and test datasets. Let’s get rid of the locations that have small number of (< 40) examples in the dataset, to make the results more easily comprehendable, by reducing number of categories in the output variable.

## filter out the countries for which there is < 40 examples present in the dataset
loc_gt_40_recs = loc_table[loc_table >= 40].index.tolist()
df_sub = df[df['Maker Location'].isin(loc_gt_40_recs)]

# now connect to H2O
h2o.init() # h2o.clusterStatus()

H2O cluster uptime:1 day 14 hours 48 mins
H2O cluster version:
H2O cluster version age:4 years and 9 days !!!
H2O cluster name:H2O_started_from_R_Sandipan.Dey_kpl973
H2O cluster total nodes:1
H2O cluster free memory:2.530 Gb
H2O cluster total cores:4
H2O cluster allowed cores:4
H2O cluster status:locked, healthy
H2O connection url:http://localhost:54321
H2O connection proxy:None
H2O internal security:False
H2O API Extensions:Algos, AutoML, Core V3, Core V4
Python version:3.7.6 final
h2o_df = h2o.H2OFrame(df_sub.values, destination_frame = "cacao_882", 
                      column_names=[x.replace(' ', '_') for x in df.columns.tolist()])

df_cacao_882 = h2o.get_frame('cacao_882') # df_cacao_882.as_data_frame()

0A. MorinAgua Grande1876.02016.063%France3.75<0xA0>Sao Tome
1A. MorinKpime1676.02015.070%France2.75<0xA0>Togo
2A. MorinAtsane1676.02015.070%France3.0<0xA0>Togo
3A. MorinAkata1680.02015.070%France3.5<0xA0>Togo
4A. MorinQuilla1704.02015.070%France3.5<0xA0>Peru
5A. MorinCarenero1315.02014.070%France2.75CriolloVenezuela
6A. MorinCuba1315.02014.070%France3.5<0xA0>Cuba
7A. MorinSur del Lago1315.02014.070%France3.5CriolloVenezuela
8A. MorinPuerto Cabello1319.02014.070%France3.75CriolloVenezuela
9A. MorinPablino1319.02014.070%France4.0<0xA0>Peru
#Maker_Location	Count
#Australia	 49
#Belgium	 40
#Canada	        125
#Ecuador	 54
#France	        156
#Italy	         63
#U.K.	         96
#U.S.A.	        764

train, valid, test = df_cacao_882.split_frame(ratios = [0.8, 0.1], 
                                              destination_frames = ['train', 'valid', 'test'], 
                                              seed = 321)
print("%d/%d/%d" %(train.nrows, valid.nrows, test.nrows))
# 1082/138/127

2. Let’s set x to be the list of columns we shall use to train on, to be the column we shall learn. Here it’s going to be a multi-class classification problem.

ignore_fields = ['Review_Date', 'Bean_Type', 'Maker_Location']
# Specify the response and predictor columns
y = 'Maker_Location' # multinomial Classification
x = [i for i in train.names if not i in ignore_fields]

3. Let’s now create a baseline deep learning model. It is recommended to use all default settings (remembering to
specify either nfolds or validation_frame) for the baseline model.

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator() 

%time model.train(x = x, y = y, training_frame = train, validation_frame = valid)
# deeplearning Model Build progress: |██████████████████████████████████████| 100%
# Wall time: 6.44 s

# 0.05118279569892473
# 0.26888404593884047
perf_test = model.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
# Mean class error 0.2149184149184149
print('log loss', perf_test.logloss())
# log loss 0.48864148412056846
print('MSE', perf_test.mse())
# MSE 0.11940531127368789
print('RMSE', perf_test.rmse())
# RMSE 0.3455507361787671
Top-8 Hit Ratios: 

AustraliaBelgiumCanadaEcuadorFranceItalyU.K.U.S.A.ErrorRate / 5 / 3 / 13 / 3 / 11 / 10 / 11 / 71 / 127

4. Now, let’s create a tuned model, that gives superior performance. However we should use no more than 10 times
the running time of your baseline model, so again our script should be timing the model.

model_tuned = H2ODeepLearningEstimator(epochs=200, 

%time model_tuned.train(x, y, training_frame = train, validation_frame = valid)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 30.8 s

perf_test = model_tuned.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
#Mean class error 0.05909090909090909
print('log loss', perf_test.logloss())
#log loss 0.14153784501504524
print('MSE', perf_test.mse())
#MSE 0.03497231075826773
print('RMSE', perf_test.rmse())
#RMSE 0.18700885208531637

Top-8 Hit Ratios: 
AustraliaBelgiumCanadaEcuadorFranceItalyU.K.U.S.A.ErrorRate / 5 / 3 / 13 / 3 / 11 / 10 / 11 / 71 / 127

As can be seen from the above plot, the early-stopping strategy stopped the model to overfit and the model achieves better accruacy on the test dataset..

5. Let’s save both the models, to the local disk, using save_model(), to export the binary version of the model. (Do not export a POJO.)

h2o.save_model(model, 'base_model')
h2o.save_model(model_tuned, 'tuned_model')

We may want to include a seed in the model function above to get reproducible results.

Problem 3

Predict Price of a house with Stacked Ensemble model with H2O

The data is available at This is a regression problem. We have to predict the “price” of a house given different feature values. We shall use python client for H2O again for this problem.

The data needs to be split into train and test, using 0.9 for the ratio, and a seed of 123. That should give 19,462 training rows and 2,151 test rows. The target is an RMSE below $123,000.

  1. Let’s start h2o, load the chosen dataset and follow the data manipulation steps. For example, we can split date into year and month columns. We can then optionally combine them into a numeric date column. At the end of this step we shall have traintestx and y variables, and possibly valid also. The below shows the code snippet to do this.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from time import time


url = ""
house_df = h2o.import_file(url, destination_frame = "house_data")
# Parse progress: |█████████████████████████████████████████████████████████| 100%


house_df['year'] = house_df['date'].substring(0,4).asnumeric()
house_df['month'] = house_df['date'].substring(4,6).asnumeric()
house_df['day'] = house_df['date'].substring(6,8).asnumeric()
house_df = house_df.drop('date')

plt.hist(house_df.as_data_frame()['price'].tolist(), bins=np.linspace(0,10**6,1000))

We shall use cross-validation and not a validation dataset.

train, test = house_df.split_frame(ratios=[0.9], destination_frames = ['train', 'test'], seed=123)
print("%d/%d" %(train.nrows, test.nrows))
# 19462/2151
ignore_fields = ['id', 'price'] 
x = [i for i in train.names if not i in ignore_fields]
y = 'price'

2. Let’s now train at least four different models on the preprocessed datseet, using at least three different supervised algorithms. Let’s save all the models.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

nfolds = 5 # for cross-validation

Let’s first fit a GLM model. The best performing α hyperparameter value (for controlling L1 vs. L2 regularization) for GLM will be found using GridSearch, as shown in the below code snippet.

g= h2o.grid.H2OGridSearch(
    "alpha":[x * 0.01 for x in range(0,100)],
    "stopping_metric": "rmse",
g.train(x, y, train)

#glm Grid Build progress: |████████████████████████████████████████████████| 100%
#                     alpha  \
#0                   [0.61]   
#1                   [0.78]   
#2                   [0.65]   
#3                   [0.13]   
#4    [0.35000000000000003]   
#5                   [0.05]   
#6                   [0.32]   
#7                   [0.55]   

#                                              model_ids     residual_deviance  
#0  Grid_GLM_train_model_python_1628864392402_41_model_3  2.626981989511134E15  
#1  Grid_GLM_train_model_python_1628864392402_41_model_6  2.626981989511134E15  
#2  Grid_GLM_train_model_python_1628864392402_41_model_5  2.626981989511134E15  
#3  Grid_GLM_train_model_python_1628864392402_41_model_2  2.626981989511134E15  
#4  Grid_GLM_train_model_python_1628864392402_41_model_4  2.626981989511134E15  
#5  Grid_GLM_train_model_python_1628864392402_41_model_7  2.626981989511134E15  
#6  Grid_GLM_train_model_python_1628864392402_41_model_0  2.626981989511134E15  
#7  Grid_GLM_train_model_python_1628864392402_41_model_1  2.626981989511134E15  

Model 1

model_GLM = H2OGeneralizedLinearEstimator(
    family='gaussian', #'gamma',

%time model_GLM.train(x, y, train)
#glm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 259 ms

#ModelMetricsRegressionGLM: glm
#** Reported on test data. **

#MSE: 128806123545.59714
#RMSE: 358895.7000934911
#MAE: 233890.6933813204
#RMSLE: 0.5456714021880726
#R^2: 0.03102347771355851
#Mean Residual Deviance: 128806123545.59714
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2129
#Null deviance: 285935013037402.7
#Residual deviance: 277061971746579.44
#AIC: 61176.23965800522

As can be seen from above, GLM could not achieve the target of RMSE below $123k neither on cross-validation nor on test dataset.

The below models (GBMDRF and DL) and the corresponding parameters were found with AutoML leaderboard and 
GridSearch, along with some manual tuning.

from h2o.automl import H2OAutoML
model_auto = H2OAutoML(max_runtime_secs=60, seed=123)
model_auto.train(x, y, train)
# AutoML progress: |████████████████████████████████████████████████████████| 100%
# Parse progress: |█████████████████████████████████████████████████████████| 100%

# model_auto.leader.explain(test)

#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 17456681023.716145
#RMSE: 132123.73376390839
#MAE: 77000.00253466706
#RMSLE: 0.1899899418603569
#Mean Residual Deviance: 17456681023.716145

model = h2o.get_model(model_auto.leaderboard[4, 'model_id']) # get model by model_id
[(k, v) for (k, v) in model.params.items() if v['default'] != v['actual'] and \
                     not k in ['model_id', 'training_frame', 'validation_frame', 'nfolds',             
                               'keep_cross_validation_predictions', 'seed', 
                               'response_column', 'fold_assignment', 'ignored_columns']]

# GBM_grid_0_AutoML_20210813_201225_model_0
# 235011.60404473927
# [('score_tree_interval', {'default': 0, 'actual': 5}),
#  ('ntrees', {'default': 50, 'actual': 60}),
#  ('max_depth', {'default': 5, 'actual': 6}),
#  ('min_rows', {'default': 10.0, 'actual': 1.0}),
#  ('stopping_tolerance', {'default': 0.001, 'actual': 0.008577452408351779}),
#  ('seed', {'default': -1, 'actual': 123}),
#  ('distribution', {'default': 'AUTO', 'actual': 'gaussian'}),
#  ('sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate_per_tree', {'default': 1.0, 'actual': 0.8})]

Model 2

model_GBM = H2OGradientBoostingEstimator(

%time model_GBM.train(x, y, train)
#gbm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 54.9 s


As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 117288.305, which is below $123k.


#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

As can be seen from above, GBM could achieve the target of RMSE below $123k on test dataset.

Now, let’s try random forest model by finding best parameters with Grid Search:

g= h2o.grid.H2OGridSearch(
    "ntrees": [20, 25, 30],
    "stopping_tolerance": [0.005, 0.006, 0.0075],
    "max_depth": [20, 50, 100],
    "min_rows": [5, 7, 10]
    "stopping_metric": "rmse",
g.train(x, y, train)
#drf Grid Build progress: |████████████████████████████████████████████████| 100%
#    max_depth min_rows ntrees stopping_tolerance  \
#0         100      5.0     20              0.006   
#1         100      5.0     20              0.005   
#2         100      5.0     20              0.005   
#3         100      7.0     30              0.006   
#4          50     10.0     25              0.006   
#5          50     10.0     20              0.005   

#                                              model_ids      residual_deviance  
#0  Grid_DRF_train_model_python_1628864392402_40_model_0  2.0205038467456142E10  
#1  Grid_DRF_train_model_python_1628864392402_40_model_5  2.0205038467456142E10  
#2  Grid_DRF_train_model_python_1628864392402_40_model_1  2.0205038467456142E10  
#3  Grid_DRF_train_model_python_1628864392402_40_model_3   2.099520493338354E10  
#4  Grid_DRF_train_model_python_1628864392402_40_model_2   2.260686283035833E10  
#5  Grid_DRF_train_model_python_1628864392402_40_model_4   2.279037520277947E10  

Model 3

model_RF = H2ORandomForestEstimator(

%time model_RF.train(x, y, train)
#drf Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 13.2 s

ModelMetricsRegression: drf
** Reported on test data. **

MSE: 16405336914.530426
RMSE: 128083.3202041953
MAE: 71572.37981480274
RMSLE: 0.17712324625977907
Mean Residual Deviance: 16405336914.530426

As can be seen from above, DRF just missed the target of RMSE below $123k for on both the cross-validation and on test dataset.

Now, let’s try to fit a deep learning model, again tuning the parameters with Grid Search.

g= h2o.grid.H2OGridSearch(
    "epochs": [20, 25],
    "hidden": [[20, 20, 20], [25, 25, 25]],
    "stopping_rounds": [0, 5],
    "stopping_tolerance": [0.006]
    "stopping_metric": "rmse",
g.train(x, y, train)
#deeplearning Grid Build progress: |███████████████████████████████████████| 100%

#                 epochs        hidden stopping_rounds stopping_tolerance  \
#0     16.79120554889533  [25, 25, 25]               0              0.006   
#1    3.1976799968879086  [25, 25, 25]               0              0.006   

#                                                       model_ids  \
#0  Grid_DeepLearning_train_model_python_1628864392402_55_model_0   
#1  Grid_DeepLearning_train_model_python_1628864392402_55_model_1   

#       residual_deviance  
#0  1.6484562934855278E10  
#1  2.1652538389322113E10 

Model 4

model_DL = H2ODeepLearningEstimator(epochs=30, 
                                       hidden=[30, 30, 30],
%time model_DL.train(x, y, train)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 55.7 s


As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 121793.58, which is below $123k.


#ModelMetricsRegression: deeplearning
#** Reported on test data. **

#MSE: 14781990070.095192
#RMSE: 121581.20771770278
#MAE: 72522.60487846025
#RMSLE: 0.1834924698171073
#Mean Residual Deviance: 14781990070.095192

As can be seen from above, the deep learning model could achieve the target of RMSE below $123k on test dataset.

3. Finally, let’s train a stacked ensemble of the models created in earlier steps. We may need to repeat steps two and three until the best model (which is usually the ensemble model, but does not have to be) has the minimum required performance on the cross-validation dataset. Note: only one model has to achieve the minimum required performance. If multiple models achieve it, so we need to choose the best performing one.

models = [model_GBM.model_id, model_RF.model_id, model_DL.model_id] #model_GLM.model_id,
model_SE = H2OStackedEnsembleEstimator(model_id = 'se_gbm_dl_house', base_models=models)

%time model_SE.train(x, y, train)
#stackedensemble Model Build progress: |███████████████████████████████████| 100%
#Wall time: 2.67 s
#ModelMetricsRegressionGLM: stackedensemble
#** Reported on test data. **

#MSE: 130916347835.45828
#RMSE: 361823.6418967924
#MAE: 236448.3672215734
#RMSLE: 0.5514878971097109
#R^2: 0.015148783736682492
#Mean Residual Deviance: 130916347835.45828
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2147
#Null deviance: 285935013037402.7
#Residual deviance: 281601064194070.75
#AIC: 61175.193832813566

As can be seen from above, the stacked ensemble model could not reach the required performance, neither on the cross-validation, nor on the test dataset.

4. Now let’s get the performance on the test data of the chosen model/ensemble, and confirm that this also reaches the minimum target on the test data.

Best Model

The model that performs best in terms of mean cross-validation RMSE and RMSE on the test dataset (both of them are below the minimum target $123k) is the gradient boositng model (GBM), which is the Model 2 above.

#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

# save the models
h2o.save_model(model_GBM, 'best_model (GBM)') # the final best model
h2o.save_model(model_SE, 'SE_model')
h2o.save_model(model_GBM, 'GBM_model')
h2o.save_model(model_RF, 'RF_model')
h2o.save_model(model_GLM, 'GLM_model')
h2o.save_model(model_DL, 'DL_model')

Neural Translation – Machine Translation with Neural Nets (BiLSTM) with Keras / Python

In this blog, we shall discuss about how to build a neural network to translate from English to German. This problem appeared as the Capstone project for the coursera course “Tensorflow 2: Customising your model“, a part of the specialization “Tensorflow2 for Deep Learning“, by the Imperial College, London. The problem statement / description / steps are taken from the course itself. We shall use the concepts from the course, including building more flexible model architectures, freezing layers, data processing pipeline and sequence modelling.

Image taken from the Capstone project

Here we shall use a language dataset from to build a neural translation model. This dataset consists of over 200k pairs of sentences in English and German. In order to make the training quicker, we will restrict to our dataset to 20k pairs. The below figure shows a few sentence pairs taken from the file.

Our goal is to develop a neural translation model from English to German, making use of a pre-trained English word embedding module.

1. Text preprocessing

We need to start with preprocessing the above input file. Here are the steps that we need to follow:

  • First let’s create separate lists of English and German sentences.
  • Add a special “” and “” token to the beginning and end of every German sentence.
  • Use the Tokenizer class from the tf.keras.preprocessing.text module to tokenize the German sentences, ensuring that no character filters are applied.

The next figure shows 5 randomly chosen examples of (preprocessed) English and German sentence pairs. For the German sentence, the text (with start and end tokens) as well as the tokenized sequence are shown.

  • Pad the end of the tokenized German sequences with zeros, and batch the complete set of sequences into a single numpy array, using the following code snippet.

padded_tokenized_german_sentences = tf.keras.preprocessing.sequence.pad_sequences(tokenized_german_sentences, 
                                                                                  maxlen=14, padding='post', value=0) 
#(20000, 14)

As can be seen from the next code block, the maximum length of a German sentence is 14, whereas there are 5743 unique words in the German sentences from the subset of the corpus. The index of the <start> token is 1.

max([len(tokenized_german_sentences[i]) for i in range(20000)])
# 14
# 5743
# 1

2. Preparing the data with

Loading the embedding layer

As part of the dataset preproceessing for this project we shall use a pre-trained English word-embedding module from TensorFlow Hub. The URL for the module is

This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space.

Although this model can also be used as a sentence embedding module (e.g., where the module will process each token by removing punctuation and splitting on spaces and then averages the word embeddings over a sentence to give a single embedding vector), however, we will use it only as a word embedding module here, and will pass each word in the input sentence as a separate token.

The following code snippet shows how an English sentence with 7 words is mapped into a 7×128 tensor in the embedding space.

embedding_layer(tf.constant(["these", "aren't", "the", "droids", "you're", "looking", "for"])).shape
# TensorShape([7, 128])

Now, let’s prepare the training and validation Datasets as follows:

  • Create a random training and validation set split of the data, reserving e.g. 20% of the data for validation (each English dataset example is a single sentence string, and each German dataset example is a sequence of padded integer tokens).
  • Load the training and validation sets into a object, passing in a tuple of English and German data for both training and validation sets, using the following code snippet.

def make_Dataset(input_array, target_array):
    return, target_array)) 

train_data = make_Dataset(input_train, target_train)
valid_data = make_Dataset(input_valid, target_valid)

  • Create a function to map over the datasets that splits each English sentence at spaces. Apply this function to both Dataset objects using the map method, using the following code snippet.

def str_split(e, g):
    e = tf.strings.split(e)
    return e, g

train_data =
valid_data =

  • Create a function to map over the datasets that embeds each sequence of English words using the loaded embedding layer/model. Apply this function to both Dataset objects using the map method, using the following code snippet.

def embed_english(x, y):
    return embedding_layer(x), y

train_data =
valid_data =

  • Create a function to filter out dataset examples where the English sentence is more than 13 (embedded) tokens in length. Apply this function to both Dataset objects using the filter method, using the following code snippet.

def remove_long_sentence(e, g):
    return tf.shape(e)[0] <= 13

train_data = train_data.filter(remove_long_sentence)
valid_data = valid_data.filter(remove_long_sentence)

  • Create a function to map over the datasets that pads each English sequence of embeddings with some distinct padding value before the sequence, so that each sequence is length 13. Apply this function to both Dataset objects using the map method, as shown in the next code block. 

def pad_english(e, g):
    return tf.pad(e, paddings = [[13-tf.shape(e)[0],0], [0,0]], mode='CONSTANT', constant_values=0), g

train_data =
valid_data =

  • Batch both training and validation Datasets with a batch size of 16.

train_data = train_data.batch(16)
valid_data = valid_data.batch(16)

  • Let’s now print the element_spec property for the training and validation Datasets. Also, let’s print the shape of an English data example from the training Dataset and a German data example Tensor from the validation Dataset.

#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
# TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
 #TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

for e, g in train_data.take(1):
#(16, 13, 128)

for e, g in valid_data.take(1):
#[[   1   11  152    6  458    3    2    0    0    0    0    0    0    0]
# [   1   11  333  429    3    2    0    0    0    0    0    0    0    0]
# [   1   11   59   12    3    2    0    0    0    0    0    0    0    0]
# [   1  990   25   42  444    7    2    0    0    0    0    0    0    0]
# [   1    4   85 1365    3    2    0    0    0    0    0    0    0    0]
# [   1  131    8   22    5  583    3    2    0    0    0    0    0    0]
# [   1    4   85 1401    3    2    0    0    0    0    0    0    0    0]
# [   1   17  381   80    3    2    0    0    0    0    0    0    0    0]
# [   1 2998   13   33    7    2    0    0    0    0    0    0    0    0]
# [   1  242    6  479    3    2    0    0    0    0    0    0    0    0]
# [   1   35   17   40    7    2    0    0    0    0    0    0    0    0]
# [   1   11   30  305   46   47 1913  471    3    2    0    0    0    0]
# [   1    5   48 1184    3    2    0    0    0    0    0    0    0    0]
# [   1    5  287   12  834 5268    3    2    0    0    0    0    0    0]
# [   1    5    6  523    3    2    0    0    0    0    0    0    0    0]
# [   1   13  109   28   29   44  491    3    2    0    0    0    0    0]], shape=(16, 14), dtype=int32)

The custom translation model

The following is a schematic of the custom translation model architecture we shall develop now.

Image taken from the Capstone project

The custom model consists of an encoder RNN and a decoder RNN. The encoder takes words of an English sentence as input, and uses a pre-trained word embedding to embed the words into a 128-dimensional space. To indicate the end of the input sentence, a special end token (in the same 128-dimensional space) is passed in as an input. This token is a TensorFlow Variable that is learned in the training phase (unlike the pre-trained word embedding, which is frozen).

The decoder RNN takes the internal state of the encoder network as its initial state. A start token is passed in as the first input, which is embedded using a learned German word embedding. The decoder RNN then makes a prediction for the next German word, which during inference is then passed in as the following input, and this process is repeated until the special <end> token is emitted from the decoder.

Create the custom layer

Let’s create a custom layer to add the learned end token embedding to the encoder model:

Image taken from the capstone project

Now let’s first build the custom layer, which will be later used to create the encoder.

  • Using layer subclassing, create a custom layer that takes a batch of English data examples from one of the Datasets, and adds a learned embedded ‘end’ token to the end of each sequence.
  • This layer should create a TensorFlow Variable (that will be learned during training) that is 128-dimensional (the size of the embedding space).

from tensorflow.keras.models import  Sequential, Model
from tensorflow.keras.layers import Layer, Concatenate, Input, Masking, LSTM, Embedding, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)
        self.embed = tf.Variable(initial_value=tf.zeros(shape=(1,128)), trainable=True, dtype='float32')
    def call(self, inputs):
        x = tf.tile(self.embed, [tf.shape(inputs)[0], 1])
        x = tf.expand_dims(x, axis=1)
        return tf.concat([inputs, x], axis=1)
        <em>#return Concatenate(axis=1)([inputs, x])</em>

  • Let’s extract a batch of English data examples from the training Dataset and print the shape. Test the custom layer by calling the layer on the English data batch Tensor and print the resulting Tensor shape (the layer should increase the sequence length by one).

custom_layer = CustomLayer()
e, g = next(iter(train_data.take(1)))
# (16, 13, 128)
o = custom_layer(e)
# TensorShape([16, 14, 128])

Build the encoder network

The encoder network follows the schematic diagram above. Now let’s build the RNN encoder model.

  • Using the keras functional API, build the encoder network according to the following spec:
    • The model will take a batch of sequences of embedded English words as input, as given by the Dataset objects.
    • The next layer in the encoder will be the custom layer you created previously, to add a learned end token embedding to the end of the English sequence.
    • This is followed by a Masking layer, with the mask_value set to the distinct padding value you used when you padded the English sequences with the Dataset preprocessing above.
    • The final layer is an LSTM layer with 512 units, which also returns the hidden and cell states.
    • The encoder is a multi-output model. There should be two output Tensors of this model: the hidden state and cell states of the LSTM layer. The output of the LSTM layer is unused.

inputs = Input(batch_shape = (<strong>None</strong>, 13, 128), name='input')
x = CustomLayer(name='custom_layer')(inputs)
x = Masking(mask_value=0, name='masking_layer')(x)
x, h, c = LSTM(units=512, return_state=<strong>True</strong>, name='lstm')(x)
encoder_model = Model(inputs = inputs, outputs = [h, c], name='encoder')


# Model: "encoder"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input (InputLayer)           [(None, 13, 128)]         0         
# _________________________________________________________________
# custom_layer (CustomLayer)   (None, 14, 128)           128       
# _________________________________________________________________
# masking_layer (Masking)      (None, 14, 128)           0         
# _________________________________________________________________
# lstm (LSTM)                  [(None, 512), (None, 512) 1312768   
# =================================================================
# Total params: 1,312,896
# Trainable params: 1,312,896
# Non-trainable params: 0
# _________________________________________________________________

Build the decoder network

The decoder network follows the schematic diagram below.

image taken from the capstone project

Now let’s build the RNN decoder model.

  • Using Model subclassing, build the decoder network according to the following spec:
    • The initializer should create the following layers:
      • An Embedding layer with vocabulary size set to the number of unique German tokens, embedding dimension 128, and set to mask zero values in the input.
      • An LSTM layer with 512 units, that returns its hidden and cell states, and also returns sequences.
      • A Dense layer with number of units equal to the number of unique German tokens, and no activation function.
    • The call method should include the usual inputs argument, as well as the additional keyword arguments hidden_state and cell_state. The default value for these keyword arguments should be None.
    • The call method should pass the inputs through the Embedding layer, and then through the LSTM layer. If the hidden_state and cell_state arguments are provided, these should be used for the initial state of the LSTM layer. 
    • The call method should pass the LSTM output sequence through the Dense layer, and return the resulting Tensor, along with the hidden and cell states of the LSTM layer.

class Decoder(Model):
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embed = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=128, mask_zero=True, name='embedding_layer')
        self.lstm = LSTM(units = 512, return_state = True, return_sequences = True, name='lstm_layer')
        self.dense = Dense(len(tokenizer.index_word)+1, name='dense_layer')
    def call(self, inputs, hidden_state = None, cell_state = None):
        x = self.embed(inputs)
        x, hidden_state, cell_state = self.lstm(x, initial_state = [hidden_state, cell_state]) \
                                                     if hidden_state is not None and cell_state is not None else self.lstm(x)
        x = self.dense(x)
        return x, hidden_state, cell_state

decoder_model = Decoder(name='decoder')
e, g_in = next(iter(train_data.take(1)))
h, c = encoder_model(e)
g_out, h, c = decoder_model(g_in, h, c)

print(g_out.shape, h.shape, c.shape)
# (16, 14, 5744) (16, 512) (16, 512)


#Model: "decoder"
#Layer (type)                 Output Shape              Param #   
#embedding_layer (Embedding)  multiple                  735232    
#lstm_layer (LSTM)            multiple                  1312768   
#dense_layer (Dense)          multiple                  2946672   
#Total params: 4,994,672
#Trainable params: 4,994,672
#Non-trainable params: 0

Create a custom training loop

custom training loop to train your custom neural translation model.

  • Define a function that takes a Tensor batch of German data (as extracted from the training Dataset), and returns a tuple containing German inputs and outputs for the decoder model (refer to schematic diagram above).
  • Define a function that computes the forward and backward pass for your translation model. This function should take an English input, German input and German output as arguments, and should do the following:
    • Pass the English input into the encoder, to get the hidden and cell states of the encoder LSTM.
    • These hidden and cell states are then passed into the decoder, along with the German inputs, which returns a sequence of outputs (the hidden and cell state outputs of the decoder LSTM are unused in this function).
    • The loss should then be computed between the decoder outputs and the German output function argument.
    • The function returns the loss and gradients with respect to the encoder and decoder’s trainable variables.
    • Decorate the function with @tf.function
  • Define and run a custom training loop for a number of epochs (for you to choose) that does the following:
    • Iterates through the training dataset, and creates decoder inputs and outputs from the German sequences.
    • Updates the parameters of the translation model using the gradients of the function above and an optimizer object.
    • Every epoch, compute the validation loss on a number of batches from the validation and save the epoch training and validation losses.
  • Plot the learning curves for loss vs epoch for both training and validation sets.

def forward_backward(encoder_model, decoder_model, e, g_in, g_out, loss):
    with tf.GradientTape() as tape:
        h, c = encoder_model(e)
        d_g_out, _, _ = decoder_model(g_in, h, c)
        cur_loss = loss(g_out, d_g_out)
        grads = tape.gradient(cur_loss, encoder_model.trainable_variables + decoder_model.trainable_variables)
    return cur_loss, grads

def train_encoder_decoder(encoder_model, decoder_model, num_epochs, train_data, valid_data, valid_steps, 
                          optimizer, loss, grad_fn):
    train_losses = []
    val_loasses = []
    for epoch in range(num_epochs):
        train_epoch_loss_avg = tf.keras.metrics.Mean()
        val_epoch_loss_avg = tf.keras.metrics.Mean()
        for e, g in train_data:
            g_in, g_out = get_german_decoder_data(g)
            train_loss, grads = grad_fn(encoder_model, decoder_model, e, g_in, g_out, loss)
            optimizer.apply_gradients(zip(grads, encoder_model.trainable_variables + decoder_model.trainable_variables))
        for e_v, g_v in valid_data.take(valid_steps):
            g_v_in, g_v_out = get_german_decoder_data(g_v)
            val_loss, _ = grad_fn(encoder_model, decoder_model, e_v, g_v_in, g_v_out, loss)
        print(f'epoch: {epoch}, train loss: {train_epoch_loss_avg.result()}, validation loss: {val_epoch_loss_avg.result()}')    
    return train_losses, val_loasses

optimizer_obj = Adam(learning_rate = 1e-3)
loss_obj = SparseCategoricalCrossentropy(from_logits=True)
train_loss_results, valid_loss_results = train_encoder_decoder(encoder_model, decoder_model, 20, train_data, valid_data, 20,
                                                          optimizer_obj, loss_obj, forward_backward)

#epoch: 0, train loss: 4.4570465087890625, validation loss: 4.1102800369262695
#epoch: 1, train loss: 3.540217399597168, validation loss: 3.36271333694458
#epoch: 2, train loss: 2.756622076034546, validation loss: 2.7144060134887695
#epoch: 3, train loss: 2.049957275390625, validation loss: 2.1480133533477783
#epoch: 4, train loss: 1.4586931467056274, validation loss: 1.7304519414901733
#epoch: 5, train loss: 1.0423369407653809, validation loss: 1.4607685804367065
#epoch: 6, train loss: 0.7781839370727539, validation loss: 1.314332127571106
#epoch: 7, train loss: 0.6160411238670349, validation loss: 1.2391613721847534
#epoch: 8, train loss: 0.5013922452926636, validation loss: 1.1840368509292603
#epoch: 9, train loss: 0.424654096364975, validation loss: 1.1716119050979614
#epoch: 10, train loss: 0.37027251720428467, validation loss: 1.1612160205841064
#epoch: 11, train loss: 0.3173922598361969, validation loss: 1.1330692768096924
#epoch: 12, train loss: 0.2803193926811218, validation loss: 1.1394184827804565
#epoch: 13, train loss: 0.24854864180088043, validation loss: 1.1354353427886963
#epoch: 14, train loss: 0.22135266661643982, validation loss: 1.1059410572052002
#epoch: 15, train loss: 0.2019050121307373, validation loss: 1.1111358404159546
#epoch: 16, train loss: 0.1840481162071228, validation loss: 1.1081823110580444
#epoch: 17, train loss: 0.17126116156578064, validation loss: 1.125329852104187
#epoch: 18, train loss: 0.15828527510166168, validation loss: 1.0979799032211304
#epoch: 19, train loss: 0.14451280236244202, validation loss: 1.0899451971054077

import matplotlib.pyplot as plt
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Loss", fontsize=14)
plt.title('Loss vs epochs')
plt.plot(train_loss_results, label='train')
plt.plot(valid_loss_results, label='valid')

The following figure shows how the training and validation loss decrease with epochs (the model is trained for 20 epochs).

Use the model to translate

Now it’s time to put the model into practice! Let’s run the translation for five randomly sampled English sentences from the dataset. For each sentence, the process is as follows:

  • Preprocess and embed the English sentence according to the model requirements.
  • Pass the embedded sentence through the encoder to get the encoder hidden and cell states.
  • Starting with the special "<start>" token, use this token and the final encoder hidden and cell states to get the one-step prediction from the decoder, as well as the decoder’s updated hidden and cell states.
  • Create a loop to get the next step prediction and updated hidden and cell states from the decoder, using the most recent hidden and cell states. Terminate the loop when the "<end>" token is emitted, or when the sentence has reached a maximum length.
  • Decode the output token sequence into German text and print the English text and the model’s German translation.

indices = np.random.choice(len(english_sentences), 5)
test_data =[english_sentences[i] for i in indices]))
test_data =
test_data =
test_data = test_data.filter(lambda x: tf.shape(x)[0] <= 13)
test_data = x: tf.pad(x, paddings = [[13-tf.shape(x)[0],0], [0,0]], mode='CONSTANT', constant_values=0))
# TensorSpec(shape=(None, 128), dtype=tf.float32, name=None)

start_token = np.array(tokenizer.texts_to_sequences(['']))
end_token = np.array(tokenizer.texts_to_sequences(['']))
for e, i in zip(test_data.take(n), indices):
    h, c = encoder_model(tf.expand_dims(e, axis=0))
    g_t = []
    g_in = start_token
    g_out, h, c = decoder_model(g_in, h, c)
    g_out = tf.argmax(g_out, axis=2)
    while g_out != end_token: 
        g_out, h, c = decoder_model(g_in, h, c)
        g_out = tf.argmax(g_out, axis=2)
        g_in = g_out
        g_t.append(tokenizer.index_word.get(tf.squeeze(g_out).numpy(), 'UNK'))
    print(f'English Text: {english_sentences[i]}')
    print(f'German Translation: {" ".join(g_t)}')

# English Text: i'll see tom .
# German Translation:  ich werde tom folgen . 

# English Text: you're not alone .
# German Translation:  keine nicht allein . 

# English Text: what a hypocrite !
# German Translation:  fuer ein idiot ! 

# English Text: he kept talking .
# German Translation:  sie hat ihn erwuergt . 

# English Text: tom's in charge .
# German Translation:  tom ist im bett . 

The above output shows the sample English sentences and their German translations predicted by the model.

The following animation (click and open on a new tab) shows how the predicted German translation improves (with the decrease in loss) for a few sample English sentences as the deep learning model is trained for more and more epochs.

As can be seen from the above animation, translation gets better as the deep learning model is trained for more and more epochs. The following is the youtube version for the same.

NLP with Bangla: semantic similarity with word2vec, Deep learning (RNN) to generate Bangla song-like texts and to do sentiment analysis on astrological prediction dataset, creating a simple Bangla ChatBot using RASA NLU with Python

In this blog, we shall discuss on a few NLP techniques with Bangla language. We shall start with a demonstration on how to train a word2vec model with Bangla wiki corpus with tensorflow and how to visualize the semantic similarity between words using t-SNE. Next, we shall demonstrate how to train a character / word LSTM on selected Tagore’s songs to generate songs like Tagore with keras. Next, we shall create sentiment analysis dataset by crawling the daily astrological prediction pages of a leading Bangla newspaper and manually labeling the sentiment of each of the predictions corresponding to each moon-sign. We shall train an LSTM sentiment a analysis model to predict the sentiment of a moon-sign prediction. Finally, we shall use RASA NLU (natural language understanding) to build a very simple chatbot in Bangla.

Word2vec model with Bangla wiki corpus with tensorflow

  • Let’s start by importing the required libraries
import collections
import numpy as np
import tensorflow as tf
from matplotlib import pylab
  • Download the Bangla wikipedia corpus from Kaggle. The first few lines from the corpus are shown below:



“রবীন্দ্রনাথ ঠাকুর”

রবীন্দ্রনাথ ঠাকুর (৭ই মে, ১৮৬১ – ৭ই আগস্ট, ১৯৪১) (২৫ বৈশাখ, ১২৬৮ – ২২ শ্রাবণ, ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। তাঁকে বাংলা ভাষার সর্বশ্রেষ্ঠ সাহিত্যিক মনে করা হয়। রবীন্দ্রনাথকে গুরুদেব, কবিগুরু ও বিশ্বকবি অভিধায় ভূষিত করা হয়। রবীন্দ্রনাথের ৫২টি কাব্যগ্রন্থ, ৩৮টি নাটক, ১৩টি উপন্যাস ও ৩৬টি প্রবন্ধ ও অন্যান্য গদ্যসংকলন তাঁর জীবদ্দশায় বা মৃত্যুর অব্যবহিত পরে প্রকাশিত হয়। তাঁর সর্বমোট ৯৫টি ছোটগল্প ও ১৯১৫টি গান যথাক্রমে “”গল্পগুচ্ছ”” ও “”গীতবিতান”” সংকলনের অন্তর্ভুক্ত হয়েছে। রবীন্দ্রনাথের যাবতীয় প্রকাশিত ও গ্রন্থাকারে অপ্রকাশিত রচনা ৩২ খণ্ডে “”রবীন্দ্র রচনাবলী”” নামে প্রকাশিত হয়েছে। রবীন্দ্রনাথের যাবতীয় পত্রসাহিত্য উনিশ খণ্ডে “”চিঠিপত্র”” ও চারটি পৃথক গ্রন্থে প্রকাশিত। এছাড়া তিনি প্রায় দুই হাজার ছবি এঁকেছিলেন। রবীন্দ্রনাথের রচনা বিশ্বের বিভিন্ন ভাষায় অনূদিত হয়েছে। ১৯১৩ সালে “”গীতাঞ্জলি”” কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি সাহিত্যে নোবেল পুরস্কার লাভ করেন।রবীন্দ্রনাথ ঠাকুর কলকাতার এক ধনাঢ্য ও সংস্কৃতিবান ব্রাহ্ম পিরালী ব্রাহ্মণ পরিবারে জন্মগ্রহণ করেন। বাল্যকালে প্রথাগত বিদ্যালয়-শিক্ষা তিনি গ্রহণ করেননি; গৃহশিক্ষক রেখে বাড়িতেই তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। আট বছর বয়সে তিনি কবিতা লেখা শুরু করেন। ১৮৭৪ সালে “”তত্ত্ববোধিনী পত্রিকা””-এ তাঁর “””” কবিতাটি প্রকাশিত হয়। এটিই ছিল তাঁর প্রথম প্রকাশিত রচনা। ১৮৭৮ সালে মাত্র সতেরো বছর বয়সে রবীন্দ্রনাথ প্রথমবার ইংল্যান্ডে যান। ১৮৮৩ সালে মৃণালিনী দেবীর সঙ্গে তাঁর বিবাহ হয়। ১৮৯০ সাল থেকে রবীন্দ্রনাথ পূর্ববঙ্গের শিলাইদহের জমিদারি এস্টেটে বসবাস শুরু করেন। ১৯০১ সালে তিনি পশ্চিমবঙ্গের শান্তিনিকেতনে ব্রহ্মচর্যাশ্রম প্রতিষ্ঠা করেন এবং সেখানেই পাকাপাকিভাবে বসবাস শুরু করেন। ১৯০২ সালে তাঁর পত্নীবিয়োগ হয়। ১৯০৫ সালে তিনি বঙ্গভঙ্গ-বিরোধী আন্দোলনে জড়িয়ে পড়েন। ১৯১৫ সালে ব্রিটিশ সরকার তাঁকে নাইট উপাধিতে ভূষিত করেন। কিন্তু ১৯১৯ সালে জালিয়ানওয়ালাবাগ হত্যাকাণ্ডের প্রতিবাদে তিনি সেই উপাধি ত্যাগ করেন। ১৯২১ সালে গ্রামোন্নয়নের জন্য তিনি শ্রীনিকেতন নামে একটি সংস্থা প্রতিষ্ঠা করেন। ১৯২৩ সালে আনুষ্ঠানিকভাবে বিশ্বভারতী প্রতিষ্ঠিত হয়। দীর্ঘজীবনে তিনি বহুবার বিদেশ ভ্রমণ করেন এবং সমগ্র বিশ্বে বিশ্বভ্রাতৃত্বের বাণী প্রচার করেন। ১৯৪১ সালে দীর্ঘ রোগভোগের পর কলকাতার পৈত্রিক বাসভবনেই তাঁর মৃত্যু হয়।রবীন্দ্রনাথের কাব্যসাহিত্যের বৈশিষ্ট্য ভাবগভীরতা, গীতিধর্মিতা চিত্ররূপময়তা, অধ্যাত্মচেতনা, ঐতিহ্যপ্রীতি, প্রকৃতিপ্রেম, মানবপ্রেম, স্বদেশপ্রেম, বিশ্বপ্রেম, রোম্যান্টিক সৌন্দর্যচেতনা, ভাব, ভাষা, ছন্দ ও আঙ্গিকের বৈচিত্র্য, বাস্তবচেতনা ও প্রগতিচেতনা। রবীন্দ্রনাথের গদ্যভাষাও কাব্যিক। ভারতের ধ্রুপদি ও লৌকিক সংস্কৃতি এবং পাশ্চাত্য বিজ্ঞানচেতনা ও শিল্পদর্শন তাঁর রচনায় গভীর প্রভাব বিস্তার করেছিল। কথাসাহিত্য ও প্রবন্ধের মাধ্যমে তিনি সমাজ, রাজনীতি ও রাষ্ট্রনীতি সম্পর্কে নিজ মতামত প্রকাশ করেছিলেন। সমাজকল্যাণের উপায় হিসেবে তিনি গ্রামোন্নয়ন ও গ্রামের দরিদ্র মানুষ কে শিক্ষিত করে তোলার পক্ষে মতপ্রকাশ করেন। এর পাশাপাশি সামাজিক ভেদাভেদ, অস্পৃশ্যতা, ধর্মীয় গোঁড়ামি ও ধর্মান্ধতার বিরুদ্ধেও তিনি তীব্র প্রতিবাদ জানিয়েছিলেন। রবীন্দ্রনাথের দর্শনচেতনায় ঈশ্বরের মূল হিসেবে মানব সংসারকেই নির্দিষ্ট করা হয়েছে; রবীন্দ্রনাথ দেববিগ্রহের পরিবর্তে কর্মী অর্থাৎ মানুষ ঈশ্বরের পূজার কথা বলেছিলেন। সংগীত ও নৃত্যকে তিনি শিক্ষার অপরিহার্য অঙ্গ মনে করতেন। রবীন্দ্রনাথের গান তাঁর অন্যতম শ্রেষ্ঠ কীর্তি। তাঁর রচিত “”আমার সোনার বাংলা”” ও “”জনগণমন-অধিনায়ক জয় হে”” গানদুটি যথাক্রমে গণপ্রজাতন্ত্রী বাংলাদেশ ও ভারতীয় প্রজাতন্ত্রের জাতীয় সংগীত।


প্রথম জীবন (১৮৬১–১৯০১).

শৈশব ও কৈশোর (১৮৬১ – ১৮৭৮).
রবীন্দ্রনাথ ঠাকুর কলকাতার জোড়াসাঁকো ঠাকুরবাড়িতে জন্মগ্রহণ করেছিলেন। তাঁর পিতা ছিলেন ব্রাহ্ম ধর্মগুরু দেবেন্দ্রনাথ ঠাকুর (১৮১৭–১৯০৫) এবং মাতা ছিলেন সারদাসুন্দরী দেবী (১৮২৬–১৮৭৫)। রবীন্দ্রনাথ ছিলেন পিতামাতার চতুর্দশ সন্তান। জোড়াসাঁকোর ঠাকুর পরিবার ছিল ব্রাহ্ম আদিধর্ম মতবাদের প্রবক্তা। রবীন্দ্রনাথের পূর্ব পুরুষেরা খুলনা জেলার রূপসা উপজেলা পিঠাভোগে বাস করতেন। ১৮৭৫ সালে মাত্র চোদ্দ বছর বয়সে রবীন্দ্রনাথের মাতৃবিয়োগ ঘটে। পিতা দেবেন্দ্রনাথ দেশভ্রমণের নেশায় বছরের অধিকাংশ সময় কলকাতার বাইরে অতিবাহিত করতেন। তাই ধনাঢ্য পরিবারের সন্তান হয়েও রবীন্দ্রনাথের ছেলেবেলা কেটেছিল ভৃত্যদের অনুশাসনে। শৈশবে রবীন্দ্রনাথ কলকাতার ওরিয়েন্টাল সেমিনারি, নর্ম্যাল স্কুল, বেঙ্গল অ্যাকাডেমি এবং সেন্ট জেভিয়ার্স কলেজিয়েট স্কুলে কিছুদিন করে পড়াশোনা করেছিলেন। কিন্তু বিদ্যালয়-শিক্ষায় অনাগ্রহী হওয়ায় বাড়িতেই গৃহশিক্ষক রেখে তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। ছেলেবেলায় জোড়াসাঁকোর বাড়িতে অথবা বোলপুর ও পানিহাটির বাগানবাড়িতে প্রাকৃতিক পরিবেশের মধ্যে ঘুরে বেড়াতে বেশি স্বচ্ছন্দবোধ করতেন রবীন্দ্রনাথ।১৮৭৩ সালে এগারো বছর বয়সে রবীন্দ্রনাথের উপনয়ন অনুষ্ঠিত হয়েছিল। এরপর তিনি কয়েক মাসের জন্য পিতার সঙ্গে দেশভ্রমণে বের হন। প্রথমে তাঁরা আসেন শান্তিনিকেতনে। এরপর পাঞ্জাবের অমৃতসরে কিছুকাল কাটিয়ে শিখদের উপাসনা পদ্ধতি পরিদর্শন করেন। শেষে পুত্রকে নিয়ে দেবেন্দ্রনাথ যান পাঞ্জাবেরই (অধুনা ভারতের হিমাচল প্রদেশ রাজ্যে অবস্থিত) ডালহৌসি শৈলশহরের নিকট বক্রোটায়। এখানকার বক্রোটা বাংলোয় বসে রবীন্দ্রনাথ পিতার কাছ থেকে সংস্কৃত ব্যাকরণ, ইংরেজি, জ্যোতির্বিজ্ঞান, সাধারণ বিজ্ঞান ও ইতিহাসের নিয়মিত পাঠ নিতে শুরু করেন। দেবেন্দ্রনাথ তাঁকে বিশিষ্ট ব্যক্তিবর্গের জীবনী, কালিদাস রচিত ধ্রুপদি সংস্কৃত কাব্য ও নাটক এবং উপনিষদ্‌ পাঠেও উৎসাহিত করতেন। ১৮৭৭ সালে “”ভারতী”” পত্রিকায় তরুণ রবীন্দ্রনাথের কয়েকটি গুরুত্বপূর্ণ রচনা প্রকাশিত হয়। এগুলি হল মাইকেল মধুসূদনের “”””, “”ভানুসিংহ ঠাকুরের পদাবলী”” এবং “””” ও “””” নামে দুটি গল্প। এর মধ্যে “”ভানুসিংহ ঠাকুরের পদাবলী”” বিশেষভাবে উল্লেখযোগ্য। এই কবিতাগুলি রাধা-কৃষ্ণ বিষয়ক পদাবলির অনুকরণে “”ভানুসিংহ”” ভণিতায় রচিত। রবীন্দ্রনাথের “”ভিখারিণী”” গল্পটি (১৮৭৭) বাংলা সাহিত্যের প্রথম ছোটগল্প। ১৮৭৮ সালে প্রকাশিত হয় রবীন্দ্রনাথের প্রথম কাব্যগ্রন্থ তথা প্রথম মুদ্রিত গ্রন্থ “”কবিকাহিনী””। এছাড়া এই পর্বে তিনি রচনা করেছিলেন “””” (১৮৮২) কাব্যগ্রন্থটি। রবীন্দ্রনাথের বিখ্যাত কবিতা “””” এই কাব্যগ্রন্থের অন্তর্গত।

যৌবন (১৮৭৮-১৯০১).
১৮৭৮ সালে ব্যারিস্টারি পড়ার উদ্দেশ্যে ইংল্যান্ডে যান রবীন্দ্রনাথ। প্রথমে তিনি ব্রাইটনের একটি পাবলিক স্কুলে ভর্তি হয়েছিলেন। ১৮৭৯ সালে ইউনিভার্সিটি কলেজ লন্ডনে আইনবিদ্যা নিয়ে পড়াশোনা শুরু করেন। কিন্তু সাহিত্যচর্চার আকর্ষণে সেই পড়াশোনা তিনি সমাপ্ত করতে পারেননি। ইংল্যান্ডে থাকাকালীন শেকসপিয়র ও অন্যান্য ইংরেজ সাহিত্যিকদের রচনার সঙ্গে রবীন্দ্রনাথের পরিচয় ঘটে। এই সময় তিনি বিশেষ মনোযোগ সহকারে পাঠ করেন “”রিলিজিও মেদিচি””, “”কোরিওলেনাস”” এবং “”অ্যান্টনি অ্যান্ড ক্লিওপেট্রা””। এই সময় তাঁর ইংল্যান্ডবাসের অভিজ্ঞতার কথা “”ভারতী”” পত্রিকায় পত্রাকারে পাঠাতেন রবীন্দ্রনাথ। উক্ত পত্রিকায় এই লেখাগুলি জ্যেষ্ঠভ্রাতা দ্বিজেন্দ্রনাথ ঠাকুরের সমালোচনাসহ প্রকাশিত হত “””” নামে। ১৮৮১ সালে সেই পত্রাবলি “””” নামে গ্রন্থাকারে ছাপা হয়। এটিই ছিল রবীন্দ্রনাথের প্রথম গদ্যগ্রন্থ তথা প্রথম চলিত ভাষায় লেখা গ্রন্থ। অবশেষে ১৮৮০ সালে প্রায় দেড় বছর ইংল্যান্ডে কাটিয়ে কোনো ডিগ্রি না নিয়ে এবং ব্যারিস্টারি পড়া শুরু না করেই তিনি দেশে ফিরে আসেন।১৮৮৩ সালের ৯ ডিসেম্বর (২৪ অগ্রহায়ণ, ১২৯০ বঙ্গাব্দ) ঠাকুরবাড়ির অধস্তন কর্মচারী বেণীমাধব রায়চৌধুরীর কন্যা ভবতারিণীর সঙ্গে রবীন্দ্রনাথের বিবাহ সম্পন্ন হয়। বিবাহিত জীবনে ভবতারিণীর নামকরণ হয়েছিল মৃণালিনী দেবী (১৮৭৩–১৯০২ )। রবীন্দ্রনাথ ও মৃণালিনীর সন্তান ছিলেন পাঁচ জন: মাধুরীলতা (১৮৮৬–১৯১৮), রথীন্দ্রনাথ (১৮৮৮–১৯৬১), রেণুকা (১৮৯১–১৯০৩), মীরা (১৮৯৪–১৯৬৯) এবং শমীন্দ্রনাথ (১৮৯৬–১৯০৭)। এঁদের মধ্যে অতি অল্প বয়সেই রেণুকা ও শমীন্দ্রনাথের মৃত্যু ঘটে।১৮৯১ সাল থেকে পিতার আদেশে নদিয়া (নদিয়ার উক্ত অংশটি অধুনা বাংলাদেশের কুষ্টিয়া জেলা), পাবনা ও রাজশাহী জেলা এবং উড়িষ্যার জমিদারিগুলির তদারকি শুরু করেন রবীন্দ্রনাথ। কুষ্টিয়ার শিলাইদহের কুঠিবাড়িতে রবীন্দ্রনাথ দীর্ঘ সময় অতিবাহিত করেছিলেন। জমিদার রবীন্দ্রনাথ শিলাইদহে “”পদ্মা”” নামে একটি বিলাসবহুল পারিবারিক বজরায় চড়ে প্রজাবর্গের কাছে খাজনা আদায় ও আশীর্বাদ প্রার্থনা করতে যেতেন। গ্রামবাসীরাও তাঁর সম্মানে ভোজসভার আয়োজন করত।১৮৯০ সালে রবীন্দ্রনাথের অপর বিখ্যাত কাব্যগ্রন্থ “””” প্রকাশিত হয়। কুড়ি থেকে ত্রিশ বছর বয়সের মধ্যে তাঁর আরও কয়েকটি উল্লেখযোগ্য কাব্যগ্রন্থ ও গীতিসংকলন প্রকাশিত হয়েছিল। এগুলি হলো “”””, “”””, “”রবিচ্ছায়া””, “””” ইত্যাদি। ১৮৯১ থেকে ১৮৯৫ সাল পর্যন্ত নিজের সম্পাদিত “”সাধনা”” পত্রিকায় রবীন্দ্রনাথের বেশ কিছু উৎকৃষ্ট রচনা প্রকাশিত হয়। তাঁর সাহিত্যজীবনের এই পর্যায়টি তাই “”সাধনা পর্যায়”” নামে পরিচিত। রবীন্দ্রনাথের “”গল্পগুচ্ছ”” গ্রন্থের প্রথম চুরাশিটি গল্পের অর্ধেকই এই পর্যায়ের রচনা। এই ছোটগল্পগুলিতে তিনি বাংলার গ্রামীণ জনজীবনের এক আবেগময় ও শ্লেষাত্মক চিত্র এঁকেছিলেন।

  • Preprocess the csv files with the following code using regular expressions (to get rid of punctuations). Remember we need to decode to utf-8 first, since we have unicode input files.
from glob import glob
import re
words = []
for f in glob('bangla/wiki/*.csv'):
    words += re.sub('[\r\n—?,;।!‘"’\.:\(\)\[\]…0-9]', ' ', open(f, 'rb').read().decode('utf8').strip()).split(' ')
words = list(filter(lambda x: not x in ['', '-'], words))
# 13964346
# 'ঠাকুর',
# 'রবীন্দ্রনাথ',
# 'ঠাকুর',
# '৭ই',
# 'মে',
# '১৮৬১',
# '৭ই',
# 'আগস্ট',
# '১৯৪১',
# '২৫',
# 'বৈশাখ',
# '১২৬৮',
# '২২',
# 'শ্রাবণ',
# '১৩৪৮',
# 'বঙ্গাব্দ',
# 'ছিলেন',
# 'অগ্রণী',
# 'বাঙালি',
# 'কবি',
# 'ঔপন্যাসিক',
# 'সংগীতস্রষ্টা',
# 'নাট্যকার',
# 'চিত্রকর']

  • Create indices for unique words in the dataset.
vocabulary_size = 25000
def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
# Most common words (+UNK) [['UNK', 1961151], ('এবং', 196916), ('ও', 180042), ('হয়', 160533), ('করে', 131206)]
print('Sample data', data[:10])
#Sample data [1733, 1868, 1733, 1868, 5769, 287, 6855, 5769, 400, 2570]
del words  # Hint to reduce memory.
  • Generate batches to be trained with the word2vec skip-gram model.
  • The target label should be at the center of the buffer each time. That is, given a focus word, our goal will be to learn the most probable context words.
  • The input and the target vector will depend on num_skips and skip_window.
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # 
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])
# data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে', '১৮৬১', '৭ই']
for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % 
          (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    # data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে',  '১৮৬১', '৭ই']
    # with num_skips = 2 and skip_window = 1:
    # batch: ['ঠাকুর', 'ঠাকুর', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর',  '৭ই', '৭ই']
    # labels: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', '৭ই', 'রবীন্দ্রনাথ', 'ঠাকুর', 'মে']
    # with num_skips = 4 and skip_window = 2:
    # batch: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর']
    # labels: ['রবীন্দ্রনাথ', '৭ই', 'ঠাকুর', 'ঠাকুর', 'মে', 'ঠাকুর', 'রবীন্দ্রনাথ', '৭ই']
  • Pick a random validation set to sample nearest neighbors.
  • Limit the validation samples to the words that have a low numeric ID, which by construction are also the most frequent.
  • Look up embeddings for inputs and compute the softmax loss, using a sample of the negative labels each time (this is known as negative sampling, which is used to make the computation efficient, since the number of labels are often too high).
  • The optimizer will optimize the softmax_weights and the embeddings.
    This is because the embeddings are defined as a variable quantity and the optimizer’s `minimize` method will by default modify all variable quantities that contribute to the tensor it is passed.
  • Compute the similarity between minibatch examples and all embeddings.
 batch_size = 128
 embedding_size = 128 # Dimension of the embedding vector.
 skip_window = 1 # How many words to consider left and right.
 num_skips = 2 # #times to reuse an input to generate a label.
 valid_size = 16 # Random set of words to evaluate similarity on.
 valid_window = 100 # Only pick dev samples in the head of the   
                    # distribution.
 valid_examples = np.array(random.sample(range(valid_window), 
 num_sampled = 64 # Number of negative examples to sample.
 graph = tf.Graph()
 with graph.as_default(), tf.device('/cpu:0'):  
 # Input data.
   train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
   train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
   valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
 # Variables.
   embeddings = tf.Variable( \
     tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
   softmax_weights = tf.Variable( \
     tf.truncated_normal([vocabulary_size, embedding_size], \
                          stddev=1.0 / math.sqrt(embedding_size)))
   softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
 # Model.
   embed = tf.nn.embedding_lookup(embeddings, train_dataset)
   loss = tf.reduce_mean( \
                          tf.nn.sampled_softmax_loss(weights=softmax_weights, \   
                          biases=softmax_biases, inputs=embed, labels=train_labels, \ 
                          num_sampled=num_sampled, num_classes=vocabulary_size))
 # Optimizer.
 optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)  
 # use the cosine distance:
 norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
   normalized_embeddings = embeddings / norm
   valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
   similarity = tf.matmul(valid_embeddings,  tf.transpose(normalized_embeddings))
  • Train the word2vec model with the batches constructed, for 100k steps.
num_steps = 100001with tf.Session(graph=graph) as session:
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, \
                 train_labels : batch_labels}
    _, l =[optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 
      # 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
      # note that this is expensive (~20% slowdown if computed every 
      # 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
  final_embeddings = normalized_embeddings.eval()
  • The following shows how the loss function decreases with the increase in training steps.
  • During the training process, the words that become semantically near come closer in the embedding space.
Image for post
  • Use t-SNE plot to map the following words from 128-dimensional embedding space to 2 dimensional manifold and visualize.
words = ['রাজা', 'রাণী', 'ভারত','বাংলাদেশ','দিল্লী','কলকাতা','ঢাকা',
         'বিদেশ','লাভ','মানুষ', 'এবং', 'ও', 'গান', 'সঙ্গীত', 'বাংলা', 
         'ইংরেজি', 'ভাষা', 'কাজ', 'অনেক', 'জেলার', 'বাংলাদেশের', 
         'এক', 'দুই', 'তিন', 'চার', 'পাঁচ', 'দশ', '১', '৫', '২০', 
         'নবম', 'ভাষার', '১২', 'হিসাবে', 'যদি', 'পান', 'শহরের', 'দল', 
         'যদিও', 'বলেন', 'রান', 'করেছে', 'করে', 'এই', 'করেন', 'তিনি', 
         'একটি', 'থেকে', 'করা', 'সালে', 'এর', 'যেমন', 'সব',  'তার', 
         'খেলা',  'অংশ', 'উপর', 'পরে', 'ফলে',  'ভূমিকা', 'গঠন',  
         'তা', 'দেন', 'জীবন', 'যেখানে', 'খান', 'এতে',  'ঘটে', 'আগে', 
         'ধরনের', 'নেন', 'করতেন', 'তাকে', 'আর', 'যার', 'দেখা', 
         'বছরের', 'উপজেলা', 'থাকেন', 'রাজনৈতিক', 'মূলত', 'এমন', 
         'কিলোমিটার', 'পরিচালনা', '২০১১', 'তারা', 'তিনি', 'যিনি', 'আমি',  
         'তুমি', 'আপনি', 'লেখিকা', 'সুখ', 'বেদনা', 'মাস', 'নীল', 'লাল', 
         'সবুজ', 'সাদা', 'আছে', 'নেই', 'ছুটি', 'ঠাকুর',
         'দান', 'মণি', 'করুণা', 'মাইল', 'হিন্দু', 'মুসলমান','কথা', 'বলা',     
         'সেখানে', 'তখন', 'বাইরে', 'ভিতরে', 'ভগবান' ]
indices = []
for word in words:
    #print(word, dictionary[word])
two_d_embeddings = tsne.fit_transform(final_embeddings[indices, :])
plot(two_d_embeddings, words)
  • The following figure shows how the words similar in meaning are mapped to embedding vectors that are close to each other.
  • Also, note that arithmetic property of the word embeddings: e.g., the words ‘রাজা’ and ‘রাণী’ are approximately along the same distance and direction as the words ‘লেখক’ and ‘লেখিকা’, reflecting the fact that the nature of the semantic relatedness in terms of gender is same.
Image for post
  • The following animation shows how the embedding is learnt to preserve the semantic similarity in the 2D-manifold more and more as training proceeds.

Generating song-like texts with LSTM from Tagore’s Bangla songs

Text generation with Character LSTM

  • Let’s import the required libraries first.
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop, Adam
import io, re
  • Read the input file, containing few selected songs of Tagore in Bangla.
raw_text = open('rabindrasangeet.txt','rb').read().decode('utf8')


অগ্নিবীণা বাজাও তুমি

অগ্নিবীণা বাজাও তুমি কেমন ক’রে !
আকাশ কাঁপে তারার আলোর গানের ঘোরে ।।
তেমনি ক’রে আপন হাতে ছুঁলে আমার বেদনাতে,
নূতন সৃষ্টি জাগল বুঝি জীবন-‘পরে ।।
বাজে ব’লেই বাজাও তুমি সেই গরবে,
ওগো প্রভু, আমার প্রাণে সকল সবে ।
বিষম তোমার বহ্নিঘাতে বারে বারে আমার রাতে
জ্বালিয়ে দিলে নূতন তারা ব্যথায় ভ’রে ।।

অচেনাকে ভয় কী
অচেনাকে ভয় কী আমার ওরে?
অচেনাকেই চিনে চিনে উঠবে জীবন ভরে ।।
জানি জানি আমার চেনা কোনো কালেই ফুরাবে না,
চিহ্নহারা পথে আমায় টানবে অচিন ডোরে ।।
ছিল আমার মা অচেনা, নিল আমায় কোলে ।
সকল প্রেমই অচেনা গো, তাই তো হৃদয় দোলে ।
অচেনা এই ভুবন-মাঝে কত সুরেই হৃদয় বাজে-
অচেনা এই জীবন আমার, বেড়াই তারি ঘোরে ।।অন্তর মম
অন্তর মম বিকশিত করো অন্তরতর হে-
নির্মল করো, উজ্জ্বল করো, সুন্দর করো হে ।।
জাগ্রত করো, উদ্যত করো, নির্ভয় করো হে ।।
মঙ্গল করো, নিরলস নিঃসংশয় করো হে ।।
যুক্ত করো হে সবার সঙ্গে, মুক্ত করো হে বন্ধ ।
সঞ্চার করো সকল কর্মে শান্ত তোমার ছন্দ ।
চরণপদ্মে মম চিত নিস্পন্দিত করো হে ।
নন্দিত করো, নন্দিত করো, নন্দিত করো হে ।।

অন্তরে জাগিছ অন্তর্যামী
অন্তরে জাগিছ অন্তর্যামী ।
  • Here we shall be using a many-to-many RNN as shown in the next figure.
  • Pre-process the text and create character indices to be used as the input in the model.
processed_text = raw_text.lower()
print('corpus length:', len(processed_text))
# corpus length: 207117
chars = sorted(list(set(processed_text)))
print('total chars:', len(chars))
# total chars: 89
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
  • Cut the text in semi-redundant sequences of maxlen characters.
def is_conjunction(c):
  h = ord(c) # print(hex(ord(c)))
  return (h >= 0x980 and h = 0x9bc and h = 0x9f2)
maxlen = 40
step = 2
sentences = []
next_chars = []
i = 0
while i < len(processed_text) - maxlen:
  if is_conjunction(processed_text[i]):
    i += 1
  sentences.append(processed_text[i: i + maxlen])
  next_chars.append(processed_text[i + maxlen])
  i += step
  print('nb sequences:', len(sentences))
  # nb sequences: 89334
  • Create one-hot-encodings.
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=np.bool)
y = np.zeros((len(sentences), len(chars)), dtype=np.bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1
  • Build a model, a single LSTM.
model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars))))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
optimizer = Adam(lr=0.01) #RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
  • The following figure how the model architecture looks like:
Image for post
  • Print the model summary.

Model: "sequential" 
Layer (type)                 Output Shape              Param #    
lstm (LSTM)                  (None, 256)               354304     
dense (Dense)                (None, 128)               32896      
dense_1 (Dense)              (None, 89)                11481      
Total params: 398,681 Trainable params: 398,681 Non-trainable params: 0 
  • Use the following helper function to sample an index from a probability array.
def sample(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)
  • Fit the model and register a callback to print the text generated by the model at the end of each epoch.
print_callback = LambdaCallback(on_epoch_end=on_epoch_end), y, batch_size=128, epochs=60, callbacks=[print_callback])
  • The following animation shows how the model generates song-like texts with given seed texts, for different values of the temperature parameter.
Image for post

Text Generation with Word LSTM

  • Pre-process the input text, split by punctuation characters and create word indices to be used as the input in the model.
processed_text = raw_text.lower()
from string import punctuation
r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
words = r.split(processed_text)
# ['পূজা',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'কেমন',
# 'ক’রে',
# 'আকাশ',
# 'কাঁপে',
# 'তারার',
# 'আলোর',
# 'গানের',
# 'ঘোরে',
# '।।']

unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
index_unique_word = dict((i, c) for i, c in enumerate(unique_words))
  • Create a word-window of length 5 to predict the next word.
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
# ['অগ্নিবীণা', 'বাজাও', 'তুমি', 'অগ্নিবীণা', 'বাজাও']
# তুমি
# 7847
  • Create OHE for input and output words as done for character-RNN. Fit the model on the pre-rpocessed data.
print_callback = LambdaCallback(on_epoch_end=on_epoch_end), Y,
  • The following animation shows the song -like text generated by the word-LSTM at the end of an epoc.

Bangla Sentiment Analysis using LSTM with Daily Astrological Prediction Dataset

  • Let’s first create sentiment analysis dataset by crawling the daily astrological predictions (রাশিফল) page of the online edition of আনন্দবাজার পত্রিকা (e.g., for the year 2013), a leading Bangla newspaper and then manually labeling the sentiment of each of the predictions corresponding to each moon-sign.
  • Read the csv dataset, the first few lines look like the following.
df = pd.read_csv('horo_2013_labeled.csv')
pd.set_option('display.max_colwidth', 135) 
  • Transform each text in texts in a sequence of integers.
tokenizer = Tokenizer(num_words=2000, split=' ')
tokenizer.fit_on_texts(df['আপনার আজকের দিনটি'].values)
X = tokenizer.texts_to_sequences(df['আপনার আজকের দিনটি'].values)
X = pad_sequences(X)
#array([[   0,    0,    0, ...,   26,  375,    3],        
#       [   0,    0,    0, ...,   54,    8,    1],        
#       [   0,    0,    0, ...,  108,   42,   43],        
#       ...,        
#       [   0,    0,    0, ..., 1336,  302,   82],        
#       [   0,    0,    0, ..., 1337,  489,  218],        
#       [   0,    0,    0, ...,    2,  316,   87]])
  • Here we shall use a many-to-one RNN for sentiment analysis as shown below.
  • Build an LSTM model that takes a sentence as input and outputs the sentiment label.
model = Sequential()
model.add(Embedding(2000, 128,input_length = X.shape[1]))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])

Layer (type)                 Output Shape              Param #   
embedding_10 (Embedding)     (None, 12, 128)           256000    
spatial_dropout1d_10 (Spatia (None, 12, 128)           0         
lstm_10 (LSTM)               (None, 128)               131584    
dense_10 (Dense)             (None, 2)                 258       
Total params: 387,842
Trainable params: 387,842
Non-trainable params: 0
  • Divide the dataset into train and validation (test) dataset and train the LSTM model on the dataset.
Y = pd.get_dummies(df['sentiment']).values
X_train, X_test, Y_train, Y_test, _, indices = train_test_split(X,Y, np.arange(len(X)), test_size = 0.33, random_state = 5), Y_train, epochs = 5, batch_size=32, verbose = 2)

#Epoch 1/5  - 3s - loss: 0.6748 - acc: 0.5522 
#Epoch 2/5  - 1s - loss: 0.5358 - acc: 0.7925 
#Epoch 3/5  - 1s - loss: 0.2368 - acc: 0.9418 
#Epoch 4/5  - 1s - loss: 0.1011 - acc: 0.9761 
#Epoch 5/5  - 1s - loss: 0.0578 - acc: 0.9836 
  • Predict the sentiment labels of the (held out) test dataset.
result = model.predict(X[indices],batch_size=1,verbose = 2)
df1 = df.iloc[indices]
df1['neg_prob'] = result[:,0]
df1['pos_prob'] = result[:,1]
df1['pred'] = np.array(['negative', 'positive'])[np.argmax(result, axis=1)]
  • Finally, compute the accuracy of the model for the positive and negative ground-truth sentiment corresponding to daily astrological predictions.
df2 = df1[df1.sentiment == 'positive']
print('positive accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#positive accuracy:0.9177215189873418
df2 = df1[df1.sentiment == 'negative']
print('negative accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#negative accuracy:0.9352941176470588

Building a very simple Bangla Chatbot with RASA NLU

  • The following figure shows how to design a very simple Bangla chatbot to order food from restaurants using RASA NLU.
  • We need to design the intents, entities and slots to extract the entities properly and then design stories to define how the chatbot will respond to user inputs (core / dialog).
  • The following figure shows how the nlu, domain and stories files are written for the simple chatbot.
  • A sequence-to-sequence deep learning model is trained under the hood for intent classification. The next code block shows how the model can be trained.
import rasa
model_path = rasa.train('domain.yml', 'config.yml', ['data/'], 'models/')
  • The following gif demonstrates how the chatbot responds to user inputs.


Travelling Salesman Problem (TSP) with Python

In this blog we shall discuss on the Travelling Salesman Problem (TSP) — a very famous NP-hard problem and will take a few attempts to solve it (either by considering special cases such as Bitonic TSP and solving it efficiently or by using algorithms to improve runtime, e.g., using Dynamic programming, or by using approximation algorithms, e.g., for Metric TSP and heuristics, to obtain not necessarily optimal but good enough solutions, e.g., with Simulated Annealing and Genetic Algorithms) and work on the corresponding python implementations. Few of the problems discussed here appeared as programming assignments in the Coursera course Advanced Algorithms and Complexity and some of the problem statements are taken from the course.

Improving the runtime of the Travelling Salesman Problem with Dynamic Programming

In this problem we shall deal with a classical NP-complete problem called Traveling Salesman Problem. Given a graph with weighted edges, you need to find the shortest cycle visiting each vertex exactly once. Vertices correspond to cities. Edges weights correspond to the cost (e.g., time) to get from one vertex to another one. Some vertices may not be connected by an edge in the general case.

  • Here we shall use dynamic programming to solve TSP: instead of solving one problem we will solve a collection of (overlapping) subproblems.
  • A subproblem refers to a partial solution
  • A reasonable partial solution in case of TSP is the initial part of a cycle
  • To continue building a cycle, we need to know the last vertex as well as the set of already visited vertices
  • It will be convenient to assume that vertices are integers from 1 to n and that the salesman starts his trip in (and also returns back to) vertex 1.
  • The following figure shows the Dynamic programming subproblems, the recurrence relation and the algorithm for TSP with DP.
Image for post
Image for post

Implementation tips

  • In order to iterate through all subsets of {1, . . . , n}, it will be helpful to notice that there is a natural one-to-one correspondence between integers in the range from 0 and 2^n − 1 and subsets of {0, . . . , n − 1}: k ↔ {i : i -th bit of k is 1}.
  • For example, k = 1 (binary 001) corresponds to the set {0}, where k = 5 (binary 101) corresponds to the set {0,2}
  • In order to find out the integer corresponding to S − {j} (for j ∈ S), we need to flip the j-th bit of k (from 1 to 0). For this, in turn, we can compute a bitwise XOR of k and 2^j (that has 1 only in j-th position)
  • In order to compute the optimal path along with the cost, we need to maintain back-pointers to store the path.

The following python code shows an implementation of the above algorithm.

import numpy as np
from itertools import combinationsdef TSP(G):
   n = len(G)
   C = [[np.inf for _ in range(n)] for __ in range(1 << n)]
   C[1][0] = 0 # {0} <-> 1
   for size in range(1, n):
      for S in combinations(range(1, n), size):
	  S = (0,) + S
	  k = sum([1 << i for i in S])
	  for i in S:
	      if i == 0: continue
	      for j in S:
		 if j == i: continue
		    cur_index = k ^ (1 << i)
		    C[k][i] = min(C[k][i], C[cur_index][j]+ G[j][i])     
   all_index = (1 << n) - 1
   return min([(C[all_index][i] + G[0][i], i) \
                            for i in range(n)])
  • The following animation shows how the least cost solution cycle is computed with the DP for a graph with 4 vertices. Notice that in order to represent C(S,i) from the algorithm, the vertices that belong to the set S are colored with red circles, the vertex i where the path that traverses through all the nodes in S ends at is marked with a red double-circle.
  • The next animation also shows how the DP table gets updated. The DP table for a graph with 4 nodes will be of size 2⁴ X 4, since there are 2⁴=16 subsets of the vertex set V={0,1,2,3} and a path going through a subset of the vertices in V may end in any of the 4 vertex.
  • The transposed DP table is shown in the next animation, here the columns correspond to the subset of the vertices and rows correspond to the vertex the TSP ends at.
Image for post
Image by Author
  • The following animation shows how the least cost solution cycle is computed with the DP for a graph with 5 nodes.
Image for post
Image by Author

The following animation / figure shows the TSP optimal path is computed for increasing number of nodes (where the weights for the input graphs are randomly generated) and the exponential increase in the time taken.

Image for post
Image by Author
Image for post
Image by Author

Solving TSP with Integer Linear Program

Image for post

Solving with the mip package using the following python code, produces the output shown by the following animation, for a graph with randomly generated edge-weights.

from mip import Model, xsum, minimize, BINARY
def TSP_ILP(G):
   start = time()
   V1 =  range(len(G))
   n, V = len(G), set(V1)
   model = Model()   # binary variables indicating if arc (i,j) is used 
   # on the route or not
   x = [[model.add_var(var_type=BINARY) for j in V] for i in V]   # continuous variable to prevent subtours: each city will have a
   # different sequential id in the planned route except the 1st one
   y = [model.add_var() for i in V]   # objective function: minimize the distance
    model.objective = minimize(xsum(G[i][j]*x[i][j] \
                               for i in V for j in V))
   # constraint : leave each city only once
   for i in V:
      model += xsum(x[i][j] for j in V - {i}) == 1   # constraint : enter each city only once
   for i in V:
      model += xsum(x[j][i] for j in V - {i}) == 1   # subtour elimination
   for (i, j) in product(V - {0}, V - {0}):
      if i != j:
         model += y[i] - (n+1)*x[i][j] >= y[j]-n   # optimizing
   model.optimize()   # checking if a solution was found
   if model.num_solutions:
      print('Total distance {}'.format(model.objective_value))
      nc = 0 # cycle starts from vertex 0
      cycle = [nc]
      while True:
         nc = [i for i in V if x[nc][i].x >= 0.99][0]
	 if nc == 0:
   return (model.objective_value, cycle)

The constraint to prevent the subtours to appear in the solution is necessary, if we run without the constraint, we get a solution with subtours instead of a single cycle going through all the nodes, as shown below:

Image for post
Image by Author
Image for post
Image by Author

Comparing with Dynamic programming based solution, we can see that ILP is much more efficient for higher n values.

Image for post
Image by Author
Image for post
Image by Author

Bitonic TSP

Image for post
Image by Author
Image for post
Image by Author

The following python code snippet implements the above DP algorithm.

def dist(P, i, j):
   return np.sqrt((P[i][0]-P[j][0])**2+(P[i][1]-P[j][1])**2)
def BTSP(P):
   n = len(P)
   D = np.ones((n,n))*np.inf
   path = np.ones((n,n), dtype=int)*(-1)
   D[n-2,n-1] = dist(P, n-2, n-1)
   path[n-2,n-1] = n-1
   for i in range(n-3,-1,-1):
      m = np.inf
      for k in range(i+2,n):
	 if m > D[i+1,k] + dist(P,i,k):
	    m, mk = D[i+1,k] + dist(P,i,k), k
      D[i,i+1] = m
      path[i,i+1] = mk
      for j in range(i+2,n):
	 D[i,j] = D[i+1,j] + dist(P,i,i+1)
	 path[i,j] = i+1	
      D[0,0] = D[0,1] + dist(P,0,1)
      path[0,0] = 1
      return D, path
def get_tsp_path(path, i, j, n):
    if n < 0:
	return []
    if i <= j:
 	k = path[i,j]
	return [k] + get_tsp_path(path, k, j, n-1)
	k = path[j,i]
	return get_tsp_path(path, i, k, n-1) + [k]

The following animation shows how the DP table is computed and the optimal path for Bitonic TSP is constructed. It also shows the final optimal path.

Image for post
Image by Author
Image for post
Image by Author

2-OPT Approximation Algorithm for Metric TSP

Image for post

The next code snippet implements the above 2-OPT approximation algorithm.

import numpy as np
import queue	def dfs(adj, x):
   visited = [False]*len(adj)
   stack = [x]
   visited[x] = True
   path = []
   while len(stack) > 0:
      u = stack.pop(-1)
      for v in adj[u]:
	 if not visited[v]:
   	    visited[v] = True
	return pathdef mst(adj):
   inf = np.inf
   c = [inf]*n
   s = 0
   c[s] = 0
   visited = [False]*n
   parent = [None]*n
   h = queue.PriorityQueue()
   for v in range(n):
      h.put((c[v], v))
   edges = []
   while not h.empty():
      w, u = h.get()
      if visited[u]: continue
           visited[u] = True
	   if parent[u] != None:
		edges.append((parent[u], u))
		for v in range(n):
    		   if v == u: continue
		   if (not visited[v]) and (c[v] > adj[u][v]):
			c[v] = adj[u][v]
			parent[v] = u
			h.put((c[v], v))
   adj = [[] for _ in range(n)]
   for i in range(n):
      if parent[i] != None:
   path = dfs(adj, 0)
   path += [path[0]]
   return path

The following animation shows the TSP path computed with the above approximation algorithm and compares with the OPT path computed using ILP for 20 points on 2D plane. The MST is computed with Prim’s algorithm.

Image for post
Image by Author

TSP with Simulated Annealing

Image for post
Image for post

The following python code snippet shows how to implement the Simulated Annealing to solve TSP, here G represents the adjacency matrix of the input graph.

def TSP_SA(G):
   s = list(range(len(G)))
   c = cost(G, s)
   ntrial = 1
   T = 30
   alpha = 0.99
   while ntrial <= 1000:
      n = np.random.randint(0, len(G))
      while True:
         m = np.random.randint(0, len(G))
         if n != m:
      s1 = swap(s, m, n)
      c1 = cost(G, s1)
      if c1 < c:
         s, c = s1, c1
         if np.random.rand() < np.exp(-(c1 - c)/T):
            s, c = s1, c1
      T = alpha*T
      ntrial += 1def swap(s, m, n):
   i, j = min(m, n), max(m, n)
   s1 = s.copy()
   while i < j:
      s1[i], s1[j] = s1[j], s1[i]
      i += 1
      j -= 1
   return s1
def cost(G, s):
   l = 0
   for i in range(len(s)-1):
      l += G[s[i]][s[i+1]]
   l += G[s[len(s)-1]][s[0]] 
   return l

The following animations show how the algorithm works:

Image for post
Image by Author
Image for post
Image by Author

The following animation shows the TSP path computed with SA for 100 points in 2D.

Image for post
Image by Author

TSP with Genetic Algorithm

Image for post

Here in the following implementation of the above algorithm we shall have the following assumptions:

  • We shall assume the crossover rate is 1.0, i.e., all individuals in a population participate in crossover. The mutation probability to be used is 0.1.
  • With each crossover operation between two parent chromosomes, couple of children are generated, cant just swap portions of parents chromosomes, need to be careful to make sure that the offspring represents valid TSP path.
  • Mutation is similar to swap operation implemented earlier.
  • For each generation we shall keep a constant k=20 (or 30) chromosomes (representing candidate solutions for TSP).
  • The fitness function will be the cost of the TSP path represented by each chromosome. Hence, we want to minimize the value of the fitness function — i.e., less the value of a chromosome, more fit is it to survive.
  • We shall use rank selection, i.e., after crossover and mutation, only the top k fittest offspring (i.e., with least fitness function value) will survive for the next generation.
  • The following python code shows the implementation of the above algorithm with the above assumptions.

import numpy as np

def do_crossover(s1, s2, m):
   s1, s2 = s1.copy(), s2.copy()
   c1 = s2.copy()
   for i in range(m, len(s1)): c1.remove(s1[i])
   for i in range(m, len(s1)): c1.append(s1[i])
   c2 = s1.copy()
   for i in range(m, len(s2)): c2.remove(s2[i])
   for i in range(m, len(s2)): c2.append(s2[i])	
   return (c1, c2)
def do_mutation(s, m, n):
   i, j = min(m, n), max(m, n)
   s1 = s.copy()
   while i < j:
	s1[i], s1[j] = s1[j], s1[i]
	i += 1
	j -= 1
   return s1
def compute_fitness(G, s):
   l = 0
   for i in range(len(s)-1):
	l += G[s[i]][s[i+1]]
	l += G[s[len(s)-1]][s[0]]	
   return l
def get_elite(G, gen, k):
   gen = sorted(gen, key=lambda s: compute_fitness(G, s))
   return gen[:k]
def TSP_GA(G, k=20, ntrial = 200):
    n_p = k
    mutation_prob = 0.1
    gen = []
    path = list(range(len(G)))
    while len(gen) < n_p:
  	path1 = path.copy()
	if not path1 in gen:
    for trial in range(ntrial):
	gen = get_elite(G, gen, k)
	gen_costs = [(round(compute_fitness(G, s),3), s) \
                      for s in gen]
	next_gen = []
	for i in range(len(gen)):
   	   for j in range(i+1, len(gen)):
   	       c1, c2 = do_crossover(gen[i], gen[j], \
		        np.random.randint(0, len(gen[i])))
	   if np.random.rand() < mutation_prob:
	       m = np.random.randint(0, len(gen[i]))
	       while True:
	          n = np.random.randint(0, len(gen[i]))
		  if m != n:
	       c = do_mutation(gen[i], m, n)
	gen = next_gen
Image for post
Image by Author

The following animation shows the TSP path computed with GA for 100 points in 2D.

Image for post

Coping with a few NP-Hard Problems with Python

Solving classic NP-hard problems such as 3-Coloring and Hamiltonian path with SAT solvers

In this blog we shall continue our discussion on a few NP-complete / NP-hard problems and will attempt to solve them (e.g., encoding the problem to satisfiability problem and solving with a SAT-solver) and the corresponding python implementations. The problems discussed here appeared as programming assignments in the coursera course Advanced Algorithms and Complexity. The problem statements are taken from the course itself.

Coloring a Graph with 3 colors using a SAT solver

Given a graph, we need to color its vertices into 3 different colors, so that any two vertices connected by an edge need to be of different colors. Graph coloring is an NP-complete problem, so we don’t currently know an efficient solution to it, and you need to reduce it to an instance of SAT problem which, although it is NP-complete, can often be solved efficiently in practice using special programs called SAT-solvers.

We can reduce the real-world problem about assigning frequencies to the transmitting towers of the cells in a GSM network to a problem of proper coloring a graph into 3 colors.Colors correspond to frequencies, vertices correspond to cells, and edges connect neighboring cells, as shown below.

Image for post
Image taken from here
Image for post
  • We need to output a boolean formula in the conjunctive normal form (CNF) in a specific format. If it is possible to color the vertices of the input graph in 3 colors such that any two vertices connected by an edge are of different colors, the formula must be satisfiable. Otherwise, the formula must be unsatisfiable.
  • We shall use pySAT SAT-solver to solve the clauses generated from the graph.
  • In particular, if there are n nodes and m edges in the graph that is to be colored with 3 colors, we need to generate 3*n variables and 3*m + n clauses.
  • The i-th vertex will correspond to 3 variables, with id i, n+i and 2*n+i. They will represent whether the node is to be colored by red, green or blue.
  • Any two vertices forming an edge must not have the same color.
  • Each vertex must be colored by one from the 3 colors.

The following python code snippet shows how the encoding needs to be implemented to output the desired clauses for the CNF, subsequently solved by the SAT-solver and the solution assignments is used to color the vertices of the graph. Again, there are following 3 basic steps:

  • encode the input graph into SAT formula (reduce the 3-coloring problem to a satisfiability problem)
  • solve with a SAT-solver and obtain the solution in terms of variable assignments
  • Decode the solution — use the solution assignments by SAT-solver to color the vertices of the graph

import numpy as np
from pysat.solvers import Glucose3def get_colors(assignments):
    all_colors = np.array(['red', 'green', 'blue'])
    colors = {}
    for v in range(n):
	colors[v+1] = all_colors[[assignments[v]>0, \
	      assignments[n+v]>0, assignments[2*n+v]>0]][0]
	return colorsdef print_clauses(clauses):
    for c in clauses:
	vars = []
	for v in c:
    	   vars.append('{}x{}'.format('¬' if v < 0 else '', abs(v)))
	print('(' + ' OR '.join(vars) + ')')
def print_SAT_solution(assignments):
    sol = ''
    for x in assignments:
	sol += 'x{}={} '.format(abs(x),x>0)
def solve_graph_coloring():
    # encode the input graph into SAT formula
    n_clauses = 3*m+n
    n_vars = 3*n
    clauses = []
    for u, v in edges:
 	clauses.append((-u, -v)) # corresponding to red color
	clauses.append((-(n+u), -(n+v))) # corresponding to green 
 	clauses.append((-(2*n+u), -(2*n+v))) # corresponds to blue
    for v in range(1, n+1):
        clauses.append((v, n+v, 2*n+v)) # at least one color
    # solve SAT and obtain solution in terms of variable assignments
    g = Glucose3()
    for c in clauses:
    status = g.solve()
    assignments = g.get_model()
    # use the solution assignment by SAT to color the graph
    colors = get_colors(assignments)

The following animations show the input graphs, the corresponding variables and clauses generated for the CNF to be solved with the SAT-solver, the solution obtained in terms of truth-assignments of the variables and then how they are used solve the graph-coloring problem, to color the graph with 3 colors. Each row in the right subgraph represents a clause in the CNF.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

The last graph is the Petersen graph, that has chromatic number 3, so can be colored with 3 colors and a 3-coloring solution is obtained by the SAT-solver output assignments.

Now, the following example shows an input graph which is NOT 3-colorable, with the CNF formed by the clauses generated to be solved by the SAT-solver being unsatisfiable / inconsistent.

Image for post
Image by Author

Solving the Hamiltonian Path problem with SAT-solver

In this problem, we shall learn how to solve the classic Hamiltonian Path problem, by designing and implementing an efficient algorithm to reduce it to SAT.

Image for post

The following python snippet shows

  1. how the Hamiltonian Path problem is reduced to SAT

2. Then it’s solved by the pysat SAT-solver.

3. The solution is interpreted to construct the Hamiltonian path for the following input graphs.

def get_hamiltonian_path(assignments):
    path = [None]*n
    for i in range(n):
	for j in range(n):
  	   if assignments[i*n+j] > 0: # True
		path[i] = j+1
    return path
def reduce_Hamiltonian_Path_to_SAT_and_solve(edges):
    def index(i, j):
	return n*i + j + 1
    m = len(edges)
    n_clauses = 2*n + (2*n*n-n-m)*(n-1)
    n_vars = n*n
    clauses = []
    for j in range(n):
  	clause = []
	for i in range(n):
    for i in range(n):
	clause = []
	for j in range(n):
    for j in range(n):
 	for i in range(n):
   	    for k in range(i+1, n):
		clauses.append((-index(i,j), -index(k,j)))
    for i in range(n):
        for j in range(n):
   	    for k in range(j+1, n):
	        clauses.append((-index(i,j), -index(i,k)))
    for k in range(n-1):
        for i in range(n):
  	    for j in range(n):
		if i == j: continue
	  	    if not [i+1, j+1] in edges:
			clauses.append((-index(k,i), -index(k+1,j)))
    g = Glucose3()
    for c in clauses:
    status = g.solve()
    assignments = g.get_model()
    path = get_hamiltonian_path(assignments)

The following figure shows the output Hamiltonian Path obtained for the line input graph using the solution obtained by SAT-solver.

Image for post
Image by Author

The following figure shows the Hamiltonian Path obtained with the SAT-solver for the input Petersen’s graph, which indeed has a Hamiltonian Path.

Image for post
Image by Author

To be continued…

In medium:

Graph Algorithms with Python

In this blog we shall discuss about a few popular graph algorithms and their python implementations. The problems discussed here appeared as programming assignments in the coursera course Algorithms on Graphs and on Rosalind. The problem statements are taken from the course itself.

The basic building blocks of graph algorithms such as computing the number of connected components, checking whether there is a path between the given two vertices, checking whether there is a cycle, etc are used practically in many applications working with graphs: for example, finding shortest paths on maps, analyzing social networks, analyzing biological data.

For all the problems it can be assumed that the given input graph is simple, i.e., it does not contain self-loops (edges going from a vertex to itself) and parallel edges.

Checking if two vertices are Reachable

Given an undirected graph G=(V,E) and two distinct vertices 𝑢 and 𝑣, check if there is a path between 𝑢 and 𝑣.


  1. Starting from the node u, we can simply use breadth first search (bfs) or depth-first search (dfs) to explore the nodes reachable from u.
  2. As soon as we find v we can return the nodes are reachable from one-another.
  3. If v is not there in the nodes explored, we can conclude v is not reachable from u.
  4. The following implementation (demonstrated using the following animations) uses iterative dfs (with stack) to check if two nodes (initially colored pink) are reachable.
  5. We can optionally implement coloring of the nodes w.r.t. the following convention: initially the nodes are all white, when they are visited (pushed onto stack) they are marked as gray and finally when all the adjacent (children) nodes for a given node are visited, the node can be marked black.
  6. We can store the parent of each node in an array and we can extract the path between two given nodes using the parent array, if they are reachable.
Image for post
Image taken from this lecture notes

A very basic python implementation of the iterative dfs are shown below (here adj represents the adjacency list representation of the input graph):

def reach(adj, x, y): 
 visited = [False]*len(adj)
 stack = [x]
 visited[x] = True
 while len(stack) > 0:
  u = stack.pop(-1)
  for v in adj[u]:
   if not visited[v]:
    visited[v] = True
    if v == y:
     return 1
 return 0

The following animations demonstrate how the algorithm works, the stack is also shown at different points in time during the execution. Finally the path between the nodes are shown if they are reachable.

Image for post
Image for post

The same algorithm can be used for finding an Exit from a Maze (check whether there is a path from a given cell to a given exit).

Find Connected Components in an Undirected Graph

Given an undirected graph with 𝑛 vertices and 𝑚 edges, compute the number of connected components in it.


  1. The following simple modification in dfs can be used to find the number of connected components in an undirected graph, as shown in the following figure.
  2. From each node we need to find all the nodes yet to be explored.
  3. We can find the nodes in a given component by finding all the nodes reachable from a given node.
  4. The same iterative dfs implementation was used (demonstrated with the animations below).
  5. The nodes in a component found were colored using the same color.
Image for post
Image taken from this lecture note

Python Code

def number_of_components(adj):
 result = 0
 n = len(adj)
 visited = [False]*len(adj)
 for x in range(n):
  if not visited[x]:
   result += 1
   stack = [x]
   visited[x] = True
   while len(stack) > 0:
    u = stack.pop(-1)
    for v in adj[u]:
     if not visited[v]:
      visited[v] = True
 return result
Image for post
Image for post

The same algorithm can be used to decide that there are no dead zones in a maze, that is, that at least one exit is reachable from each cell.

Find Euler Tour / Circuit in a Directed Graph

Given A directed graph that contains an Eulerian tour, where the graph is given in the form of an adjacency list.


  1. While solving the famous Königsberg Bridge Problem, Euler proved that an Euler circuit in an undirected graph exists iff all its nodes have even degree.
  2. For an Euler tour to exist in an undirected graph, if there exists odd-degree nodes in the graph, there must be exactly 2 nodes with odd degrees — the tour will start from one such node and end in another node.
  3. The tour must visit all the edges in the graph exactly once.
  4. The degree test can be extended to directed graphs (with in-degrees and out-degrees) and can be used to determine the existence of Euler tour / circuit in a Digraph.
  5. Flury’s algorithm can be used to iteratively remove the edges (selecting the edges not burning the bridges in the graph as much as possible) from the graph and adding them to the tour.
  6. DFS can be modified to obtain the Euler tour (circuit) in a DiGraph.
  7. The following figure shows these couple of algorithms for finding the Euler tour / circuit in a graph if one exists (note in the figure path is meant to indicate tour). We shall use DFS to find Euler tour.
Image for post
Image created from Youtube Videos cited inside the image

The following code snippets represent the functions used to find Euler tour in a graph. Here, variables n and m represent the number of vertices and edges of the input DiGraph, respectively, whereas adj represents the corresponding adjacency list.

Python Code

def count_in_out_degrees(adj):
 n = len(adj)
 in_deg, out_deg = [0]*n, [0]*n
 for u in range(n):
  for v in adj[u]:
   out_deg[u] += 1
   in_deg[v] += 1
 return in_deg, out_degdef get_start_if_Euler_tour_present(in_deg, out_deg):
 start, end, tour = None, None, True
 for i in range(len(in_deg)): 
  d = out_deg[i] - in_deg[i]
  if abs(d) > 1:
   tour = False
  elif d == 1:
   start = i
  elif d == -1:
   end = i
 tour = (start != None and end != None) or \
        (start == None and end == None)
 if tour and start == None: # a circuit
  start = 0
 return (tour, start)def dfs(adj, v, out_deg, tour):
 while out_deg[v] > 0:
  out_deg[v] -= 1
  dfs(adj, adj[v][out_deg[v]], out_deg, tour)
 tour[:] = [v] + tourdef compute_Euler_tour(adj):
 n, m = len(adj), sum([len(adj[i]) for i in range(len(adj))])
 in_deg, out_deg = count_in_out_degrees(adj)
 tour_present, start = get_start_if_Euler_tour_present(in_deg, \
 if not tour_present:
  return None
 tour = []
 dfs(adj, start, out_deg, tour)
 if len(tour) == m+1:
  return tour
return None

The following animations show how the algorithm works.

Image for post
Image for post

Cycle Detection in a Directed Graph

Check whether a given directed graph with 𝑛 vertices and 𝑚 edges contains a cycle.

The following figure shows the classification of the edges encountered in DFS:

Image for post
Image taken from this lecture notes

It can be shown that whether a Directed Graph is acyclic (DAG) or not (i.e. it contains a cycle), can be checked using the presence of a back-edge while DFS traversal.


  1. Use the recursive DFS implementation (pseudo-code shown in the below figure)
  2. Track if a node to be visited is already on the stack, if it’s there, it forms a back edge.
  3. Use parents array to obtain the directed cycle, if found.
Image for post
Image taken from this lecture notes

Python code

def cyclic(adj):
 n = len(adj)
 visited = [False]*n
 parents = [None]*n
 on_stack = [False]*n
 cycle = []
 def dfs_visit(adj, u, cycle):
  visited[u] = True
  on_stack[u] = True
  for v in adj[u]:
   if not visited[v]:
    parents[v] = u
    dfs_visit(adj, v, cycle)
   elif on_stack[v]:
    x = u
    while x != v:
     x = parents[x]
    cycle = [v] + cycle
  on_stack[u] = False
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v, cycle)
 return int(len(cycle) > 0)
Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

The above algorithm can be used to check consistency in a curriculum (e.g., there is no cyclic dependency in the prerequisites for each course listed).

Topologically Order a Directed Graph

Compute a topological ordering of a given directed acyclic graph (DAG) with 𝑛 vertices and 𝑚 edges.

The following idea can be used to obtain such an ordering in a DiGraph G:

  • Find sink.
  • Put at end of order.
  • Remove from graph.
  • Repeat.

It can be implemented efficiently with dfs by keeping track of the time of pre and post visiting the nodes.


  1. Use the recursive implementation of dfs.
  2. When visiting a node with a recursive call, record the pre and post visit times, at the beginning and the end (once all the children of the given node already visited) of the recursive call, respectively.
  3. Reorder the nodes by their post-visit times in descending order, as shown in the following figure.
Image for post
Image taken from this lecture notes

Python code

def toposort(adj):
 order = []
 n = len(adj)
 visited = [False]*n
 previsit = [0]*n
 postvisit = [0]*n
 def dfs_visit(adj, u):
  global clock
  visited[u] = True
  previsit[u] = clock
  clock += 1
  for v in adj[u]:
   if not visited[v]:
    dfs_visit(adj, v)
  postvisit[u] = clock
  clock += 1
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v)
 order = [x for _, x in sorted(zip(postvisit, range(n)), \
                   key=lambda pair: pair[0], reverse=True)]
 return order
Image for post
Image for post

The above algorithm can be used to determine the order of the courses in a curriculum, taking care of the pre-requisites dependencies.

KosaRaju’s Algorithm to find the Strongly Connected Components (SCCs) in a Digraph

Compute the number of strongly connected components of a given directed graph with 𝑛 vertices and 𝑚 edges.

Note the following:

  • dfs can be used to find SCCs in a Digraph.
  • We need to make it sure that dfs traversal does not leave a such component with an outgoing edge.
  • The sink component is one that does not have an outgoing edge. We need to find a sink component first.
  • The vertex with the largest post-visit time is a source component for dfs.
  • The reverse (or transpose) graph of G has same SCC as the original graph.
  • Source components of the transpose of G are sink components of G.

With all the above information, the following algorithm can be implemented to obtain the SCCs in a digraph G.

Image for post
Image taken from this lecture notes

The runtime of the algorithm is again O(|V |+|E|). Alternatively, the algorithm can be represented as follows:

Image for post
Taken from these lecture notes

Python Code

def number_of_strongly_connected_components(adj):
 result = 0
 visited = [False]*n
 previsit = [0]*n
 postvisit = [0]*n
 def reverse_graph(adj):
  n = len(adj)
  new_adj = [ [] for _ in range(n)]
  for i in range(n):
   for j in adj[i]:
  return new_adj
 def dfs_visit(adj, u):
  global clock
  visited[u] = True
  previsit[u] = clock
  clock += 1
  for v in adj[u]:
   if not visited[v]:
    dfs_visit(adj, v)
  postvisit[u] = clock
  clock += 1
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v)
 post_v = [x for _, x in sorted(zip(postvisit, range(n)), \
                   key=lambda pair: pair[0], reverse=True)]
 rev_adj = reverse_graph(adj)
 visited = [False]*n
 for v in post_v:
  if not visited[v]:
   dfs_visit(rev_adj, v)
   result += 1return result

The following animations show how the algorithm works:

Image for post
Image for post
Image for post

Shortest Path in an Undirected Graph with BFS

Given an undirected graph with 𝑛 vertices and 𝑚 edges and two vertices 𝑢 and 𝑣, compute the length of a shortest path between 𝑢 and 𝑣 (that is, the minimum number of edges in a path from 𝑢 to 𝑣).

  • The following figure shows the algorithm for bfs.
  • It uses a queue (FIFO) instead of a stack (LIFO) to store the nodes to be explored.
  • The traversal is also called level-order traversal, since the nodes reachable with smaller number of hops (shorter distances) from the source / start node are visited earlier than the higher-distance nodes.
  • The running time of bfs is again O(|E| + |V |).
Image for post
Image taken from this lecture notes

Python code

def distance(adj, s, t):
 inf = 10**9 #float('Inf')
 d = [inf]*len(adj)
 queue = [s]
 d[s] = 0
 while len(queue) > 0:
  u = queue.pop(0)
  for v in adj[u]:
   if d[v] ==  inf:
    d[v] = d[u] + 1
    if v == t:
     return d[t]
 return -1

The following animations demonstrate how the algorithm works. The queue used to store the vertices to be traversed is also shown.

Image for post
Image for post
Image for post
Image for post

The above algorithm can be used to compute the minimum number of flight segments to get from one city to another one.

Checking if a Graph is Bipartite (2-Colorable)

Given an undirected graph with 𝑛 vertices and 𝑚 edges, check whether it is bipartite.


  1. Note that a graph bipartite iff its vertices can be colored using 2 colors (so that no adjacent vertices have the same color).
  2. Use bfs to traverse the graph from a starting node and color nodes in the alternate levels (measured by distances from source) with red (even level) and blue (odd level).
  3. If at any point in time two adjacent vertices are found that are colored using the same color, then the graph is not 2-colorable (hence not bipartite).
  4. If no such cases occur, then the graph is bipartite
  5. For a graph with multiple components, use bfs on each of them.
  6. Also, note that a graph is bipartite iff it contains an odd length cycle.

Python code

def bipartite(adj):
 color = [None]*len(adj)
 for vertex in range(len(adj)):
  if not color[vertex]:
   queue = [vertex]
   color[vertex] = 'red'
   while len(queue) > 0:
    u = queue.pop(0)
    for v in adj[u]:
     if color[v] ==  color[u]:
      return 0
     if not color[v]:
      color[v] = 'red' if color[u] == 'blue' else 'blue'
 return 1

The following animations demonstrate how the algorithm works.

Image for post
Image for post
Image for post
Image for post

Notice that the last graph contained a length of cycle 4 (even length cycle) and it was bipartite graph. The graph prior to this one contained a triangle in it (cycle of odd length 3), it was not bipartite.

Shortest Path in a Weighted Graph with Dijkstra

Given an directed graph with positive edge weights and with 𝑛 vertices and 𝑚 edges as well as two vertices 𝑢 and 𝑣, compute the weight of a shortest path between 𝑢 and 𝑣 (that is, the minimum total weight of a path from 𝑢 to 𝑣).

  • Optimal substructure property : any subpath of an optimal path is also optimal.
  • Initially, we only know the distance to source node S and relax all the edges from S.
  • Maintain a set R of vertices for which dist is already set correctly (known region).
  • On each iteration take a vertex outside of R with the minimum dist-value, add it to R, and relax all its outgoing edges.
  • The next figure shows the pseudocode of the algorithm.
  • The algorithm works for any graph with non-negative edge weights.
  • The algorithm does not work if the input graph has negative edge weights (known region has the assumption that the dist can be reduced further, this assumption does not hold if the graph has negative edges).
Image for post
Image created from this lecture notes

Python code

import queue

def distance(adj, cost, s, t):
 inf = 10**19
 n = len(adj)
 d = [inf]*n
 d[s] = 0
 visited = [0]*n
 h = queue.PriorityQueue()
 for v in range(n):
  h.put((d[v], v))
 while not h.empty():
  u = h.get()[1]
  if visited[u]: continue
  visited[u] = True
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    h.put((d[v], v))
 return d[t] if d[t] != inf else -1

The following animations show how the algorithm works.

Image for post
Image for post

The algorithm works for the undirected graphs as well. The following animations show how the shortest path can be found on undirected graphs.

Image for post
Image for post
Image for post
Image for post

Detecting Negative Cycle in a Directed Graph with Bellman-Ford

Given an directed graph with possibly negative edge weights and with 𝑛 vertices and 𝑚 edges, check whether it contains a cycle of negative weight. Also, given a vertex 𝑠, compute the length of shortest paths from 𝑠 to all other vertices of the graph.

  • For a Digraph with n nodes (without a negative cycle), the shortest path length in between two nodes (e.g., the source node and any other node) can be at most n-1.
  • Relax edges while dist changes (at most n-1 times, most of the times the distances will stop changing much before that).
  • This algorithm works even for negative edge weights.
  • The following figure shows the pseudocode of the algorithm.
Image for post
Image created from this lecture notes

The above algorithm can also be used to detect a negative cycle in the input graph.

  • Run n = |V | iterations of Bellman–Ford algorithm (i.e., just run the edge-relaxation for once more, for the n-th time).
  • If there exists a vertex, the distance of which still decreases, it implies that there exist a negative-weight cycle in the graph and the vertex is reachable from that cycle.
  • Save node v relaxed on the last iteration v is reachable from a negative cycle
  • Start from x ← v, follow the link x ← prev[x] for |V | times — will be definitely on the cycle
  • Save y ← x and go x ← prev[x] until x = y again

The above algorithm can be used to detect infinite arbitrage, with the following steps:

  • Do |V | iterations of Bellman–Ford, save all nodes relaxed on V-th iteration-set A
  • Put all nodes from A in queue Q
  • Do breadth-first search with queue Q and find all nodes reachable from A
  • All those nodes and only those can have infinite arbitrage

Python code

def negative_cycle(adj, cost):
 inf = 10**19 # float('Inf')
 n = len(adj)
 d = [inf]*n
 d[0] = 0
 for k in range(n-1):
  for u in range(n):
   for i in range(len(adj[u])):
    v = adj[u][i]
    if d[v] > d[u] + cost[u][i]:
     d[v] = d[u] + cost[u][i]
 for u in range(n):
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    return 1
 return 0

The following animations demonstrate how the algorithm works.

Image for post
Image for post
Image for post
Image for post
Image for post
Image for post

The last animation shows how the shortest paths can be computed with the algorithm in the presence of negative edge weights.

Find Minimum Spanning Tree in a Graph with Prim’s Greedy Algorithm

Given 𝑛 points on a plane, connect them with segments of minimum total length such that there is a path between any two points.


  • Let’s construct the following undirected graph G=(V, E): each node (in V) in the graph corresponds to a point in the plane and the weight of an edge (in E) in between any two nodes corresponds to the distance between the nodes.
  • The length of a segment with endpoints (𝑥1, 𝑦1) and (𝑥2, 𝑦2) is equal to the Euclidean distance √︀((𝑥1 − 𝑥2)² + (𝑦1 − 𝑦2)²).
  • Then the problem boils down to finding a Minimum Spanning Tree in the graph G.
  • The following figure defines the MST problem and two popular greedy algorithms Prim and Kruskal to solve the problem efficiently.
Image for post
Image created from this lecture notes

We shall use Prim’s algorithm to find the MST in the graph. Here are the algorithm steps:

  • It starts with an initial vertex and is grown to a spanning tree X.
  • X is always a subtree, grows by one edge at each iteration
  • Add a lightest edge between a vertex of the tree and a vertex not in the tree
  • Very similar to Dijkstra’s algorithm
  • The pseudocode is shown in the following figure.
Image for post
Image taken from this lecture notes

Python code

def minimum_distance(x, y):
 result = 0.
 inf = 10**19
 n = len(x)
 adj = [[0 for _ in range(n)] for _ in range(n)]
 for i in range(n):
  for j in range(i+1, n):
   adj[i][j] = adj[j][i] = math.sqrt((x[i]-x[j])**2 + \
 c = [inf]*n
 s = 0
 c[s] = 0
 visited = [False]*n
 parent = [None]*n
 h = queue.PriorityQueue()
 for v in range(n):
  h.put((c[v], v))
 while not h.empty():
  w, u = h.get()
  if visited[u]: continue
  visited[u] = True
  for v in range(n):
   if v == u: continue
   if (not visited[v]) and (c[v] > adj[u][v]):
    c[v] = adj[u][v]
    parent[v] = u
    h.put((c[v], v))
 spanning_tree = []
 for i in range(n):
  spanning_tree.append((i, parent[i]))
  if parent[i] != None: 
   result += adj[i][parent[i]]
 return result

The following animations demonstrate how the algorithm works for a few different set of input points in 2-D plane.

Image for post

The following two animations show the algorithm steps on the input points and the corresponding graph formed, respectively.

Image for post
Image for post

The following animation shows how Prim’s algorithm outputs an MST for 36 input points in a 6×6 grid.

Image for post

As expected, the cost of the MCST formed is 35 for the 36 points, since a spanning tree with n nodes has n-1 edges, each with unit length.

The above algorithm can be used to build roads between some pairs of the given cities such that there is a path between any two cities and the total length of the roads is minimized.

Finding Minimum Spanning Tree and Hierarchical Clustering with Kruskal

Given 𝑛 points on a plane and an integer 𝑘, compute the largest possible value of 𝑑 such that the given points can be partitioned into 𝑘 non-empty subsets in such a way that the distance between any two points from different subsets is at least 𝑑.


  • We shall use Kruskal’s algorithm to solve the above problem. Each point can be thought of a node in the graph, as earlier, with the edge weights between the nodes being equal to the Euclidean distance between them.
  • Start with n components, each node representing one.
  • Run iterations of the Kruskal’s algorithm and merge the components till there are exactly k (< n) components left.
  • These k components will be the k desired clusters of the points and we can compute d to be the largest distance in between two points belonging to different clusters.

Here are the steps for the Kruskal’s algorithm to compute MST:

  • Start with each node in the graph as a single-node tree in a forest X.
  • Repeatedly add to X the next lightest edge e that doesn’t produce a cycle.
  • At any point of time, the set X is a forest (i.e., a collection of trees)
  • The next edge e connects two different trees — say, T1 and T2
  • The edge e is the lightest between T1 and V / T1, hence adding e is safe
Image for post
Image taken from this lecture notes
  • Use disjoint sets data structure for implementation
  • Initially, each vertex lies in a separate set
  • Each set is the set of vertices of a connected component
  • To check whether the current edge {u, v} produces a cycle, we check whether u and v belong to the same set (find).
  • Merge two sets using union operation.
  • The following figure shows a few algorithms to implement union-find abstractions for disjoint sets.
Image for post
Image created from this lecture notes

Python code

def clustering(x, y, k):
 result = 0.
 inf = float('Inf') #10**19
 n = len(x)
 adj = []
 for i in range(n):
  for j in range(i+1, n):
   adj.append((math.sqrt((x[i]-x[j])**2 + (y[i]-y[j])**2), i, j))
 adj = sorted(adj)
 indices = {i:i for i in range(n)}
 while len(set(indices.values())) > k:
  d, u, v = adj.pop(0)
  iu, iv = indices[u], indices[v]
  # implemented quick-find here
  # To-do: implement weighted union with path-compression heuristic      
  if iu != iv:
   indices[u] = indices[v] = min(iu, iv) 
   for j in range(n):
    if indices[j] == max(iu, iv):
      indices[j] = min(iu, iv) 
 clusters = {}
 for i in range(n):
  ci = indices[i]
  clusters[ci] = clusters.get(ci, []) + [i]
 d = inf
 for i in list(clusters.keys()):
  for j in list(clusters.keys()):
   if i == j: continue
   for vi in clusters[i]:
    for vj in clusters[j]:
     d = min(d, math.sqrt((x[vi]-x[vj])**2 + (y[vi]-y[vj])**2))
 return d

The following animations demonstrate how the algorithm works. Next couple of animations show how 8 points in a plane are (hierarchically) clustered using Kruskal’s algorithm and finally the MST is computed. The first animation shows how the points are clustered and the next one shows how Kruskal works on the corresponding graph created from the points.

Image for post
Image for post

The next animation again shows how a set of points in 2-D are clustered using Kruskal and MST is computed.

Image for post

Clustering is a fundamental problem in data mining. The goal is to partition a given set of objects into subsets (or clusters) in such a way that any two objects from the same subset are close (or similar) to each other, while any two objects from different subsets are far apart.

Now we shall use Kruskal’s algorithm to cluster a small real-world dataset named the Iris dataset (can be downloaded from the UCI Machine learning repository), often used as a test dataset in Machine learning.

Next 3 animations show how Kruskal can be used to cluster Iris dataset . The first few rows of the dataset is shown below:

Image for post

Let’s use the first two features (SepalLength and SepalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post

Now, let’s use the second and third feature variables (SepalWidth and PetalLength), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post

Finally, let’s use the third and fourth feature variables (PetalLength and PetalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post

Friend Suggestion in Social Networks — Bidirectional Dijkstra

Compute the distance between several pairs of nodes in the network.


  • Build reverse graph GR
  • Start Dijkstra from s in G and from t in GR
  • Alternate between Dijkstra steps in G and in GR
  • Stop when some vertex v is processed both in G and in GR
  • Compute the shortest path between s and t


  • More general idea, not just for graphs
  • Instead of searching for all possible objects, search for first halves and for second halves separately
  • Then find compatible halves
  • Typically roughly O(√N) instead of O(N)

Friends suggestions in social networks

  • Find the shortest path from Michael to Bob via friends connections
  • For the two “farthest” people, Dijkstra has to look through 2 billion people
  • If we only consider friends of friends of friends for both Michael and Bob, we will find a connection
  • Roughly 1M friends of friends of friends
  • 1M + 1M = 2M people – 1000 times less
  • Dijkstra goes in circles
  • Bidirectional search idea can reduce the search space
  • Bidirectional Dijkstra can be 1000s times faster than Dijkstra for social networks
  • The following figure shows the algorithm
Image for post
Created from this lecture note

The following python function implements the bidirectional Dijkstra, the arguments adj and adjR represents the adjacency lists for the original and the reverse graph, respectively.

Python code

def distance(adj, cost, adjR, costR, s, t):
 def process(u, adj, cost, h, d, prev, visited):
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    h.put((d[v], v))
    prev[v] = u
  visited[u] = True
 def shortest_path(s, dist, prev, t, distR, prevR, visited_any):
  distance = inf
  ubest = None
  for u in visited_any:
   if dist[u] + distR[u] < distance:
    ubest = u
    distance = dist[u] + distR[u]
  return distance if ubest != None else -1
 inf = 10**19
 n = len(adj)
 d, dR = [inf]*n, [inf]*n
 d[s] = 0
 dR[t] = 0
 visited, visitedR = [False]*n, [False]*n
 visited_any = set([])
 prev, prevR = [None]*n, [None]*n
 h = queue.PriorityQueue()
 h.put((d[s], s))
 hR = queue.PriorityQueue()
 hr.put((dr[t], t)) while True:
  u = h.get()[1]
  if visited[u]: continue
  process(u, adj, cost, h, d, prev, visited)
  if visitedR[u]:
   return shortest_path(s, d, prev, t, dr, prevR, visited_any)
  uR = hR.get()[1]
  if visitedR[uR]: continue
  process(ur, adjR, costR, hR, dR, prevR, visitedR)
  if visited[uR]:
   return shortest_path(s, d, prev, t, dR, prevR, visited_any)
  if h.empty() or hr.empty():
   return -1

The following animation shows how the algorithm works:

Image for post
Image for post
Image for post

Computing Distance Faster Using Coordinates with A* search

Compute the distance between several pairs of nodes in the network. The length l between any two nodes u an v is guaranteed to satisfy 𝑙 ≥ √︀((𝑥(𝑢) − 𝑥(𝑣))2 + (𝑦(𝑢) − 𝑦(𝑣))2).

  • Let’s say we are submitting a query to find a shortest path between nodes (s,t) to a graph.
  • A potential function 𝜋(v) for a node v in a is an estimation of d(v, t) — how far is it from here to t?
  • If we have such estimation, we can often avoid going wrong direction directed search
  • Take some potential function 𝜋 and run Dijkstra algorithm with edge weights ℓ𝜋
  • For any edge (u, v), the new length ℓ𝜋(u, v) must be non-negative, such 𝜋 is called feasible
  • Typically 𝜋(v) is a lower bound on d(v, t)
  • A* is a directed search algorithm based on Dijkstra and potential functions
  • Run Dijkstra with the potential 𝜋 to find the shortest path — the resulting algorithm is A* search, described in the following figure
  • On a real map a path from v to t cannot be shorter than the straight line segment from v to t
  • For each v, compute 𝜋(v) = dE (v, t)
  • If 𝜋(v) gives lower bound on d(v, t)
  • Worst case: 𝜋(v) = 0 for all v the same as Dijkstra
  • Best case: 𝜋(v) = d(v, t) for all v then ℓ𝜋(u, v) = 0 iff (u, v) is on a shortest path to t, so search visits only the edges of shortest s − t paths
  • It can be shown that the tighter are the lower bounds the fewer vertices will be scanned
Image for post

The following animations show how the A* search algorithm works on road networks (and the corresponding graph representations). Each node in the graph has the corresponding (distance, potential) pairs.

Image for post
Image for post

The following animations show how the shortest path is computed between two nodes for the USA road network for the San Francisco Bay area (the corresponding graph containing 321270 nodes and 800172 edges), using Dijkstra, bidirectional Dijkstra and A* algorithms respectively, the data can be downloaded from 9th DIMACS Implementation Challenge — Shortest Paths. Here, with bidirectional Dijkstra, the nodes from forward and reverse priority queues are popped in the same iteration.

Image for post
Image for post
Image for post

In medium: