# Probabilistic Deep Learning with TensorFlow

In this blog, we shall discuss how to implement probabilistic deep learning models using TensorFlow. The problems discussed in this blog appeared in the exercises / projects of the Coursera course “Probabilistic Deep Learning” by Imperial College London, part of the TensorFlow 2 for Deep Learning Specialization. The problem statements / descriptions are taken from the course itself.

# Naive Bayes and logistic regression with Tensorflow Probability

In this problem, we shall develop a Naive Bayes classifier for the Iris dataset using Distribution objects from TensorFlow Probability. We shall also explore the connection between the Naive Bayes classifier and logistic regression.

#### The Iris dataset

In this problem, we shall use the Iris dataset. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For reference, see the following paper:

• R. A. Fisher. “The use of multiple measurements in taxonomic problems”. Annals of Eugenics. 7 (2): 179–188, 1936.

Our goal will be to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features. Under certain assumptions about this classifier model, we shall explore the relation to logistic regression.

The following figures show the 3 categories of flowers from the Iris dataset, namely setosa, versicolor and virginica, respectively.

Let’s start by importing the required libraries.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
%matplotlib inline

#### Load and prepare the data

We will first read in the Iris dataset, and split the dataset into training and test sets, using the following code snippet.

iris = datasets.load_iris()
# Use only the first two features: sepal length and width

data = iris.data[:, :2]
targets = iris.target

x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)

labels = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}
label_colours = ['blue', 'orange', 'green']

def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()

plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()


### Naive Bayes classifier

We will briefly review the Naive Bayes classifier model. The fundamental equation for this classifier is Bayes’ rule:

P(Y = yk | X1, …, Xd) = P(X1, …, Xd | Y = yk) P(Y = yk) / ∑k′ P(X1, …, Xd | Y = yk′) P(Y = yk′)

In the above, d is the number of features or dimensions in the inputs X (in our case d=2), and K is the number of classes (in our case K=3). The distribution P(Y) is the class prior distribution, which is a discrete distribution over K classes. The distribution P(X|Y) is the class-conditional distribution over inputs.

The Naive Bayes classifier makes the assumption that the data features Xi are conditionally independent given the class Y (the ‘naive’ assumption). In this case, the class-conditional distribution decomposes as a product of univariate distributions:

P(X1, …, Xd | Y = yk) = ∏i P(Xi | Y = yk)

This simplifying assumption means that we typically need to estimate far fewer parameters for the distributions P(Xi|Y=yk) than for the full joint distribution P(X|Y=yk).
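As a quick illustration of the naive factorisation, here is a plain-numpy sketch with hypothetical per-feature Gaussian parameters (the values are made up for the example):

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    # Univariate Gaussian density, evaluated elementwise
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

# Hypothetical per-feature parameters for one class y_k
mus = np.array([5.0, 3.4])       # mu_ik for features i = 0, 1
sigmas = np.array([0.35, 0.38])  # sigma_ik for features i = 0, 1
x = np.array([5.1, 3.5])         # a data point with d = 2 features

# Naive assumption: P(X | Y=y_k) = prod_i P(X_i | Y=y_k)
joint = np.prod(gauss_pdf(x, mus, sigmas))
```

With the naive assumption we only fit d univariate densities per class instead of one d-dimensional joint density.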

Once the class prior distribution and class-conditional densities are estimated, the Naive Bayes classifier model can then make a class prediction ŷ for a new data input x̃ according to

ŷ = argmaxk P(Y = yk | x̃1, …, x̃d)

#### Define the class prior distribution

We will begin by defining the class prior distribution. To do this we will simply take the maximum likelihood estimate, given by

P̂(Y = yk) = (1/N) ∑n δ(y(n) = yk) = Nk / N

where the superscript (n) indicates the n-th dataset example, N is the total number of examples in the dataset, and Nk is the number of examples in class k. The above is simply the proportion of data examples belonging to class k.
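The estimate above is just the class frequencies; a quick numpy sketch with a hypothetical label array:

```python
import numpy as np

y = np.array([0, 0, 1, 2, 2, 2])  # hypothetical labels, K = 3 classes
K = len(np.unique(y))

# Maximum likelihood prior: the proportion of examples in each class
prior_probs = np.array([np.mean(y == k) for k in range(K)])
```

The resulting vector has length K and sums to one, as a valid prior distribution must.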

Let’s now define a function that builds the prior distribution from the training data, and returns it as a Categorical Distribution object.

• The input to the function, y, will be a numpy array of shape (num_samples,)
• The entries in y will be integer labels k=0,1,…,K−1
• The function should build and return the prior distribution as a Categorical distribution object
• The probabilities for this distribution will be a length-K vector, with entries corresponding to P(Y=yk) for k=0,1,…,K−1
• The function should work for any value of K≥1
• This Distribution will have an empty batch shape and empty event shape

def get_prior(y):
    K = len(np.unique(y))
    return tfd.Categorical(probs=[np.mean(y == k) for k in range(K)])

prior = get_prior(y_train)

# Plot the prior distribution

labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()

#### Define the class-conditional densities

Let’s now turn to the definition of the class-conditional distributions P(Xi|Y=yk) for i=0,1 and k=0,1,2. In our model, we will assume these distributions to be univariate Gaussian:

P(Xi | Y = yk) = N(Xi ; μik, σik)

with mean parameters μik and standard deviation parameters σik, twelve parameters in all. We will again estimate these parameters using maximum likelihood. In this case, the estimates are given by

μ̂ik = (1/Nk) ∑n: y(n)=yk xi(n),    σ̂ik² = (1/Nk) ∑n: y(n)=yk (xi(n) − μ̂ik)²

Note that the above are just the means and variances of the sample data points for each class.
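These per-class statistics can be sanity-checked in plain numpy; a sketch with a hypothetical handful of samples from a single class:

```python
import numpy as np

# Hypothetical samples of the two features for one class k
x_k = np.array([[4.9, 3.4],
                [5.1, 3.6],
                [5.0, 3.5]])

mu_k = x_k.mean(axis=0)                              # per-feature ML mean
sigma_k = np.sqrt(((x_k - mu_k) ** 2).mean(axis=0))  # per-feature ML std (1/N, not 1/(N-1))
```

Note that maximum likelihood gives the biased (1/N) variance estimate, not the usual 1/(N−1) sample variance.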

Let’s now implement a function that computes the class-conditional Gaussian densities, using the maximum likelihood parameter estimates given above, and returns them in a single, batched MultivariateNormalDiag Distribution object.

• The inputs to the function are
• a numpy array x of shape (num_samples, num_features) for the data inputs
• a numpy array y of shape (num_samples,) for the target labels
• The function should work for any number of classes K≥1 and any number of features d≥1

def get_class_conditionals(x, y):
    K = len(np.unique(y))
    mu = [np.mean(x[y == k], axis=0) for k in range(K)]
    sigma = [np.sqrt(np.mean((x[y == k] - mu[k])**2, axis=0)) for k in range(K)]
    return tfd.MultivariateNormalDiag(loc=mu, scale_diag=sigma)

class_conditionals = get_class_conditionals(x_train, y_train)


We can visualise the class-conditional densities with contour plots by running the cell below. Notice how the contours of each distribution correspond to a Gaussian distribution with diagonal covariance matrix, since the model assumes that each feature is independent given the class.

def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()

#### Make predictions from the model

Now that the prior and class-conditional distributions are defined, we can use them to compute the model’s class probability predictions for an unknown test input x̃, according to

P(Y = yk | x̃1, …, x̃d) ∝ P(x̃1, …, x̃d | Y = yk) P(Y = yk)

The class prediction can then be taken as the class with the maximum probability:

ŷ = argmaxk P(Y = yk | x̃1, …, x̃d)
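Numerically, this prediction amounts to multiplying class-conditional densities by the prior probabilities and taking the argmax; a small numpy sketch with made-up values:

```python
import numpy as np

prior = np.array([0.3, 0.4, 0.3])              # hypothetical P(Y = y_k)
cond_densities = np.array([0.05, 0.80, 0.10])  # hypothetical P(x | Y = y_k)

unnormalised = cond_densities * prior
posterior = unnormalised / unnormalised.sum()  # P(Y = y_k | x)
prediction = np.argmax(posterior)
```

Since the normalising constant is the same for every class, the argmax can equally be taken over the unnormalised products.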

Let’s now implement a function to return the model’s class probabilities for a given batch of test inputs of shape (batch_shape, 2), where the batch_shape has rank at least one.

• The inputs to the function are the prior and class_conditionals distributions, and the inputs x
• The function should use these distributions to compute the probabilities for each class k as above
• As before, the function should work for any number of classes K≥1
• It should then compute the prediction by taking the class with the highest probability
• The predictions should be returned in a numpy array of shape (batch_shape)
def predict_class(prior, class_conditionals, x):
    x = x[:, np.newaxis, :]
    return tf.argmax(tf.cast(class_conditionals.prob(x), tf.float32)
                     * tf.cast(prior.probs, tf.float32), axis=1).numpy()

predictions = predict_class(prior, class_conditionals, x_test)

# Evaluate the model accuracy on the test set
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
# Test accuracy: 0.8000

# Plot the model's decision regions

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max),
             lambda x: predict_class(prior, class_conditionals, x),
             3, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
             num_points=500)
plt.title("Training set with decision regions")
plt.show()

### Binary classifier

We will now draw a connection between the Naive Bayes classifier and logistic regression.

First, we will update our model to be a binary classifier. In particular, the model will output the probability that a given input data sample belongs to the ‘Iris-Setosa’ class: P(Y=y0|X1,…,Xd). The remaining two classes will be pooled together with the label y1.

# Redefine the dataset to have binary labels

y_train_binary = np.array(y_train)
y_train_binary[np.where(y_train_binary == 2)] = 1

y_test_binary = np.array(y_test)
y_test_binary[np.where(y_test_binary == 2)] = 1

# Plot the training data

labels_binary = {0: 'Iris-Setosa', 1: 'Iris-Versicolour / Iris-Virginica'}
label_colours_binary = ['blue', 'red']

plt.figure(figsize=(8, 5))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
plt.show()

We will also make an extra modelling assumption that for each class k, the class-conditional distribution P(Xi|Y=yk) for each feature i=0,1 has standard deviation σi, which is the same for every class k.

This means there are now six parameters in total: four for the means μik and two for the standard deviations σi (i,k=0,1).

We will again use maximum likelihood to estimate these parameters. The prior distribution will be as before, with the class prior probabilities again given by the class proportions P̂(Y = yk) = Nk / N.

We will use our previous function get_prior to redefine the prior distribution.

prior_binary = get_prior(y_train_binary)
# Plot the prior distribution
plt.bar([0, 1], prior_binary.probs.numpy()[:2], color=label_colours_binary)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1], labels_binary)
plt.show()



For the class-conditional densities, the maximum likelihood estimates for the means are again given by

μ̂ik = (1/Nk) ∑n: y(n)=yk xi(n)

However, the estimate for the shared standard deviations σi is different. There is also a closed-form solution for the shared standard deviations, but we will instead learn these from the data.

Let’s now implement a function that takes the training inputs and target labels as input, as well as an optimizer object, number of epochs and a TensorFlow Variable. This function should be written according to the following spec:

• The inputs to the function are:
• a numpy array x of shape (num_samples, num_features) for the data inputs
• a numpy array y of shape (num_samples,) for the target labels
• a tf.Variable object scales of length 2 for the standard deviations σi
• optimiser: an optimiser object
• epochs: the number of epochs to run the training for
• The function should first compute the means μik of the class-conditional Gaussians according to the above equation
• Then create a batched multivariate Gaussian distribution object using MultivariateNormalDiag with the means set to μik and the scales set to scales
• Run a custom training loop for epochs number of epochs, in which:
• the average per-example negative log likelihood for the whole dataset is computed as the loss
• the gradient of the loss with respect to the scales variables is computed
• the scales variables are updated by the optimiser object
• At each iteration, save the values of the scales variable and the loss
• The function should return a tuple of three objects:
• a numpy array of shape (epochs,) of loss values
• a numpy array of shape (epochs, 2) of values for the scales variable at each iteration
• the final learned batched MultivariateNormalDiag distribution object
def learn_stdevs(x, y, scales, optimiser, epochs):
    K = len(np.unique(y))
    # Maximum likelihood estimates of the class-conditional means
    loc = np.array([np.mean(x[y == k], axis=0) for k in range(K)], dtype=np.float32)
    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scales)  # batch (2), event (2)
    x = np.expand_dims(x, 1).astype('float32')

    def nll(x, y, distribution):
        log_probs = -distribution.log_prob(x)  # shape (num_samples, 2)
        return (tf.reduce_sum(tf.boolean_mask(log_probs[:, 0], y == 0))
                + tf.reduce_sum(tf.boolean_mask(log_probs[:, 1], y == 1)))

    @tf.function
    def get_loss_and_grads(x, y, distribution):
        with tf.GradientTape() as tape:
            tape.watch(distribution.trainable_variables)
            loss = nll(x, y, distribution)
        grads = tape.gradient(loss, distribution.trainable_variables)
        return loss, grads

    train_loss_results = []
    train_scale_results = []

    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(x, y, distribution)
        optimiser.apply_gradients(zip(grads, distribution.trainable_variables))
        train_loss_results.append(loss.numpy())
        train_scale_results.append(distribution.parameters['scale_diag'].numpy())
        if epoch % 10 == 0:
            print(f'epoch: {epoch}, loss: {loss}')

    return np.array(train_loss_results), np.array(train_scale_results), distribution

scales = tf.Variable([1., 1.])
opt = tf.keras.optimizers.Adam(learning_rate=0.05)  # optimiser for the scales
epochs = 500
# run the training loop
nlls, scales_arr, class_conditionals_binary = learn_stdevs(x_train, y_train_binary, scales, opt, epochs)

epoch: 0, loss: 246.33450317382812
epoch: 10, loss: 227.07168579101562
epoch: 20, loss: 207.1158905029297
epoch: 30, loss: 187.12120056152344
epoch: 40, loss: 168.60015869140625
epoch: 50, loss: 153.5633087158203
epoch: 60, loss: 143.8475341796875
epoch: 70, loss: 142.80393981933594
epoch: 80, loss: 142.56259155273438
epoch: 90, loss: 142.23074340820312
...
epoch: 490, loss: 142.1887664794922
print("Class conditional means:")
print(class_conditionals_binary.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals_binary.stddev().numpy())
Class conditional means:
[[4.9692307 3.3820512]
[6.2172837 2.8814814]]

Class conditional standard deviations:
[[0.5590086  0.34253535]
[0.5590086  0.34253535]]
# Plot the loss and convergence of the standard deviation parameters

fig, axes = plt.subplots(1, 2, figsize=(14, 5))
axes[0].plot(nlls)
axes[0].set_title("Loss vs epoch")
axes[0].set_xlabel("Epoch")
axes[0].set_ylabel("Average negative log-likelihood")
for k in [0, 1]:
    axes[1].plot(scales_arr[:, k], color=label_colours_binary[k], label=labels_binary[k])
axes[1].set_title("Standard deviation ML estimates vs epoch")
axes[1].set_xlabel("Epoch")
axes[1].set_ylabel("Standard deviation")
plt.legend()
plt.show()

We can also plot the contours of the class-conditional Gaussian distributions as before, this time with just binary labelled data. Notice the contours are the same for each class, just with a different centre location.

We can also plot the decision regions for this binary classifier model; notice that the decision boundary is now linear.

The following animation shows how the standard deviation parameters for the class-conditional distributions are learned with TensorFlow, using the above code snippet.

In fact, we can see that our predictive distribution P(Y=y0|X) can be written as follows:

P(Y = y0 | X) = σ(a), where a = log [ P(X | Y = y0) P(Y = y0) ] − log [ P(X | Y = y1) P(Y = y1) ]

and σ is the sigmoid function. With our additional modelling assumption of a shared covariance matrix Σ, it can be shown (using the Gaussian pdf) that a is in fact a linear function of X:

a = wᵀX + w0, with w = Σ⁻¹(μ0 − μ1) and w0 = −½ μ0ᵀΣ⁻¹μ0 + ½ μ1ᵀΣ⁻¹μ1 + log [ P(Y = y0) / P(Y = y1) ]

The model therefore takes the form P(Y=y0|X)=σ(wTX+w0), with weights w∈R2 and bias w0∈R. This is the form used by logistic regression, and explains why the decision boundary above is linear.
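The functional form P(Y=y0|X)=σ(wᵀX+w0) is easy to sketch in plain numpy; the weights, bias and input below are hypothetical, chosen only to show the computation:

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Hypothetical parameters and input, just to show the functional form
w = np.array([-2.0, 3.0])
w0 = 1.0
x = np.array([5.0, 3.5])

p_class0 = sigmoid(w @ x + w0)  # P(Y = y0 | X)
```

The output is always a valid probability in (0, 1), and the decision boundary P(Y=y0|X)=0.5 is the line wᵀx+w0=0, which is why it is linear.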

In the above we have outlined the derivation of the generative logistic regression model. The parameters are typically estimated with maximum likelihood, as we have done.

Finally, we will use the above equations to directly parameterize the output Bernoulli distribution of the generative logistic regression model.

Let’s now write a function according to the following specification:

• The inputs to the function are:
• the prior distribution prior over the two classes
• the (batched) class-conditional distribution class_conditionals
• The function should use the parameters of the above distributions to compute the weights and bias terms w and w0 as above
• The function should then return a tuple of two numpy arrays for w and w0
def get_logistic_regression_params(prior, class_conditionals):
    Sigma = class_conditionals.covariance().numpy()[0]  # shared covariance matrix
    SI = np.linalg.inv(Sigma)
    p = prior.probs.numpy()
    mu0, mu1 = class_conditionals.parameters['loc']
    w = SI @ (mu0 - mu1)
    w0 = -0.5 * mu0.T @ SI @ mu0 + 0.5 * mu1.T @ SI @ mu1 + np.log(p[0] / p[1])
    return w, w0

w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)

We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.

# Probabilistic generative models

Let’s start with generative models, using normalizing flow networks and the variational autoencoder algorithm. We shall create a synthetic dataset with a normalizing flow with randomised parameters. This dataset will then be used to train a variational autoencoder, and the trained model will be used to interpolate between the generated images. The concepts to be used will be

• Distribution objects
• Probabilistic layers
• Bijectors
• ELBO optimization
• KL divergence regularizers

The next figure represents the theory required for the implementation:

Let’s start by importing the required libraries.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tfpl = tfp.layers

import numpy as np
import matplotlib.pyplot as plt

We shall create our own image dataset from contour plots of a transformed distribution using a random normalizing flow network and then use the variational autoencoder algorithm to train generative and inference networks, and synthesise new images by interpolating in the latent space.

#### The normalising flow

• To construct the image dataset, we will build a normalizing flow to transform the 2-D Gaussian random variable z = (z1, z2), which has mean 0 and covariance matrix Σ = σ^2 I2, with σ = 0.3.
• This normalizing flow uses bijectors that are parameterized by the following random variables:
• θ ∼ U[0,2π)
• a ∼ N(3,1)

The complete normalising flow is given by the following chain of transformations:

• f1(z)=(z1,z2−2)
• f2(z)=(z1,z2/2),
• f3(z)=(z1,z2+a.z1^2),
• f4(z)=R.z, where R is a rotation matrix with angle θ,
• f5(z)=tanh(z), where the tanh function is applied elementwise.

The transformed random variable x is given by x=f5(f4(f3(f2(f1(z))))).
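Before wiring these transformations up as TFP bijectors below, the forward chain can be sanity-checked in plain numpy (with θ and a fixed to hypothetical values rather than sampled):

```python
import numpy as np

a, theta = 3.0, np.pi / 4  # hypothetical draws of the random parameters
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

def flow_forward(z):
    z = np.array([z[0], z[1] - 2.0])            # f1: shift second coordinate
    z = np.array([z[0], z[1] / 2.0])            # f2: scale second coordinate
    z = np.array([z[0], z[1] + a * z[0] ** 2])  # f3: degree-2 polynomial
    z = R @ z                                   # f4: rotation by theta
    return np.tanh(z)                           # f5: elementwise tanh

x = flow_forward(np.array([0.0, 0.0]))  # image of the base distribution's mean
```

Because the final transformation is tanh, every output lands in (−1, 1), which is why the scatter plots below are drawn with axes fixed to [−1, 1].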

• We need to use or construct bijectors for each of the transformations fi, i=1,…,5 and use tfb.Chain and tfb.TransformedDistribution to construct the final transformed distribution.
• Ensure to implement the log_det_jacobian methods for any subclassed bijectors that you write.
• Display a scatter plot of samples from the base distribution.
• Display 4 scatter plot images of the transformed distribution from your random normalizing flow, using samples of θ and a. Fix the axes of these 4 plots to the range [−1,1].

The following code block shows how to implement the above steps:

def plot_distribution(samples, ax, title, col='red'):
    ax.scatter(samples[:, 0], samples[:, 1], marker='.', c=col, alpha=0.5)
    ax.set_xlim([-1, 1])
    ax.set_ylim([-1, 1])
    ax.set_title(title, size=15)
# f3(z) = (z1, z2 + a·z1^2)
class Degree2Polynomial(tfb.Bijector):

    def __init__(self, a):
        self.a = a
        super(Degree2Polynomial, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)

    def _forward(self, x):
        return tf.concat([x[..., :1], x[..., 1:] + self.a * tf.square(x[..., :1])], axis=-1)

    def _inverse(self, y):
        return tf.concat([y[..., :1], y[..., 1:] - self.a * tf.square(y[..., :1])], axis=-1)

    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., dtype=x.dtype)

# f4(z) = R·z
class Rotation(tfb.Bijector):

    def __init__(self, theta):
        self.R = tf.constant([[np.cos(theta), -np.sin(theta)],
                              [np.sin(theta), np.cos(theta)]], dtype=tf.float32)
        super(Rotation, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)

    def _forward(self, x):
        return tf.linalg.matvec(self.R, x)

    def _inverse(self, y):
        return tf.linalg.matvec(tf.transpose(self.R), y)

    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., x.dtype)
def get_normalizing_flow_dist(a, theta):
    bijectors = [
        tfb.Shift([0., -2.]),   # f1
        tfb.Scale([1., 1/2]),   # f2
        Degree2Polynomial(a),   # f3
        Rotation(theta),        # f4
        tfb.Tanh()              # f5
    ]
    flow_bijector = tfb.Chain(list(reversed(bijectors)))
    return tfd.TransformedDistribution(distribution=base_distribution,
                                       bijector=flow_bijector)
nsamples = 10000
sigma = 0.3
base_distribution = tfd.MultivariateNormalDiag(loc=tf.zeros(2), scale_diag=sigma * tf.ones(2))
samples = base_distribution.sample(nsamples)
fig, ax = plt.subplots(figsize=(8, 8))
plot_distribution(samples, ax, 'Base distribution', 'blue')
plt.show()

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15, 15))
axes = axes.flatten()
plt.subplots_adjust(0, 0, 1, 0.925, 0.05, 0.05)
colors = ['red', 'green', 'orange', 'magenta']
for i in range(4):
    a = tfd.Normal(loc=3., scale=1.).sample().numpy()
    theta = tfd.Uniform(low=0., high=2 * np.pi).sample().numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    samples = transformed_distribution.sample(nsamples)
    plot_distribution(samples, axes[i], r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), colors[i])
plt.suptitle('Transformed Distribution with Normalizing Flow', size=20)
plt.show()

## Create the image dataset

• Let’s now use our random normalizing flow network to generate an image dataset of contour plots.
• First, let’s display a sample of 4 contour plot images from our normalizing flow network using 4 independently sampled sets of parameters, using the following get_densities function: this function calculates density values for a (batched) Distribution for use in a contour plot.
• The dataset should consist of at least 1000 images, stored in a numpy array of shape (N, 36, 36, 3). Each image in the dataset should correspond to a contour plot of a transformed distribution from a normalizing flow with an independently sampled set of parameters. It will take a few minutes to create the dataset.
• As well as the get_densities function, the following get_image_array_from_density_values function will help to generate the dataset. This function creates a numpy array for an image of the contour plot for a given set of density values Z. Feel free to choose your own options for the contour plots.
• Let’s display a sample of 20 images from your generated dataset in a figure.
X, Y = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
inputs = np.transpose(np.stack((X, Y)), [1, 2, 0])

def get_densities(transformed_distribution):
    batch_shape = transformed_distribution.batch_shape
    Z = transformed_distribution.prob(np.expand_dims(inputs, 2))
    Z = np.transpose(Z, list(range(2, 2 + len(batch_shape))) + [0, 1])
    return Z
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

def get_image_array_from_density_values(Z):
    assert Z.shape == (100, 100)
    fig = Figure(figsize=(0.5, 0.5))
    canvas = FigureCanvas(fig)
    ax = fig.gca()
    ax.contourf(X, Y, Z, cmap='hot', levels=100)
    ax.axis('off')
    ax.margins(0)
    fig.canvas.draw()
    image_from_plot = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
    image_from_plot = image_from_plot.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    return image_from_plot
plt.figure(figsize=(5, 5))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
for i in range(4):
    a = tfd.Normal(loc=3., scale=1.).sample().numpy()
    theta = tfd.Uniform(low=0., high=2 * np.pi).sample().numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    transformed_distribution = tfd.BatchReshape(transformed_distribution, [1])
    Z = get_densities(transformed_distribution)
    image = get_image_array_from_density_values(Z.squeeze())
    plt.subplot(2, 2, i + 1), plt.imshow(image), plt.axis('off')
    plt.title(r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), size=10)
plt.show()
N = 1000
image_dataset = np.zeros((N, 36, 36, 3))
for i in range(N):
    a = tfd.Normal(loc=3., scale=1.).sample().numpy()
    theta = tfd.Uniform(low=0., high=2 * np.pi).sample().numpy()
    transformed_distribution = tfd.BatchReshape(get_normalizing_flow_dist(a, theta), [1])
    image_dataset[i, ...] = get_image_array_from_density_values(get_densities(transformed_distribution).squeeze())
image_dataset = tf.convert_to_tensor(image_dataset, dtype=tf.float32)
image_dataset.shape
# TensorShape([1000, 36, 36, 3])

plt.figure(figsize=(20, 4))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(N, 20)
for i in range(20):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(2, 10, i + 1), plt.imshow(image), plt.axis('off')
plt.show()

## Create tf.data.Dataset objects

• Let’s now split the dataset to create tf.data.Dataset objects for training and validation data.
• Using the map method, let’s normalize the pixel values so that they lie between 0 and 1.
• These Datasets will be used to train a variational autoencoder (VAE). Use the map method to return a tuple of input and output Tensors where the image is duplicated as both input and output.
• Randomly shuffle the training Dataset.
• Batch both datasets with a batch size of 20, setting drop_remainder=True.
• Print the element_spec property for one of the Dataset objects.
n = len(image_dataset)
tf_image_dataset = tf.data.Dataset.from_tensor_slices(image_dataset)
tf_image_dataset = tf_image_dataset.shuffle(3)
tf_image_dataset = tf_image_dataset.map(lambda x: x / tf.reduce_max(x))
tf_image_dataset = tf_image_dataset.map(lambda x: (x, x))
train_sz = int(0.8 * n)
training = tf_image_dataset.take(train_sz)
validation = tf_image_dataset.skip(train_sz)
training = training.batch(batch_size=20, drop_remainder=True)
validation = validation.batch(batch_size=20, drop_remainder=True)
training.element_spec
#(TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None),
# TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None))

## Build the encoder and decoder networks

• Let’s now create the encoder and decoder for the variational autoencoder algorithm.
• Let’s design these networks, subject to the following constraints:
• The encoder and decoder networks should be built using the Sequential class.
• The encoder and decoder networks should use probabilistic layers where necessary to represent distributions.
• The prior distribution should be a zero-mean, isotropic Gaussian (identity covariance matrix).
• The encoder network should add the KL divergence loss to the model.
• Print the model summary for the encoder and decoder networks.
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Dense, Flatten, Reshape, Concatenate, Conv2D, UpSampling2D, BatchNormalization)

latent_dim = 2
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

def get_kl_regularizer(prior_distribution):
    return tfpl.KLDivergenceRegularizer(prior_distribution,
                                        weight=1.0,
                                        use_exact_kl=False,
                                        test_points_fn=lambda q: q.sample(3),
                                        test_points_reduce_axis=(0, 1))

kl_regularizer = get_kl_regularizer(prior)

def get_encoder(latent_dim, kl_regularizer):
    return Sequential([
        Conv2D(filters=32, kernel_size=3, activation='relu', strides=2, padding='same', input_shape=(36, 36, 3)),
        BatchNormalization(),
        Conv2D(filters=64, kernel_size=3, activation='relu', strides=2, padding='same'),
        BatchNormalization(),
        Conv2D(filters=128, kernel_size=3, activation='relu', strides=3, padding='same'),
        BatchNormalization(),
        Flatten(),
        Dense(tfpl.MultivariateNormalTriL.params_size(latent_dim)),
        tfpl.MultivariateNormalTriL(latent_dim, activity_regularizer=kl_regularizer)
    ], name='encoder')

def get_decoder(latent_dim):
    return Sequential([
        Dense(1152, activation='relu', input_shape=(latent_dim,)),
        Reshape((3, 3, 128)),
        UpSampling2D(size=(3, 3)),
        Conv2D(filters=64, kernel_size=3, activation='relu', padding='same'),
        UpSampling2D(size=(2, 2)),
        Conv2D(filters=32, kernel_size=2, activation='relu', padding='same'),
        UpSampling2D(size=(2, 2)),
        Conv2D(filters=128, kernel_size=2, activation='relu', padding='same'),
        Conv2D(filters=3, kernel_size=2, padding='same'),
        Flatten(),
        tfpl.IndependentBernoulli(event_shape=(36, 36, 3))
    ], name='decoder')

encoder = get_encoder(latent_dim=2, kl_regularizer=kl_regularizer)
encoder.summary()

Model: "encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
conv2d (Conv2D)              (None, 18, 18, 32)        896
_________________________________________________________________
batch_normalization (BatchNo (None, 18, 18, 32)        128
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 9, 64)          18496
_________________________________________________________________
batch_normalization_1 (Batch (None, 9, 9, 64)          256
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 128)         73856
_________________________________________________________________
batch_normalization_2 (Batch (None, 3, 3, 128)         512
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0
_________________________________________________________________
dense (Dense)                (None, 5)                 5765
_________________________________________________________________
multivariate_normal_tri_l (M multiple                  0
=================================================================
Total params: 99,909
Trainable params: 99,461
Non-trainable params: 448
_________________________________________________________________

decoder = get_decoder(latent_dim=2)
decoder.summary()

Model: "decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
dense_1 (Dense)              (None, 1152)              3456
_________________________________________________________________
reshape (Reshape)            (None, 3, 3, 128)         0
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 9, 9, 128)         0
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 9, 9, 64)          73792
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 18, 18, 64)        0
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 18, 18, 32)        8224
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 36, 36, 32)        0
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 36, 36, 128)       16512
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 36, 36, 3)         1539
_________________________________________________________________
flatten_1 (Flatten)          (None, 3888)              0
_________________________________________________________________
independent_bernoulli (Indep multiple                  0
=================================================================
Total params: 103,523
Trainable params: 103,523
Non-trainable params: 0
_________________________________________________________________

def reconstruction_loss(batch_of_images, decoding_dist):
    return -tf.reduce_mean(decoding_dist.log_prob(batch_of_images))
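The reconstruction loss above is only half of the VAE objective: the KL-divergence term is injected separately, via the `activity_regularizer=kl_regularizer` argument of the encoder's `MultivariateNormalTriL` layer. For intuition, when the approximate posterior is a diagonal Gaussian N(μ, σ²) and the prior is a standard normal, the KL term has the closed form 0.5 Σᵢ (σᵢ² + μᵢ² − 1 − log σᵢ²). A small numpy sketch of this formula (the μ, σ values below are made up for illustration):

```python
import numpy as np

def kl_diag_gaussian_vs_std_normal(mu, sigma):
    # KL( N(mu, diag(sigma^2)) || N(0, I) )
    #   = 0.5 * sum( sigma^2 + mu^2 - 1 - log(sigma^2) )
    mu, sigma = np.asarray(mu), np.asarray(sigma)
    return 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - np.log(sigma**2))

# made-up 2-d posterior parameters
print(kl_diag_gaussian_vs_std_normal([0.5, -0.3], [0.8, 1.2]))

# sanity check: a posterior equal to the prior gives KL = 0
print(kl_diag_gaussian_vs_std_normal([0.0, 0.0], [1.0, 1.0]))  # 0.0
```

The KL is zero exactly when the posterior matches the prior, and positive otherwise, which is what pulls the embeddings towards the standard normal during training.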

## Train the variational autoencoder

• Let’s now train the variational autoencoder. Build the VAE using the Model class and the encoder and decoder models. Print the model summary.
• Compile the VAE with the negative log likelihood loss and train with the fit method, using the training and validation Datasets.
• Plot the learning curves for loss vs epoch for both training and validation sets.
vae = Model(inputs=encoder.inputs, outputs=decoder(encoder.outputs))
vae.compile(optimizer=optimizer, loss=reconstruction_loss)
history = vae.fit(training, validation_data=validation, epochs=20)

Epoch 1/20
40/40 [==============================] - 34s 777ms/step - loss: 1250.2296 - val_loss: 1858.7103
Epoch 2/20
40/40 [==============================] - 29s 731ms/step - loss: 661.8687 - val_loss: 1605.1261
Epoch 3/20
40/40 [==============================] - 29s 720ms/step - loss: 545.2802 - val_loss: 1245.0518
Epoch 4/20
40/40 [==============================] - 28s 713ms/step - loss: 489.1101 - val_loss: 1024.5863
Epoch 5/20
40/40 [==============================] - 29s 718ms/step - loss: 453.3464 - val_loss: 841.4725
Epoch 6/20
40/40 [==============================] - 29s 733ms/step - loss: 438.8413 - val_loss: 742.0212
Epoch 7/20
40/40 [==============================] - 30s 751ms/step - loss: 433.2563 - val_loss: 657.4024
Epoch 8/20
40/40 [==============================] - 30s 751ms/step - loss: 417.5353 - val_loss: 602.7039
Epoch 9/20
40/40 [==============================] - 29s 726ms/step - loss: 409.8351 - val_loss: 545.5004
Epoch 10/20
40/40 [==============================] - 30s 741ms/step - loss: 406.3284 - val_loss: 507.9868
Epoch 11/20
40/40 [==============================] - 30s 741ms/step - loss: 402.9056 - val_loss: 462.0777
Epoch 12/20
40/40 [==============================] - 29s 733ms/step - loss: 397.4801 - val_loss: 444.4444
Epoch 13/20
40/40 [==============================] - 30s 741ms/step - loss: 398.2078 - val_loss: 423.1287
Epoch 14/20
40/40 [==============================] - 29s 723ms/step - loss: 395.5187 - val_loss: 411.3030
Epoch 15/20
40/40 [==============================] - 30s 739ms/step - loss: 397.3987 - val_loss: 407.5134
Epoch 16/20
40/40 [==============================] - 29s 721ms/step - loss: 399.3271 - val_loss: 402.7288
Epoch 17/20
40/40 [==============================] - 29s 736ms/step - loss: 393.4259 - val_loss: 401.4711
Epoch 18/20
40/40 [==============================] - 29s 726ms/step - loss: 390.5508 - val_loss: 399.1924
Epoch 19/20
40/40 [==============================] - 29s 736ms/step - loss: 389.3187 - val_loss: 401.1656
Epoch 20/20
40/40 [==============================] - 29s 728ms/step - loss: 389.4718 - val_loss: 393.5178
nepochs = 20
plt.figure(figsize=(8,5))
plt.plot(range(nepochs), history.history['loss'], label='train-loss')
plt.plot(range(nepochs), history.history['val_loss'], label='valid-loss')
plt.legend()
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

## Use the encoder and decoder networks

• Let’s now put the encoder and decoder networks into practice!
• Randomly sample 1000 images from the dataset, and pass them through the encoder. Display the embeddings in a scatter plot (project to 2 dimensions if the latent space has dimension higher than two).
• Randomly sample 4 images from the dataset and for each image, display the original and reconstructed image from the VAE in a figure.
• Use the mean of the output distribution to display the images.
• Randomly sample 6 latent variable realisations from the prior distribution, and display the images in a figure.
• Again use the mean of the output distribution to display the images.
def reconstruct(encoder, decoder, batch_of_images):
    approx_distribution = encoder(batch_of_images)
    decoding_dist = decoder(approx_distribution.mean())
    return decoding_dist.mean()

embedding = encoder(image_dataset / 255).mean()
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(embedding[:,0], embedding[:,1], c='red', s=50, edgecolor='k')
plt.title('Embedding', size=20)
plt.show()
plt.figure(figsize=(6,12))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(len(image_dataset), 4)
for i in range(4):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(4,2,2*i+1), plt.imshow(image), plt.axis('off')
    reconstructions = reconstruct(encoder, decoder, np.expand_dims(image, axis=0))
    plt.subplot(4,2,2*i+2), plt.imshow(reconstructions.numpy()), plt.axis('off')
plt.suptitle('original (left column) vs. VAE-reconstructed (right column)', size=15)
plt.show()
nsample = 6
samples = np.random.uniform(-10, 10, (nsample, latent_dim)) #prior.sample(6)
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(samples[:,0], samples[:,1], color='blue')
for i in range(nsample):
    plt.text(samples[i,0] + 0.05, samples[i,1] + 0.05, 'embedding {}'.format(i), fontsize=15)
plt.title('Embeddings', size=20)
plt.show()
reconstructions = decoder(samples).mean()
#print(samples.shape, reconstructions.shape)
plt.figure(figsize=(8,6))
plt.subplots_adjust(0, 0, 1, 0.9, 0.05, 0.08)
for i in range(nsample):
    plt.subplot(2,3,i+1), plt.imshow(reconstructions[i]), plt.title('image {}'.format(i)), plt.axis('off')
plt.suptitle('VAE-reconstructions', size=20)
plt.show()


The following animation of latent space interpolation shows the decoder’s generations as the sampled point moves through the latent space.
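Such an interpolation is easy to compute: take two latent codes and decode points on the line segment joining them. A minimal numpy sketch (the endpoints z0, z1 below are made up; the decoder call is the same as in the sampling code above):

```python
import numpy as np

def interpolate_latents(z_start, z_end, num_steps=10):
    """Linearly interpolate between two latent vectors."""
    ts = np.linspace(0.0, 1.0, num_steps)[:, None]
    return (1.0 - ts) * z_start[None, :] + ts * z_end[None, :]

# two made-up points in the 2-d latent space
z0, z1 = np.array([-2.0, 1.0]), np.array([2.0, -1.0])
zs = interpolate_latents(z0, z1, num_steps=5)  # shape (5, 2)
# frames = decoder(zs).mean()  # decode each step, as before
```

Decoding each row of `zs` and displaying the means in sequence produces the smooth morphing seen in the animation.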

To be continued…

# Machine learning with H2O in R / Python

In this blog, we shall discuss how to use H2O to build a few supervised machine learning models. H2O is a Java-based software for data modeling and general computing, whose primary purpose is to serve as a distributed, parallel, in-memory processing engine. It needs to be installed first (instructions) and by default an H2O instance will run on localhost:54321. Additionally, one needs to install the R / python client to communicate with the H2O instance. Every new R / python session first needs to initialize a connection between the client and the H2O cluster.

The problems to be described in this blog appeared in the exercises / projects in the coursera course “Practical Machine Learning on H2O“, by H2O. The problem statements / descriptions / steps are taken from the course itself. We shall use the concepts from the course in order to:

• build a few machine learning / deep learning models using different algorithms (such as Gradient Boosting, Random Forest, Neural Net, Elastic Net GLM etc.),
• review the classic bias-variance tradeoff (overfitting),
• tune hyper-parameters using Grid Search,
• use AutoML to automatically find a bunch of good performing models, and
• use Stacked Ensembles of models to improve performance.
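Before moving to H2O, the bias-variance tradeoff listed above can be illustrated with a tiny numpy-only experiment (no H2O needed): fitting polynomials of increasing degree to noisy data drives the training error down, while the held-out error eventually gets worse, which is the signature of over-fitting.

```python
import numpy as np

rng = np.random.default_rng(42)

# noisy samples from a smooth underlying function
x = rng.uniform(-1, 1, 80)
y = np.sin(3 * x) + rng.normal(0, 0.3, size=80)
x_tr, y_tr = x[:60], y[:60]   # training split
x_te, y_te = x[60:], y[60:]   # held-out split

def poly_rmse(degree):
    """Fit a polynomial of the given degree on the training split,
    return (train RMSE, held-out RMSE)."""
    coeffs = np.polyfit(x_tr, y_tr, degree)
    def rmse(xs, ys):
        return np.sqrt(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    return rmse(x_tr, y_tr), rmse(x_te, y_te)

for degree in (1, 3, 15):
    print(degree, poly_rmse(degree))
```

The degree-1 model underfits (high bias), while the degree-15 model fits the training noise; the same train-vs-holdout gap is what we will look for in the H2O models below.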

## Problem 1

In this problem we shall create an artificial data set, then run random forest / GBM on it with H2O, to create two supervised models for classification: one that is reasonable and another that shows clear over-fitting. We shall use the R client (package) for H2O for this problem.

1. Let’s first create a data set to predict an employee’s job satisfaction in an organization. Let’s say an employee’s job satisfaction depends on the following factors (there are several other factors in general, but we shall limit ourselves to the following few):
• work environment
• pay
• flexibility
• relationship with manager
• age
set.seed(321)

# Let's say an employee's job satisfaction depends on the work environment, pay, flexibility, relationship with manager and age.

N <- 1000                                         # number of samples
d <- data.frame(id = 1:N)
d$workEnvironment <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
v <- round(rnorm(N, mean=60000, sd=20000))        # 68% are 40-80k
v <- pmax(v, 20000)
v <- pmin(v, 100000)
#table(v)
d$pay <- v
d$flexibility <- sample(1:5, N, replace=TRUE)     # on a scale of 1-5, 1 being bad and 5 being good
d$managerRel <- sample(1:5, N, replace=TRUE)      # on a scale of 1-5, 1 being bad and 5 being good
d$age <- round(runif(N, min=20, max=60))
head(d)
#  id workEnvironment   pay flexibility managerRel age
#1  1               2 20000           2          2  21
#2  2               5 75817           1          2  31
#3  3               5 45649           5          3  25
#4  4               1 47157           1          5  55
#5  5               2 69729           2          4  33
#6  6               1 75101           2          2  39

v <- 125 * (d$pay/1000)^2        # e.g., job satisfaction score is proportional to square of pay (hypothetically)
v <- v + 250 / log(d$age)        # e.g., inversely proportional to log of age
v <- v + 5 * d$flexibility
v <- v + 200 * d$workEnvironment
v <- v + 1000 * d$managerRel^3
v <- v + runif(N, 0, 5000)
v <- 100 * (v - 0) / (max(v) - min(v)) # min-max normalization to bring the score in 0-100
d$jobSatScore <- round(v)        # round to nearest integer (percentage)

2. Let’s start h2o, and import the data.

library(h2o)
h2o.init()
as.h2o(d, destination_frame = "jobsatisfaction")
jobsat <- h2o.getFrame("jobsatisfaction")
# |===========================================================================================================| 100%
#  id workEnvironment   pay flexibility managerRel age jobSatScore
#1  1               2 20000           2          2  21           5
#2  2               5 75817           1          2  31          55
#3  3               5 45649           5          3  25          22
#4  4               1 47157           1          5  55          30
#5  5               2 69729           2          4  33          51
#6  6               1 75101           2          2  39          54

3. Let’s split the data. Here we plan to use cross-validation.

parts <- h2o.splitFrame(
    jobsat,
    ratios = 0.8,
    destination_frames=c("jobsat_train", "jobsat_test"),
    seed = 321)
train <- h2o.getFrame("jobsat_train")
test <- h2o.getFrame("jobsat_test")
nrow(train) # 794 rows
nrow(test)  # 206 rows
y <- "jobSatScore"
x <- setdiff(names(train), c("id", y))

4. Let’s choose the gradient boosting model (GBM), and create a model. It’s a regression model, since the output variable is treated as continuous.

# the reasonable model with 10-fold cross-validation
m_res <- h2o.gbm(x, y, train,
    model_id = "model10foldsreasonable",
    ntrees = 20,
    nfolds = 10,
    seed = 123)
h2o.performance(m_res, train = TRUE) # RMSE 2.840688
#H2ORegressionMetrics: gbm
#** Reported on training data. **
#MSE: 8.069509
#RMSE: 2.840688
#MAE: 2.266134
#RMSLE: 0.1357181
#Mean Residual Deviance : 8.069509

h2o.performance(m_res, xval = TRUE) # RMSE 2.973807
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
#MSE: 8.84353
#RMSE: 2.973807
#MAE: 2.320899
#RMSLE: 0.1384746
#Mean Residual Deviance : 8.84353

h2o.performance(m_res, test) # RMSE 3.299601
#H2ORegressionMetrics: gbm
#MSE: 10.88737
#RMSE: 3.299601
#MAE: 2.524492
#RMSLE: 0.1409274
#Mean Residual Deviance : 10.88737

5.
Let’s try some alternative parameters, to build a different model, and show how the results differ.

# overfitting model with 10-fold cross-validation
m_ovf <- h2o.gbm(x, y, train,
    model_id = "model10foldsoverfitting",
    ntrees = 2000,
    max_depth = 20,
    nfolds = 10,
    seed = 123)
h2o.performance(m_ovf, train = TRUE) # RMSE 0.004474786
#H2ORegressionMetrics: gbm
#** Reported on training data. **
#MSE: 2.002371e-05
#RMSE: 0.004474786
#MAE: 0.0007455944
#RMSLE: 5.032019e-05
#Mean Residual Deviance : 2.002371e-05

h2o.performance(m_ovf, xval = TRUE) # RMSE 0.6801615
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
#MSE: 0.4626197
#RMSE: 0.6801615
#MAE: 0.4820542
#RMSLE: 0.02323415
#Mean Residual Deviance : 0.4626197

h2o.performance(m_ovf, test) # RMSE 0.4969761
#H2ORegressionMetrics: gbm
#MSE: 0.2469853
#RMSE: 0.4969761
#MAE: 0.3749822
#RMSLE: 0.01698435
#Mean Residual Deviance : 0.2469853

## Problem 2

#### Predict Chocolate Makers Location with Deep Learning Model with H2O

The data is available here: http://coursera.h2o.ai/cacao.882.csv. This is a classification problem: we need to predict “Maker Location”. In other words, using the rating and the other fields, how accurately can we identify whether it is Belgian chocolate, French chocolate, and so on? We shall use the python client (library) for H2O for this problem.

1. Let’s start h2o, load the data set, and split it. By the end of this stage we should have three variables, pointing to three data frames on h2o: train, valid, test. However, if we choose to use cross-validation, we will only have two: train and test.

import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('http://coursera.h2o.ai/cacao.882.csv')
print(df.shape)
# (1795, 9)
df.head()

print(df['Maker Location'].unique())
# ['France' 'U.S.A.' 'Fiji' 'Ecuador' 'Mexico' 'Switzerland' 'Netherlands'
#  'Spain' 'Peru' 'Canada' 'Italy' 'Brazil' 'U.K.' 'Australia' 'Wales'
#  'Belgium' 'Germany' 'Russia' 'Puerto Rico' 'Venezuela' 'Colombia' 'Japan'
#  'New Zealand' 'Costa Rica' 'South Korea' 'Amsterdam' 'Scotland'
#  'Martinique' 'Sao Tome' 'Argentina' 'Guatemala' 'South Africa' 'Bolivia'
#  'St. Lucia' 'Portugal' 'Singapore' 'Denmark' 'Vietnam' 'Grenada' 'Israel'
#  'India' 'Czech Republic' 'Domincan Republic' 'Finland' 'Madagascar'
#  'Philippines' 'Sweden' 'Poland' 'Austria' 'Honduras' 'Nicaragua'
#  'Lithuania' 'Niacragua' 'Chile' 'Ghana' 'Iceland' 'Eucador' 'Hungary'
#  'Suriname' 'Ireland']
print(len(df['Maker Location'].unique()))
# 60

loc_table = df['Maker Location'].value_counts()
print(loc_table)
#U.S.A.               764
#France               156
#Canada               125
#U.K.                  96
#Italy                 63
#Ecuador               54
#Australia             49
#Belgium               40
#Switzerland           38
#Germany               35
#Austria               26
#Spain                 25
#Colombia              23
#Hungary               22
#Venezuela             20
#Madagascar            17
#Japan                 17
#New Zealand           17
#Brazil                17
#Peru                  17
#Denmark               15
#Vietnam               11
#Scotland              10
#Guatemala             10
#Costa Rica             9
#Israel                 9
#Argentina              9
#Poland                 8
#Honduras               6
#Lithuania              6
#Sweden                 5
#Nicaragua              5
#Domincan Republic      5
#South Korea            5
#Netherlands            4
#Amsterdam              4
#Puerto Rico            4
#Fiji                   4
#Sao Tome               4
#Mexico                 4
#Ireland                4
#Portugal               3
#Singapore              3
#Iceland                3
#South Africa           3
#Grenada                3
#Chile                  2
#St. Lucia              2
#Bolivia                2
#Finland                2
#Martinique             1
#Eucador                1
#Wales                  1
#Czech Republic         1
#Suriname               1
#Ghana                  1
#India                  1
#Niacragua              1
#Philippines            1
#Russia                 1
#Name: Maker Location, dtype: int64

loc_table.hist()

As can be seen from the above table, some of the locations have too few records, which will result in poor accuracy of the learnt model after the dataset is split into train, validation and test sets. Let’s get rid of the locations that have a small number (< 40) of examples in the dataset, to make the results more easily comprehensible by reducing the number of categories in the output variable.
## filter out the countries for which there are < 40 examples present in the dataset
loc_gt_40_recs = loc_table[loc_table >= 40].index.tolist()
df_sub = df[df['Maker Location'].isin(loc_gt_40_recs)]

# now connect to H2O
h2o.init()
# h2o.clusterStatus()

h2o_df = h2o.H2OFrame(df_sub.values,
                      destination_frame = "cacao_882",
                      column_names=[x.replace(' ', '_') for x in df.columns.tolist()])
#h2o_df.head()
#h2o_df.summary()

df_cacao_882 = h2o.get_frame('cacao_882') # df_cacao_882.as_data_frame()
#df_cacao_882.head()
df_cacao_882.describe()

df_cacao_882['Maker_Location'].table()
#Maker_Location  Count
#Australia          49
#Belgium            40
#Canada            125
#Ecuador            54
#France            156
#Italy              63
#U.K.               96
#U.S.A.            764

train, valid, test = df_cacao_882.split_frame(ratios = [0.8, 0.1],
                                              destination_frames = ['train', 'valid', 'test'],
                                              seed = 321)
print("%d/%d/%d" %(train.nrows, valid.nrows, test.nrows))
# 1082/138/127

2. Let’s set x to be the list of columns we shall use to train on, and y to be the column we shall learn. Here it’s going to be a multi-class classification problem.

ignore_fields = ['Review_Date', 'Bean_Type', 'Maker_Location']
# Specify the response and predictor columns
y = 'Maker_Location' # multinomial classification
x = [i for i in train.names if not i in ignore_fields]

3. Let’s now create a baseline deep learning model. It is recommended to use all default settings (remembering to specify either nfolds or validation_frame) for the baseline model.
from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator()
%time model.train(x = x, y = y, training_frame = train, validation_frame = valid)
# deeplearning Model Build progress: |██████████████████████████████████████| 100%
# Wall time: 6.44 s

model.model_performance(train).mean_per_class_error()
# 0.05118279569892473
model.model_performance(valid).mean_per_class_error()
# 0.26888404593884047
perf_test = model.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
# Mean class error 0.2149184149184149
print('log loss', perf_test.logloss())
# log loss 0.48864148412056846
print('MSE', perf_test.mse())
# MSE 0.11940531127368789
print('RMSE', perf_test.rmse())
# RMSE 0.3455507361787671
perf_test.hit_ratio_table() # Top-8 Hit Ratios
perf_test.confusion_matrix().as_data_frame()
model.plot()

4. Now, let’s create a tuned model that gives superior performance. However, we should use no more than 10 times the running time of the baseline model, so again our script should be timing the model.
model_tuned = H2ODeepLearningEstimator(epochs=200,
                                       distribution="multinomial",
                                       activation="RectifierWithDropout",
                                       stopping_rounds=5,
                                       stopping_tolerance=0,
                                       stopping_metric="logloss",
                                       input_dropout_ratio=0.2,
                                       l1=1e-5,
                                       hidden=[200,200,200])
%time model_tuned.train(x, y, training_frame = train, validation_frame = valid)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 30.8 s

model_tuned.model_performance(train).mean_per_class_error()
#0.0
model_tuned.model_performance(valid).mean_per_class_error()
#0.07696485401964853
perf_test = model_tuned.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
#Mean class error 0.05909090909090909
print('log loss', perf_test.logloss())
#log loss 0.14153784501504524
print('MSE', perf_test.mse())
#MSE 0.03497231075826773
print('RMSE', perf_test.rmse())
#RMSE 0.18700885208531637
perf_test.hit_ratio_table() # Top-8 Hit Ratios
perf_test.confusion_matrix().as_data_frame()
model_tuned.plot()

As can be seen from the above plot, the early-stopping strategy prevented the model from overfitting, and the model achieves better accuracy on the test dataset.

5. Let’s save both the models, to the local disk, using save_model(), to export the binary version of the model. (Do not export a POJO.)

h2o.save_model(model, 'base_model')
h2o.save_model(model_tuned, 'tuned_model')

We may want to include a seed in the model function above to get reproducible results.

## Problem 3

#### Predict Price of a house with Stacked Ensemble model with H2O

The data is available at http://coursera.h2o.ai/house_data.3487.csv. This is a regression problem: we have to predict the “price” of a house given different feature values. We shall use the python client for H2O again for this problem. The data needs to be split into train and test, using 0.9 for the ratio, and a seed of 123. That should give 19,462 training rows and 2,151 test rows. The target is an RMSE below $123,000.
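As a quick reminder, RMSE here is the square root of the mean squared difference between predicted and actual prices, so it is in the same unit (dollars) as the target. A short numpy sketch with made-up prices:

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two price vectors."""
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

# made-up actual vs. predicted house prices
print(rmse([450000., 300000.], [440000., 310000.]))  # 10000.0
```

So a model hitting the target makes, roughly speaking, typical errors of less than $123k per house.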

1. Let’s start h2o, load the chosen dataset and follow the data manipulation steps. For example, we can split date into year and month columns. We can then optionally combine them into a numeric date column. At the end of this step we shall have train, test, x and y variables, and possibly valid also. The below shows the code snippet to do this.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from time import time

h2o.init()

url = "http://coursera.h2o.ai/house_data.3487.csv"
house_df = h2o.import_file(url, destination_frame = "house_data")
# Parse progress: |█████████████████████████████████████████████████████████| 100%

#### Preprocessing

house_df['year'] = house_df['date'].substring(0,4).asnumeric()
house_df['month'] = house_df['date'].substring(4,6).asnumeric()
house_df['day'] = house_df['date'].substring(6,8).asnumeric()
house_df = house_df.drop('date')
house_df.head()
house_df.describe()

plt.hist(house_df.as_data_frame()['price'].tolist(), bins=np.linspace(0,10**6,1000))
plt.show()
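The substring(0,4), substring(4,6) and substring(6,8) calls above just slice the year, month and day out of the date strings, which in this dataset look like "20141013T000000" (an assumption based on the slicing positions used). The same slicing in plain Python:

```python
def split_date(date_str):
    """Slice a 'YYYYMMDDT...'-style date string into numeric (year, month, day)."""
    return int(date_str[0:4]), int(date_str[4:6]), int(date_str[6:8])

print(split_date("20141013T000000"))  # (2014, 10, 13)
```

Dropping the raw string column afterwards, as the snippet does, leaves only numeric features for the models.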

We shall use cross-validation and not a validation dataset.

train, test = house_df.split_frame(ratios=[0.9], destination_frames = ['train', 'test'], seed=123)
print("%d/%d" %(train.nrows, test.nrows))
# 19462/2151
ignore_fields = ['id', 'price']
x = [i for i in train.names if not i in ignore_fields]
y = 'price'


2. Let’s now train at least four different models on the preprocessed dataset, using at least three different supervised algorithms. Let’s save all the models.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

nfolds = 5 # for cross-validation


Let’s first fit a GLM model. The best performing α hyperparameter value (for controlling L1 vs. L2 regularization) for GLM will be found using GridSearch, as shown in the below code snippet.

g = h2o.grid.H2OGridSearch(
    H2OGeneralizedLinearEstimator(family="gaussian",
                                  nfolds=nfolds,
                                  fold_assignment="Modulo",
                                  keep_cross_validation_predictions=True,
                                  lambda_search=True),
    hyper_params={
        "alpha": [x * 0.01 for x in range(0, 100)],
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 8,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
g

#glm Grid Build progress: |████████████████████████████████████████████████| 100%
#                     alpha  \
#0                   [0.61]
#1                   [0.78]
#2                   [0.65]
#3                   [0.13]
#4    [0.35000000000000003]
#5                   [0.05]
#6                   [0.32]
#7                   [0.55]

#                                              model_ids     residual_deviance
#0  Grid_GLM_train_model_python_1628864392402_41_model_3  2.626981989511134E15
#1  Grid_GLM_train_model_python_1628864392402_41_model_6  2.626981989511134E15
#2  Grid_GLM_train_model_python_1628864392402_41_model_5  2.626981989511134E15
#3  Grid_GLM_train_model_python_1628864392402_41_model_2  2.626981989511134E15
#4  Grid_GLM_train_model_python_1628864392402_41_model_4  2.626981989511134E15
#5  Grid_GLM_train_model_python_1628864392402_41_model_7  2.626981989511134E15
#6  Grid_GLM_train_model_python_1628864392402_41_model_0  2.626981989511134E15
#7  Grid_GLM_train_model_python_1628864392402_41_model_1  2.626981989511134E15  
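The "RandomDiscrete" strategy used above samples hyper-parameter combinations uniformly at random from the declared grid, stopping when max_models or max_runtime_secs is reached. A pure-Python sketch of that sampling logic (an illustration of the idea, not H2O's actual implementation):

```python
import random

def random_discrete_search(hyper_params, max_models, seed=123):
    """Sample distinct random hyper-parameter combinations from a discrete grid."""
    rng = random.Random(seed)
    seen, models = set(), []
    while len(models) < max_models:
        combo = tuple((name, rng.choice(values))
                      for name, values in sorted(hyper_params.items()))
        if combo in seen:  # skip already-tried combinations
            continue
        seen.add(combo)
        models.append(dict(combo))
    return models

grid = {"alpha": [x * 0.01 for x in range(0, 100)]}
candidates = random_discrete_search(grid, max_models=8)
print(candidates)
```

Random search covers a large grid (here 100 alpha values) much more cheaply than exhaustive search, at the cost of possibly missing the exact optimum.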

### Model 1

model_GLM = H2OGeneralizedLinearEstimator(
    family='gaussian', #'gamma',
    model_id='glm_house',
    nfolds=nfolds,
    alpha=0.61,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True)

%time model_GLM.train(x, y, train)
#glm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 259 ms

model_GLM.cross_validation_metrics_summary().as_data_frame()
model_GLM.model_performance(test)
#ModelMetricsRegressionGLM: glm
#** Reported on test data. **

#MSE: 128806123545.59714
#RMSE: 358895.7000934911
#MAE: 233890.6933813204
#RMSLE: 0.5456714021880726
#R^2: 0.03102347771355851
#Mean Residual Deviance: 128806123545.59714
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2129
#Null deviance: 285935013037402.7
#Residual deviance: 277061971746579.44
#AIC: 61176.23965800522

As can be seen from above, GLM could not achieve the target of RMSE below $123k, either on cross-validation or on the test dataset.

The below models (GBM, DRF and DL) and the corresponding parameters were found with the AutoML leaderboard and GridSearch, along with some manual tuning.

from h2o.automl import H2OAutoML

model_auto = H2OAutoML(max_runtime_secs=60, seed=123)
model_auto.train(x, y, train)
# AutoML progress: |████████████████████████████████████████████████████████| 100%
# Parse progress: |█████████████████████████████████████████████████████████| 100%

model_auto.leaderboard
model_auto.leader.model_performance(test)
# model_auto.leader.explain(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **
#MSE: 17456681023.716145
#RMSE: 132123.73376390839
#MAE: 77000.00253466706
#RMSLE: 0.1899899418603569
#Mean Residual Deviance: 17456681023.716145

model = h2o.get_model(model_auto.leaderboard[4, 'model_id']) # get model by model_id
print(model.params['model_id']['actual']['name'])
print(model.model_performance(test).rmse())
[(k, v) for (k, v) in model.params.items() if v['default'] != v['actual'] and \
    not k in ['model_id', 'training_frame', 'validation_frame', 'nfolds',
              'keep_cross_validation_predictions', 'seed', 'response_column',
              'fold_assignment', 'ignored_columns']]
# GBM_grid_0_AutoML_20210813_201225_model_0
# 235011.60404473927
# [('score_tree_interval', {'default': 0, 'actual': 5}),
#  ('ntrees', {'default': 50, 'actual': 60}),
#  ('max_depth', {'default': 5, 'actual': 6}),
#  ('min_rows', {'default': 10.0, 'actual': 1.0}),
#  ('stopping_tolerance', {'default': 0.001, 'actual': 0.008577452408351779}),
#  ('seed', {'default': -1, 'actual': 123}),
#  ('distribution', {'default': 'AUTO', 'actual': 'gaussian'}),
#  ('sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate_per_tree', {'default': 1.0, 'actual': 0.8})]

### Model 2

model_GBM = H2OGradientBoostingEstimator(
    model_id='gbm_house',
    nfolds=nfolds,
    ntrees=500,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)
%time model_GBM.train(x, y, train)
#gbm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 54.9 s

model_GBM.cross_validation_metrics_summary().as_data_frame()

As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 117288.305, which is below $123k.

model_GBM.model_performance(test)

#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

As can be seen from above, GBM could achieve the target of RMSE below $123k on the test dataset.

Now, let’s try a random forest model, by finding the best parameters with Grid Search:

g = h2o.grid.H2OGridSearch(
    H2ORandomForestEstimator(
        nfolds=nfolds,
        fold_assignment="Modulo",
        keep_cross_validation_predictions=True,
        seed=123),
    hyper_params={
        "ntrees": [20, 25, 30],
        "stopping_tolerance": [0.005, 0.006, 0.0075],
        "max_depth": [20, 50, 100],
        "min_rows": [5, 7, 10]
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 10,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
#drf Grid Build progress: |████████████████████████████████████████████████| 100%
g
#  max_depth min_rows ntrees stopping_tolerance
#0       100      5.0     20              0.006
#1       100      5.0     20              0.005
#2       100      5.0     20              0.005
#3       100      7.0     30              0.006
#4        50     10.0     25              0.006
#5        50     10.0     20              0.005

#                                              model_ids      residual_deviance
#0  Grid_DRF_train_model_python_1628864392402_40_model_0  2.0205038467456142E10
#1  Grid_DRF_train_model_python_1628864392402_40_model_5  2.0205038467456142E10
#2  Grid_DRF_train_model_python_1628864392402_40_model_1  2.0205038467456142E10
#3  Grid_DRF_train_model_python_1628864392402_40_model_3   2.099520493338354E10
#4  Grid_DRF_train_model_python_1628864392402_40_model_2   2.260686283035833E10
#5  Grid_DRF_train_model_python_1628864392402_40_model_4   2.279037520277947E10

### Model 3

model_RF = H2ORandomForestEstimator(
    model_id='rf_house',
    nfolds=nfolds,
    ntrees=20,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)
%time model_RF.train(x, y, train)
#drf Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 13.2 s

model_RF.cross_validation_metrics_summary().as_data_frame()
model_RF.model_performance(test)
#ModelMetricsRegression: drf
#** Reported on test data. **
#MSE: 16405336914.530426
#RMSE: 128083.3202041953
#MAE: 71572.37981480274
#RMSLE: 0.17712324625977907
#Mean Residual Deviance: 16405336914.530426

As can be seen from above, DRF just missed the target of RMSE below $123k, both on cross-validation and on the test dataset.

Now, let’s try to fit a deep learning model, again tuning the parameters with Grid Search.

g = h2o.grid.H2OGridSearch(
    H2ODeepLearningEstimator(
        nfolds=nfolds,
        fold_assignment="Modulo",
        keep_cross_validation_predictions=True,
        reproducible=True,
        seed=123),
    hyper_params={
        "epochs": [20, 25],
        "hidden": [[20, 20, 20], [25, 25, 25]],
        "stopping_rounds": [0, 5],
        "stopping_tolerance": [0.006]
    },
    search_criteria={
        "strategy": "RandomDiscrete",
        "max_models": 10,
        "stopping_metric": "rmse",
        "max_runtime_secs": 60
    }
)
g.train(x, y, train)
g
#deeplearning Grid Build progress: |███████████████████████████████████████| 100%

#                 epochs        hidden stopping_rounds stopping_tolerance  \
#0     16.79120554889533  [25, 25, 25]               0              0.006
#1    3.1976799968879086  [25, 25, 25]               0              0.006

#                                                       model_ids  \
#0  Grid_DeepLearning_train_model_python_1628864392402_55_model_0
#1  Grid_DeepLearning_train_model_python_1628864392402_55_model_1

#       residual_deviance
#0  1.6484562934855278E10
#1  2.1652538389322113E10


### Model 4

model_DL = H2ODeepLearningEstimator(
    epochs=30,
    model_id='dl_house',
    nfolds=nfolds,
    stopping_rounds=7,
    stopping_tolerance=0.006,
    hidden=[30, 30, 30],
    reproducible=True,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123
)
%time model_DL.train(x, y, train)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 55.7 s

model_DL.cross_validation_metrics_summary().as_data_frame()

As can be seen from the above table (row 5, column 1), the mean RMSE for cross-validation is 121793.58, which is below $123k.

model_DL.model_performance(test)
#ModelMetricsRegression: deeplearning
#** Reported on test data. **
#MSE: 14781990070.095192
#RMSE: 121581.20771770278
#MAE: 72522.60487846025
#RMSLE: 0.1834924698171073
#Mean Residual Deviance: 14781990070.095192

As can be seen from above, the deep learning model could achieve the target of RMSE below $123k on the test dataset.

3. Finally, let’s train a stacked ensemble of the models created in the earlier steps. We may need to repeat steps two and three until the best model (which is usually the ensemble model, but does not have to be) has the minimum required performance on the cross-validation dataset. Note: only one model has to achieve the minimum required performance. If multiple models achieve it, we choose the best performing one.

models = [model_GBM.model_id, model_RF.model_id, model_DL.model_id] #model_GLM.model_id,
model_SE = H2OStackedEnsembleEstimator(model_id = 'se_gbm_dl_house', base_models=models)

%time model_SE.train(x, y, train)
#stackedensemble Model Build progress: |███████████████████████████████████| 100%
#Wall time: 2.67 s
model_SE.model_performance(test)
#ModelMetricsRegressionGLM: stackedensemble
#** Reported on test data. **

#MSE: 130916347835.45828
#RMSE: 361823.6418967924
#MAE: 236448.3672215734
#RMSLE: 0.5514878971097109
#R^2: 0.015148783736682492
#Mean Residual Deviance: 130916347835.45828
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2147
#Null deviance: 285935013037402.7
#Residual deviance: 281601064194070.75
#AIC: 61175.193832813566

As can be seen from above, the stacked ensemble model could not reach the required performance, either on the cross-validation or on the test dataset.

4. Now let’s get the performance of the chosen model/ensemble on the test data, and confirm that it also reaches the minimum target there.

### Best Model

The model that performs best in terms of mean cross-validation RMSE and RMSE on the test dataset (both of them are below the minimum target $123k) is the gradient boosting model (GBM), which is Model 2 above.

model_GBM.model_performance(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

# save the models
h2o.save_model(model_GBM, 'best_model (GBM)') # the final best model
h2o.save_model(model_SE, 'SE_model')
h2o.save_model(model_GBM, 'GBM_model')
h2o.save_model(model_RF, 'RF_model')
h2o.save_model(model_GLM, 'GLM_model')
h2o.save_model(model_DL, 'DL_model')

# Neural Translation – Machine Translation with Neural Nets with Keras / Python

In this blog, we shall discuss how to build a neural network to translate from English to German. This problem appeared as the Capstone project for the coursera course “Tensorflow 2: Customising your model“, a part of the specialization “Tensorflow2 for Deep Learning“, by the Imperial College, London. The problem statement / description / steps are taken from the course itself. We shall use the concepts from the course, including building more flexible model architectures, freezing layers, data processing pipelines and sequence modelling.

Here we shall use a language dataset from http://www.manythings.org/anki/ to build a neural translation model. This dataset consists of over 200k pairs of sentences in English and German. In order to make the training quicker, we will restrict our dataset to 20k pairs. The below figure shows a few sentence pairs taken from the file.

Our goal is to develop a neural translation model from English to German, making use of a pre-trained English word embedding module.
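The loading of the sentence pairs is not shown here; under the assumption that the Anki file is tab-separated (English, German, attribution), it could be sketched as follows, illustrated on a tiny inline sample in place of the real file:

```python
# hypothetical sketch: the Anki file is assumed tab-separated "English<TAB>German<TAB>attribution";
# a tiny inline sample stands in for the real deu.txt here
sample = "Hi.\tHallo!\tCC-BY\nRun!\tLauf!\tCC-BY\n"
with open('deu_sample.txt', 'w', encoding='utf8') as f:
    f.write(sample)

english_sentences, german_sentences = [], []
with open('deu_sample.txt', encoding='utf8') as f:
    for i, line in enumerate(f):
        if i >= 20000:          # keep only the first 20k pairs for quicker training
            break
        parts = line.rstrip('\n').split('\t')
        english_sentences.append(parts[0].lower())
        german_sentences.append(parts[1].lower())
```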

## 1. Text preprocessing

We need to start with preprocessing the above input file. Here are the steps that we need to follow:

• First let’s create separate lists of English and German sentences.
• Add special “&lt;start&gt;” and “&lt;end&gt;” tokens to the beginning and end of every German sentence.
• Use the Tokenizer class from the tf.keras.preprocessing.text module to tokenize the German sentences, ensuring that no character filters are applied.
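The tokenization step itself is not shown; a minimal sketch using the Tokenizer class (with an empty filters string so that the angle-bracket tokens are kept intact, illustrated here on two toy sentences in place of the real corpus) could be:

```python
from tensorflow.keras.preprocessing.text import Tokenizer

# toy stand-ins for the real German sentences with <start>/<end> tokens already added
german_sentences = ['<start> hallo welt <end>', '<start> guten morgen <end>']

tokenizer = Tokenizer(filters='', lower=True)  # empty filters: keep '<' and '>' intact
tokenizer.fit_on_texts(german_sentences)
tokenized_german_sentences = tokenizer.texts_to_sequences(german_sentences)
```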

The next figure shows 5 randomly chosen examples of (preprocessed) English and German sentence pairs. For the German sentence, the text (with start and end tokens) as well as the tokenized sequence are shown.

• Pad the end of the tokenized German sequences with zeros, and batch the complete set of sequences into a single numpy array, using the following code snippet.

padded_tokenized_german_sentences = tf.keras.preprocessing.sequence.pad_sequences(tokenized_german_sentences,
                                                                                  maxlen=14, padding='post')
padded_tokenized_german_sentences.shape
#(20000, 14)


As can be seen from the next code block, the maximum length of a German sentence is 14, whereas there are 5743 unique words in the German sentences from the subset of the corpus. The index of the <start> token is 1.

max([len(tokenized_german_sentences[i]) for i in range(20000)])
# 14
len(tokenizer.index_word)
# 5743
tokenizer.word_index['<start>']
# 1


## 2. Preparing the data with tf.data.Dataset

As part of the dataset preprocessing for this project we shall use a pre-trained English word-embedding module from TensorFlow Hub. The URL for the module is https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1.

This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space.

Although this model can also be used as a sentence embedding module (it would then process each input string by removing punctuation, splitting on spaces, and averaging the word embeddings over the sentence to give a single embedding vector), we will use it only as a word embedding module here, and will pass each word in the input sentence as a separate token.

The following code snippet shows how an English sentence with 7 words is mapped into a 7×128 tensor in the embedding space.


embedding_layer(tf.constant(["these", "aren't", "the", "droids", "you're", "looking", "for"])).shape
# TensorShape([7, 128])



Now, let’s prepare the training and validation Datasets as follows:

• Create a random training and validation set split of the data, reserving e.g. 20% of the data for validation (each English dataset example is a single sentence string, and each German dataset example is a sequence of padded integer tokens).
• Load the training and validation sets into a tf.data.Dataset object, passing in a tuple of English and German data for both training and validation sets, using the following code snippet.

def make_Dataset(input_array, target_array):
    return tf.data.Dataset.from_tensor_slices((input_array, target_array))

train_data = make_Dataset(input_train, target_train)
valid_data = make_Dataset(input_valid, target_valid)
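The 80/20 split itself is not shown above; it could be sketched with sklearn's train_test_split (illustrated here on stand-in data; `random_state` is an arbitrary choice for reproducibility):

```python
from sklearn.model_selection import train_test_split
import numpy as np

# illustrative stand-ins for the real data prepared earlier:
# English sentence strings and padded German token sequences of length 14
english_sentences = ['hello .', 'how are you ?', 'good morning .', 'thanks .', 'bye .']
padded_tokenized_german_sentences = np.zeros((5, 14), dtype=np.int32)

# hypothetical 80/20 train/validation split
input_train, input_valid, target_train, target_valid = train_test_split(
    english_sentences, padded_tokenized_german_sentences, test_size=0.2, random_state=42)
```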


• Create a function to map over the datasets that splits each English sentence at spaces. Apply this function to both Dataset objects using the map method, using the following code snippet.

def str_split(e, g):
    e = tf.strings.split(e)
    return e, g

train_data = train_data.map(str_split)
valid_data = valid_data.map(str_split)


• Create a function to map over the datasets that embeds each sequence of English words using the loaded embedding layer/model. Apply this function to both Dataset objects using the map method, using the following code snippet.

def embed_english(x, y):
    return embedding_layer(x), y

train_data = train_data.map(embed_english)
valid_data = valid_data.map(embed_english)


• Create a function to filter out dataset examples where the English sentence is more than 13 (embedded) tokens in length. Apply this function to both Dataset objects using the filter method, using the following code snippet.

def remove_long_sentence(e, g):
    return tf.shape(e)[0] <= 13

train_data = train_data.filter(remove_long_sentence)
valid_data = valid_data.filter(remove_long_sentence)


• Create a function to map over the datasets that pads each English sequence of embeddings with some distinct padding value before the sequence, so that each sequence is length 13. Apply this function to both Dataset objects using the map method, as shown in the next code block.
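The padding snippet referred to above is not included in the post; a minimal sketch (assuming zeros as the distinct padding value, applied before the sequence) could be:

```python
import tensorflow as tf

def pad_english(e, g):
    # pad the embedded English sequence at the FRONT with zeros so that every
    # sequence has length 13 (assumption: zeros are the distinct padding value)
    n = 13 - tf.shape(e)[0]
    return tf.pad(e, [[n, 0], [0, 0]], constant_values=0.0), g

# toy example: a 5-token "embedded" sentence of dimension 128
e = tf.ones((5, 128))
g = tf.zeros((14,), dtype=tf.int32)
e_padded, _ = pad_english(e, g)
```

This could then be applied to both Dataset objects with `train_data.map(pad_english)` and `valid_data.map(pad_english)`.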



• Batch both training and validation Datasets with a batch size of 16.

train_data = train_data.batch(16)
valid_data = valid_data.batch(16)


• Let’s now print the element_spec property for the training and validation Datasets. Also, let’s print the shape of an English data example from the training Dataset and a German data example Tensor from the validation Dataset.

train_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
# TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

valid_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
#TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

for e, g in train_data.take(1):
    print(e.shape)
#(16, 13, 128)

for e, g in valid_data.take(1):
    print(g)
#tf.Tensor(
#[[   1   11  152    6  458    3    2    0    0    0    0    0    0    0]
# [   1   11  333  429    3    2    0    0    0    0    0    0    0    0]
# [   1   11   59   12    3    2    0    0    0    0    0    0    0    0]
# [   1  990   25   42  444    7    2    0    0    0    0    0    0    0]
# [   1    4   85 1365    3    2    0    0    0    0    0    0    0    0]
# [   1  131    8   22    5  583    3    2    0    0    0    0    0    0]
# [   1    4   85 1401    3    2    0    0    0    0    0    0    0    0]
# [   1   17  381   80    3    2    0    0    0    0    0    0    0    0]
# [   1 2998   13   33    7    2    0    0    0    0    0    0    0    0]
# [   1  242    6  479    3    2    0    0    0    0    0    0    0    0]
# [   1   35   17   40    7    2    0    0    0    0    0    0    0    0]
# [   1   11   30  305   46   47 1913  471    3    2    0    0    0    0]
# [   1    5   48 1184    3    2    0    0    0    0    0    0    0    0]
# [   1    5  287   12  834 5268    3    2    0    0    0    0    0    0]
# [   1    5    6  523    3    2    0    0    0    0    0    0    0    0]
# [   1   13  109   28   29   44  491    3    2    0    0    0    0    0]], shape=(16, 14), dtype=int32)



## The custom translation model

The following is a schematic of the custom translation model architecture we shall develop now.

The custom model consists of an encoder RNN and a decoder RNN. The encoder takes words of an English sentence as input, and uses a pre-trained word embedding to embed the words into a 128-dimensional space. To indicate the end of the input sentence, a special end token (in the same 128-dimensional space) is passed in as an input. This token is a TensorFlow Variable that is learned in the training phase (unlike the pre-trained word embedding, which is frozen).

The decoder RNN takes the internal state of the encoder network as its initial state. A start token is passed in as the first input, which is embedded using a learned German word embedding. The decoder RNN then makes a prediction for the next German word, which during inference is then passed in as the following input, and this process is repeated until the special <end> token is emitted from the decoder.

## Create the custom layer

Let’s first build a custom layer that adds the learned end token embedding to the encoder inputs; this layer will later be used to create the encoder.

• Using layer subclassing, create a custom layer that takes a batch of English data examples from one of the Datasets, and adds a learned embedded ‘end’ token to the end of each sequence.
• This layer should create a TensorFlow Variable (that will be learned during training) that is 128-dimensional (the size of the embedding space).

from tensorflow.keras.models import  Sequential, Model
from tensorflow.keras.layers import Layer, Concatenate, Input, Masking, LSTM, Embedding, Dense
from tensorflow.keras.losses import SparseCategoricalCrossentropy

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)
        self.embed = tf.Variable(initial_value=tf.zeros(shape=(1, 128)), trainable=True, dtype='float32')

    def call(self, inputs):
        # tile the learned end-token embedding across the batch dimension
        x = tf.tile(self.embed, [tf.shape(inputs)[0], 1])
        x = tf.expand_dims(x, axis=1)
        return tf.concat([inputs, x], axis=1)
        #return Concatenate(axis=1)([inputs, x])


• Let’s extract a batch of English data examples from the training Dataset and print the shape. Test the custom layer by calling the layer on the English data batch Tensor and print the resulting Tensor shape (the layer should increase the sequence length by one).

custom_layer = CustomLayer()
e, g = next(iter(train_data.take(1)))
print(e.shape)
# (16, 13, 128)
o = custom_layer(e)
o.shape
# TensorShape([16, 14, 128])



## Build the encoder network

The encoder network follows the schematic diagram above. Now let’s build the RNN encoder model.

• Using the keras functional API, build the encoder network according to the following spec:
• The model will take a batch of sequences of embedded English words as input, as given by the Dataset objects.
• The next layer in the encoder will be the custom layer you created previously, to add a learned end token embedding to the end of the English sequence.
• This is followed by a Masking layer, with the mask_value set to the distinct padding value you used when you padded the English sequences with the Dataset preprocessing above.
• The final layer is an LSTM layer with 512 units, which also returns the hidden and cell states.
• The encoder is a multi-output model. There should be two output Tensors of this model: the hidden and cell states of the LSTM layer. The output sequence of the LSTM layer is unused.

inputs = Input(batch_shape=(None, 13, 128), name='input')
x = CustomLayer(name='custom_layer')(inputs)
# mask_value should equal the distinct padding value used above (zero here)
x = Masking(mask_value=0., name='masking_layer')(x)
x, h, c = LSTM(units=512, return_state=True, name='lstm')(x)
encoder_model = Model(inputs=inputs, outputs=[h, c], name='encoder')

encoder_model.summary()

# Model: "encoder"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# input (InputLayer)           [(None, 13, 128)]         0
# _________________________________________________________________
# custom_layer (CustomLayer)   (None, 14, 128)           128
# _________________________________________________________________
# masking_layer (Masking)      (None, 14, 128)           0
# _________________________________________________________________
# lstm (LSTM)                  [(None, 512), (None, 512) 1312768
# =================================================================
# Total params: 1,312,896
# Trainable params: 1,312,896
# Non-trainable params: 0
# _________________________________________________________________



## Build the decoder network

The decoder network follows the schematic diagram below.

Now let’s build the RNN decoder model.

• Using Model subclassing, build the decoder network according to the following spec:
• The initializer should create the following layers:
• An Embedding layer with vocabulary size set to the number of unique German tokens, embedding dimension 128, and set to mask zero values in the input.
• An LSTM layer with 512 units, that returns its hidden and cell states, and also returns sequences.
• A Dense layer with number of units equal to the number of unique German tokens, and no activation function.
• The call method should include the usual inputs argument, as well as the additional keyword arguments hidden_state and cell_state. The default value for these keyword arguments should be None.
• The call method should pass the inputs through the Embedding layer, and then through the LSTM layer. If the hidden_state and cell_state arguments are provided, these should be used for the initial state of the LSTM layer.
• The call method should pass the LSTM output sequence through the Dense layer, and return the resulting Tensor, along with the hidden and cell states of the LSTM layer.

class Decoder(Model):

    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embed = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=128, mask_zero=True, name='embedding_layer')
        self.lstm = LSTM(units=512, return_state=True, return_sequences=True, name='lstm_layer')
        self.dense = Dense(len(tokenizer.index_word)+1, name='dense_layer')

    def call(self, inputs, hidden_state=None, cell_state=None):
        x = self.embed(inputs)
        x, hidden_state, cell_state = self.lstm(x, initial_state=[hidden_state, cell_state]) \
            if hidden_state is not None and cell_state is not None else self.lstm(x)
        x = self.dense(x)
        return x, hidden_state, cell_state

decoder_model = Decoder(name='decoder')
e, g_in = next(iter(train_data.take(1)))
h, c = encoder_model(e)
g_out, h, c = decoder_model(g_in, h, c)

print(g_out.shape, h.shape, c.shape)
# (16, 14, 5744) (16, 512) (16, 512)

decoder_model.summary()

#Model: "decoder"
#_________________________________________________________________
#Layer (type)                 Output Shape              Param #
#=================================================================
#embedding_layer (Embedding)  multiple                  735232
#_________________________________________________________________
#lstm_layer (LSTM)            multiple                  1312768
#_________________________________________________________________
#dense_layer (Dense)          multiple                  2946672
#=================================================================
#Total params: 4,994,672
#Trainable params: 4,994,672
#Non-trainable params: 0



## Create a custom training loop

Let’s now create a custom training loop to train the custom neural translation model.

• Define a function that takes a Tensor batch of German data (as extracted from the training Dataset), and returns a tuple containing German inputs and outputs for the decoder model (refer to schematic diagram above).
• Define a function that computes the forward and backward pass for your translation model. This function should take an English input, German input and German output as arguments, and should do the following:
• Pass the English input into the encoder, to get the hidden and cell states of the encoder LSTM.
• These hidden and cell states are then passed into the decoder, along with the German inputs, which returns a sequence of outputs (the hidden and cell state outputs of the decoder LSTM are unused in this function).
• The loss should then be computed between the decoder outputs and the German output function argument.
• The function returns the loss and gradients with respect to the encoder and decoder’s trainable variables.
• Decorate the function with @tf.function
• Define and run a custom training loop for a number of epochs (for you to choose) that does the following:
• Iterates through the training dataset, and creates decoder inputs and outputs from the German sequences.
• Updates the parameters of the translation model using the gradients of the function above and an optimizer object.
• Every epoch, compute the validation loss on a number of batches from the validation set, and save the epoch training and validation losses.
• Plot the learning curves for loss vs epoch for both training and validation sets.
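The helper `get_german_decoder_data` used in the training loop below is not defined in the snippets above; a minimal teacher-forcing sketch (the decoder input drops the last token, the decoder target drops the first) could be:

```python
import numpy as np

def get_german_decoder_data(g):
    # teacher forcing: decoder input = all tokens except the last,
    # decoder target = all tokens except the first (shifted by one)
    return g[:, :-1], g[:, 1:]

# toy batch of padded German token sequences (batch=2, length=5)
g = np.array([[1, 7, 9, 2, 0],
              [1, 4, 2, 0, 0]])
g_in, g_out = get_german_decoder_data(g)
```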

@tf.function
def forward_backward(encoder_model, decoder_model, e, g_in, g_out, loss):
    with tf.GradientTape() as tape:
        h, c = encoder_model(e)
        d_g_out, _, _ = decoder_model(g_in, h, c)
        cur_loss = loss(g_out, d_g_out)
    grads = tape.gradient(cur_loss, encoder_model.trainable_variables + decoder_model.trainable_variables)
    return cur_loss, grads

def train_encoder_decoder(encoder_model, decoder_model, num_epochs, train_data, valid_data, valid_steps,
                          optimizer, loss, grad_fn):
    train_losses = []
    val_losses = []
    for epoch in range(num_epochs):
        train_epoch_loss_avg = tf.keras.metrics.Mean()
        val_epoch_loss_avg = tf.keras.metrics.Mean()
        for e, g in train_data:
            g_in, g_out = get_german_decoder_data(g)
            train_loss, grads = grad_fn(encoder_model, decoder_model, e, g_in, g_out, loss)
            optimizer.apply_gradients(zip(grads, encoder_model.trainable_variables + decoder_model.trainable_variables))
            train_epoch_loss_avg.update_state(train_loss)
        for e_v, g_v in valid_data.take(valid_steps):
            g_v_in, g_v_out = get_german_decoder_data(g_v)
            val_loss, _ = grad_fn(encoder_model, decoder_model, e_v, g_v_in, g_v_out, loss)
            val_epoch_loss_avg.update_state(val_loss)
        print(f'epoch: {epoch}, train loss: {train_epoch_loss_avg.result()}, validation loss: {val_epoch_loss_avg.result()}')
        train_losses.append(train_epoch_loss_avg.result())
        val_losses.append(val_epoch_loss_avg.result())
    return train_losses, val_losses

optimizer_obj = tf.keras.optimizers.Adam()  # optimizer object (not shown in the original snippet; Adam assumed)
loss_obj = SparseCategoricalCrossentropy(from_logits=True)
train_loss_results, valid_loss_results = train_encoder_decoder(encoder_model, decoder_model, 20, train_data,
                                                               valid_data, 20, optimizer_obj, loss_obj, forward_backward)

#epoch: 0, train loss: 4.4570465087890625, validation loss: 4.1102800369262695
#epoch: 1, train loss: 3.540217399597168, validation loss: 3.36271333694458
#epoch: 2, train loss: 2.756622076034546, validation loss: 2.7144060134887695
#epoch: 3, train loss: 2.049957275390625, validation loss: 2.1480133533477783
#epoch: 4, train loss: 1.4586931467056274, validation loss: 1.7304519414901733
#epoch: 5, train loss: 1.0423369407653809, validation loss: 1.4607685804367065
#epoch: 6, train loss: 0.7781839370727539, validation loss: 1.314332127571106
#epoch: 7, train loss: 0.6160411238670349, validation loss: 1.2391613721847534
#epoch: 8, train loss: 0.5013922452926636, validation loss: 1.1840368509292603
#epoch: 9, train loss: 0.424654096364975, validation loss: 1.1716119050979614
#epoch: 10, train loss: 0.37027251720428467, validation loss: 1.1612160205841064
#epoch: 11, train loss: 0.3173922598361969, validation loss: 1.1330692768096924
#epoch: 12, train loss: 0.2803193926811218, validation loss: 1.1394184827804565
#epoch: 13, train loss: 0.24854864180088043, validation loss: 1.1354353427886963
#epoch: 14, train loss: 0.22135266661643982, validation loss: 1.1059410572052002
#epoch: 15, train loss: 0.2019050121307373, validation loss: 1.1111358404159546
#epoch: 16, train loss: 0.1840481162071228, validation loss: 1.1081823110580444
#epoch: 17, train loss: 0.17126116156578064, validation loss: 1.125329852104187
#epoch: 18, train loss: 0.15828527510166168, validation loss: 1.0979799032211304
#epoch: 19, train loss: 0.14451280236244202, validation loss: 1.0899451971054077

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Loss", fontsize=14)
plt.title('Loss vs epochs')
plt.plot(train_loss_results, label='train')
plt.plot(valid_loss_results, label='valid')
plt.legend()
plt.show()



The following figure shows how the training and validation loss decrease with epochs (the model is trained for 20 epochs).

## Use the model to translate

Now it’s time to put the model into practice! Let’s run the translation for five randomly sampled English sentences from the dataset. For each sentence, the process is as follows:

• Preprocess and embed the English sentence according to the model requirements.
• Pass the embedded sentence through the encoder to get the encoder hidden and cell states.
• Starting with the special "<start>" token, use this token and the final encoder hidden and cell states to get the one-step prediction from the decoder, as well as the decoder’s updated hidden and cell states.
• Create a loop to get the next step prediction and updated hidden and cell states from the decoder, using the most recent hidden and cell states. Terminate the loop when the "<end>" token is emitted, or when the sentence has reached a maximum length.
• Decode the output token sequence into German text and print the English text and the model’s German translation.


indices = np.random.choice(len(english_sentences), 5)
test_data = tf.data.Dataset.from_tensor_slices(np.array([english_sentences[i] for i in indices]))
test_data = test_data.map(tf.strings.split)
test_data = test_data.map(embedding_layer)
test_data = test_data.filter(lambda x: tf.shape(x)[0] <= 13)
print(test_data.element_spec)
# TensorSpec(shape=(None, 128), dtype=tf.float32, name=None)

start_token = np.array(tokenizer.texts_to_sequences(['<start>']))
end_token = np.array(tokenizer.texts_to_sequences(['<end>']))
for e, i in zip(test_data.take(5), indices):
    h, c = encoder_model(tf.expand_dims(e, axis=0))
    g_t = []
    g_in = start_token
    g_out, h, c = decoder_model(g_in, h, c)
    g_t.append('')
    g_out = tf.argmax(g_out, axis=2)
    while g_out != end_token:
        g_out, h, c = decoder_model(g_in, h, c)
        g_out = tf.argmax(g_out, axis=2)
        g_in = g_out
        g_t.append(tokenizer.index_word.get(tf.squeeze(g_out).numpy(), 'UNK'))
    print(f'English Text: {english_sentences[i]}')
    print(f'German Translation: {" ".join(g_t)}')
    print()

# English Text: i'll see tom .
# German Translation:  ich werde tom folgen .

# English Text: you're not alone .
# German Translation:  keine nicht allein .

# English Text: what a hypocrite !
# German Translation:  fuer ein idiot !

# English Text: he kept talking .
# German Translation:  sie hat ihn erwuergt .

# English Text: tom's in charge .
# German Translation:  tom ist im bett .



The above output shows the sample English sentences and their German translations predicted by the model.

The following animation (click and open in a new tab) shows how the predicted German translation improves (as the loss decreases) for a few sample English sentences as the deep learning model is trained for more and more epochs.

As can be seen from the above animation, the translation gets better as the model is trained for more epochs. The following is the youtube version of the same.

# NLP with Bangla: semantic similarity with word2vec, Deep learning (RNN) to generate Bangla song-like texts and to do sentiment analysis on astrological prediction dataset, creating a simple Bangla ChatBot using RASA NLU with Python

In this blog, we shall discuss a few NLP techniques for the Bangla language. We shall start with a demonstration of how to train a word2vec model on the Bangla wiki corpus with tensorflow, and how to visualize the semantic similarity between words using t-SNE. Next, we shall demonstrate how to train a character / word LSTM on selected Tagore songs to generate Tagore-like songs with keras. Next, we shall create a sentiment analysis dataset by crawling the daily astrological prediction pages of a leading Bangla newspaper and manually labeling the sentiment of each prediction corresponding to each moon-sign. We shall train an LSTM sentiment analysis model to predict the sentiment of a moon-sign prediction. Finally, we shall use RASA NLU (natural language understanding) to build a very simple chatbot in Bangla.

## Word2vec model with Bangla wiki corpus with tensorflow

• Let’s start by importing the required libraries.

import collections
import numpy as np
import tensorflow as tf
from matplotlib import pylab

• Download the Bangla wikipedia corpus from Kaggle. The first few lines from the corpus are shown below:

id,text,title,url

1528,

“রবীন্দ্রনাথ ঠাকুর”

রবীন্দ্রনাথ ঠাকুর (৭ই মে, ১৮৬১ – ৭ই আগস্ট, ১৯৪১) (২৫ বৈশাখ, ১২৬৮ – ২২ শ্রাবণ, ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। তাঁকে বাংলা ভাষার সর্বশ্রেষ্ঠ সাহিত্যিক মনে করা হয়। রবীন্দ্রনাথকে গুরুদেব, কবিগুরু ও বিশ্বকবি অভিধায় ভূষিত করা হয়। রবীন্দ্রনাথের ৫২টি কাব্যগ্রন্থ, ৩৮টি নাটক, ১৩টি উপন্যাস ও ৩৬টি প্রবন্ধ ও অন্যান্য গদ্যসংকলন তাঁর জীবদ্দশায় বা মৃত্যুর অব্যবহিত পরে প্রকাশিত হয়। তাঁর সর্বমোট ৯৫টি ছোটগল্প ও ১৯১৫টি গান যথাক্রমে “”গল্পগুচ্ছ”” ও “”গীতবিতান”” সংকলনের অন্তর্ভুক্ত হয়েছে। রবীন্দ্রনাথের যাবতীয় প্রকাশিত ও গ্রন্থাকারে অপ্রকাশিত রচনা ৩২ খণ্ডে “”রবীন্দ্র রচনাবলী”” নামে প্রকাশিত হয়েছে। রবীন্দ্রনাথের যাবতীয় পত্রসাহিত্য উনিশ খণ্ডে “”চিঠিপত্র”” ও চারটি পৃথক গ্রন্থে প্রকাশিত। এছাড়া তিনি প্রায় দুই হাজার ছবি এঁকেছিলেন। রবীন্দ্রনাথের রচনা বিশ্বের বিভিন্ন ভাষায় অনূদিত হয়েছে। ১৯১৩ সালে “”গীতাঞ্জলি”” কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি সাহিত্যে নোবেল পুরস্কার লাভ করেন।রবীন্দ্রনাথ ঠাকুর কলকাতার এক ধনাঢ্য ও সংস্কৃতিবান ব্রাহ্ম পিরালী ব্রাহ্মণ পরিবারে জন্মগ্রহণ করেন। বাল্যকালে প্রথাগত বিদ্যালয়-শিক্ষা তিনি গ্রহণ করেননি; গৃহশিক্ষক রেখে বাড়িতেই তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। আট বছর বয়সে তিনি কবিতা লেখা শুরু করেন। ১৮৭৪ সালে “”তত্ত্ববোধিনী পত্রিকা””-এ তাঁর “””” কবিতাটি প্রকাশিত হয়। এটিই ছিল তাঁর প্রথম প্রকাশিত রচনা। ১৮৭৮ সালে মাত্র সতেরো বছর বয়সে রবীন্দ্রনাথ প্রথমবার ইংল্যান্ডে যান। ১৮৮৩ সালে মৃণালিনী দেবীর সঙ্গে তাঁর বিবাহ হয়। ১৮৯০ সাল থেকে রবীন্দ্রনাথ পূর্ববঙ্গের শিলাইদহের জমিদারি এস্টেটে বসবাস শুরু করেন। ১৯০১ সালে তিনি পশ্চিমবঙ্গের শান্তিনিকেতনে ব্রহ্মচর্যাশ্রম প্রতিষ্ঠা করেন এবং সেখানেই পাকাপাকিভাবে বসবাস শুরু করেন। ১৯০২ সালে তাঁর পত্নীবিয়োগ হয়। ১৯০৫ সালে তিনি বঙ্গভঙ্গ-বিরোধী আন্দোলনে জড়িয়ে পড়েন। ১৯১৫ সালে ব্রিটিশ সরকার তাঁকে নাইট উপাধিতে ভূষিত করেন। কিন্তু ১৯১৯ সালে জালিয়ানওয়ালাবাগ হত্যাকাণ্ডের প্রতিবাদে তিনি সেই উপাধি ত্যাগ করেন। ১৯২১ সালে গ্রামোন্নয়নের জন্য তিনি শ্রীনিকেতন নামে একটি সংস্থা প্রতিষ্ঠা করেন। ১৯২৩ সালে আনুষ্ঠানিকভাবে বিশ্বভারতী 
প্রতিষ্ঠিত হয়। দীর্ঘজীবনে তিনি বহুবার বিদেশ ভ্রমণ করেন এবং সমগ্র বিশ্বে বিশ্বভ্রাতৃত্বের বাণী প্রচার করেন। ১৯৪১ সালে দীর্ঘ রোগভোগের পর কলকাতার পৈত্রিক বাসভবনেই তাঁর মৃত্যু হয়।রবীন্দ্রনাথের কাব্যসাহিত্যের বৈশিষ্ট্য ভাবগভীরতা, গীতিধর্মিতা চিত্ররূপময়তা, অধ্যাত্মচেতনা, ঐতিহ্যপ্রীতি, প্রকৃতিপ্রেম, মানবপ্রেম, স্বদেশপ্রেম, বিশ্বপ্রেম, রোম্যান্টিক সৌন্দর্যচেতনা, ভাব, ভাষা, ছন্দ ও আঙ্গিকের বৈচিত্র্য, বাস্তবচেতনা ও প্রগতিচেতনা। রবীন্দ্রনাথের গদ্যভাষাও কাব্যিক। ভারতের ধ্রুপদি ও লৌকিক সংস্কৃতি এবং পাশ্চাত্য বিজ্ঞানচেতনা ও শিল্পদর্শন তাঁর রচনায় গভীর প্রভাব বিস্তার করেছিল। কথাসাহিত্য ও প্রবন্ধের মাধ্যমে তিনি সমাজ, রাজনীতি ও রাষ্ট্রনীতি সম্পর্কে নিজ মতামত প্রকাশ করেছিলেন। সমাজকল্যাণের উপায় হিসেবে তিনি গ্রামোন্নয়ন ও গ্রামের দরিদ্র মানুষ কে শিক্ষিত করে তোলার পক্ষে মতপ্রকাশ করেন। এর পাশাপাশি সামাজিক ভেদাভেদ, অস্পৃশ্যতা, ধর্মীয় গোঁড়ামি ও ধর্মান্ধতার বিরুদ্ধেও তিনি তীব্র প্রতিবাদ জানিয়েছিলেন। রবীন্দ্রনাথের দর্শনচেতনায় ঈশ্বরের মূল হিসেবে মানব সংসারকেই নির্দিষ্ট করা হয়েছে; রবীন্দ্রনাথ দেববিগ্রহের পরিবর্তে কর্মী অর্থাৎ মানুষ ঈশ্বরের পূজার কথা বলেছিলেন। সংগীত ও নৃত্যকে তিনি শিক্ষার অপরিহার্য অঙ্গ মনে করতেন। রবীন্দ্রনাথের গান তাঁর অন্যতম শ্রেষ্ঠ কীর্তি। তাঁর রচিত “”আমার সোনার বাংলা”” ও “”জনগণমন-অধিনায়ক জয় হে”” গানদুটি যথাক্রমে গণপ্রজাতন্ত্রী বাংলাদেশ ও ভারতীয় প্রজাতন্ত্রের জাতীয় সংগীত।

জীবন.

প্রথম জীবন (১৮৬১–১৯০১).

শৈশব ও কৈশোর (১৮৬১ – ১৮৭৮).
রবীন্দ্রনাথ ঠাকুর কলকাতার জোড়াসাঁকো ঠাকুরবাড়িতে জন্মগ্রহণ করেছিলেন। তাঁর পিতা ছিলেন ব্রাহ্ম ধর্মগুরু দেবেন্দ্রনাথ ঠাকুর (১৮১৭–১৯০৫) এবং মাতা ছিলেন সারদাসুন্দরী দেবী (১৮২৬–১৮৭৫)। রবীন্দ্রনাথ ছিলেন পিতামাতার চতুর্দশ সন্তান। জোড়াসাঁকোর ঠাকুর পরিবার ছিল ব্রাহ্ম আদিধর্ম মতবাদের প্রবক্তা। রবীন্দ্রনাথের পূর্ব পুরুষেরা খুলনা জেলার রূপসা উপজেলা পিঠাভোগে বাস করতেন। ১৮৭৫ সালে মাত্র চোদ্দ বছর বয়সে রবীন্দ্রনাথের মাতৃবিয়োগ ঘটে। পিতা দেবেন্দ্রনাথ দেশভ্রমণের নেশায় বছরের অধিকাংশ সময় কলকাতার বাইরে অতিবাহিত করতেন। তাই ধনাঢ্য পরিবারের সন্তান হয়েও রবীন্দ্রনাথের ছেলেবেলা কেটেছিল ভৃত্যদের অনুশাসনে। শৈশবে রবীন্দ্রনাথ কলকাতার ওরিয়েন্টাল সেমিনারি, নর্ম্যাল স্কুল, বেঙ্গল অ্যাকাডেমি এবং সেন্ট জেভিয়ার্স কলেজিয়েট স্কুলে কিছুদিন করে পড়াশোনা করেছিলেন। কিন্তু বিদ্যালয়-শিক্ষায় অনাগ্রহী হওয়ায় বাড়িতেই গৃহশিক্ষক রেখে তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। ছেলেবেলায় জোড়াসাঁকোর বাড়িতে অথবা বোলপুর ও পানিহাটির বাগানবাড়িতে প্রাকৃতিক পরিবেশের মধ্যে ঘুরে বেড়াতে বেশি স্বচ্ছন্দবোধ করতেন রবীন্দ্রনাথ।১৮৭৩ সালে এগারো বছর বয়সে রবীন্দ্রনাথের উপনয়ন অনুষ্ঠিত হয়েছিল। এরপর তিনি কয়েক মাসের জন্য পিতার সঙ্গে দেশভ্রমণে বের হন। প্রথমে তাঁরা আসেন শান্তিনিকেতনে। এরপর পাঞ্জাবের অমৃতসরে কিছুকাল কাটিয়ে শিখদের উপাসনা পদ্ধতি পরিদর্শন করেন। শেষে পুত্রকে নিয়ে দেবেন্দ্রনাথ যান পাঞ্জাবেরই (অধুনা ভারতের হিমাচল প্রদেশ রাজ্যে অবস্থিত) ডালহৌসি শৈলশহরের নিকট বক্রোটায়। এখানকার বক্রোটা বাংলোয় বসে রবীন্দ্রনাথ পিতার কাছ থেকে সংস্কৃত ব্যাকরণ, ইংরেজি, জ্যোতির্বিজ্ঞান, সাধারণ বিজ্ঞান ও ইতিহাসের নিয়মিত পাঠ নিতে শুরু করেন। দেবেন্দ্রনাথ তাঁকে বিশিষ্ট ব্যক্তিবর্গের জীবনী, কালিদাস রচিত ধ্রুপদি সংস্কৃত কাব্য ও নাটক এবং উপনিষদ্‌ পাঠেও উৎসাহিত করতেন। ১৮৭৭ সালে “”ভারতী”” পত্রিকায় তরুণ রবীন্দ্রনাথের কয়েকটি গুরুত্বপূর্ণ রচনা প্রকাশিত হয়। এগুলি হল মাইকেল মধুসূদনের “”””, “”ভানুসিংহ ঠাকুরের পদাবলী”” এবং “””” ও “””” নামে দুটি গল্প। এর মধ্যে “”ভানুসিংহ ঠাকুরের পদাবলী”” বিশেষভাবে উল্লেখযোগ্য। এই কবিতাগুলি রাধা-কৃষ্ণ বিষয়ক পদাবলির অনুকরণে “”ভানুসিংহ”” ভণিতায় রচিত। রবীন্দ্রনাথের “”ভিখারিণী”” গল্পটি (১৮৭৭) বাংলা সাহিত্যের প্রথম 
ছোটগল্প। ১৮৭৮ সালে প্রকাশিত হয় রবীন্দ্রনাথের প্রথম কাব্যগ্রন্থ তথা প্রথম মুদ্রিত গ্রন্থ “”কবিকাহিনী””। এছাড়া এই পর্বে তিনি রচনা করেছিলেন “””” (১৮৮২) কাব্যগ্রন্থটি। রবীন্দ্রনাথের বিখ্যাত কবিতা “””” এই কাব্যগ্রন্থের অন্তর্গত।

যৌবন (১৮৭৮-১৯০১).
১৮৭৮ সালে ব্যারিস্টারি পড়ার উদ্দেশ্যে ইংল্যান্ডে যান রবীন্দ্রনাথ। প্রথমে তিনি ব্রাইটনের একটি পাবলিক স্কুলে ভর্তি হয়েছিলেন। ১৮৭৯ সালে ইউনিভার্সিটি কলেজ লন্ডনে আইনবিদ্যা নিয়ে পড়াশোনা শুরু করেন। কিন্তু সাহিত্যচর্চার আকর্ষণে সেই পড়াশোনা তিনি সমাপ্ত করতে পারেননি। ইংল্যান্ডে থাকাকালীন শেকসপিয়র ও অন্যান্য ইংরেজ সাহিত্যিকদের রচনার সঙ্গে রবীন্দ্রনাথের পরিচয় ঘটে। এই সময় তিনি বিশেষ মনোযোগ সহকারে পাঠ করেন “”রিলিজিও মেদিচি””, “”কোরিওলেনাস”” এবং “”অ্যান্টনি অ্যান্ড ক্লিওপেট্রা””। এই সময় তাঁর ইংল্যান্ডবাসের অভিজ্ঞতার কথা “”ভারতী”” পত্রিকায় পত্রাকারে পাঠাতেন রবীন্দ্রনাথ। উক্ত পত্রিকায় এই লেখাগুলি জ্যেষ্ঠভ্রাতা দ্বিজেন্দ্রনাথ ঠাকুরের সমালোচনাসহ প্রকাশিত হত “””” নামে। ১৮৮১ সালে সেই পত্রাবলি “””” নামে গ্রন্থাকারে ছাপা হয়। এটিই ছিল রবীন্দ্রনাথের প্রথম গদ্যগ্রন্থ তথা প্রথম চলিত ভাষায় লেখা গ্রন্থ। অবশেষে ১৮৮০ সালে প্রায় দেড় বছর ইংল্যান্ডে কাটিয়ে কোনো ডিগ্রি না নিয়ে এবং ব্যারিস্টারি পড়া শুরু না করেই তিনি দেশে ফিরে আসেন।১৮৮৩ সালের ৯ ডিসেম্বর (২৪ অগ্রহায়ণ, ১২৯০ বঙ্গাব্দ) ঠাকুরবাড়ির অধস্তন কর্মচারী বেণীমাধব রায়চৌধুরীর কন্যা ভবতারিণীর সঙ্গে রবীন্দ্রনাথের বিবাহ সম্পন্ন হয়। বিবাহিত জীবনে ভবতারিণীর নামকরণ হয়েছিল মৃণালিনী দেবী (১৮৭৩–১৯০২ )। রবীন্দ্রনাথ ও মৃণালিনীর সন্তান ছিলেন পাঁচ জন: মাধুরীলতা (১৮৮৬–১৯১৮), রথীন্দ্রনাথ (১৮৮৮–১৯৬১), রেণুকা (১৮৯১–১৯০৩), মীরা (১৮৯৪–১৯৬৯) এবং শমীন্দ্রনাথ (১৮৯৬–১৯০৭)। এঁদের মধ্যে অতি অল্প বয়সেই রেণুকা ও শমীন্দ্রনাথের মৃত্যু ঘটে।১৮৯১ সাল থেকে পিতার আদেশে নদিয়া (নদিয়ার উক্ত অংশটি অধুনা বাংলাদেশের কুষ্টিয়া জেলা), পাবনা ও রাজশাহী জেলা এবং উড়িষ্যার জমিদারিগুলির তদারকি শুরু করেন রবীন্দ্রনাথ। কুষ্টিয়ার শিলাইদহের কুঠিবাড়িতে রবীন্দ্রনাথ দীর্ঘ সময় অতিবাহিত করেছিলেন। জমিদার রবীন্দ্রনাথ শিলাইদহে “”পদ্মা”” নামে একটি বিলাসবহুল পারিবারিক বজরায় চড়ে প্রজাবর্গের কাছে খাজনা আদায় ও আশীর্বাদ প্রার্থনা করতে যেতেন। গ্রামবাসীরাও তাঁর সম্মানে ভোজসভার আয়োজন করত।১৮৯০ সালে রবীন্দ্রনাথের অপর বিখ্যাত কাব্যগ্রন্থ “””” প্রকাশিত হয়। কুড়ি থেকে ত্রিশ বছর বয়সের মধ্যে তাঁর আরও কয়েকটি উল্লেখযোগ্য কাব্যগ্রন্থ ও গীতিসংকলন প্রকাশিত হয়েছিল। এগুলি হলো “”””, “”””, 
“”রবিচ্ছায়া””, “””” ইত্যাদি। ১৮৯১ থেকে ১৮৯৫ সাল পর্যন্ত নিজের সম্পাদিত “”সাধনা”” পত্রিকায় রবীন্দ্রনাথের বেশ কিছু উৎকৃষ্ট রচনা প্রকাশিত হয়। তাঁর সাহিত্যজীবনের এই পর্যায়টি তাই “”সাধনা পর্যায়”” নামে পরিচিত। রবীন্দ্রনাথের “”গল্পগুচ্ছ”” গ্রন্থের প্রথম চুরাশিটি গল্পের অর্ধেকই এই পর্যায়ের রচনা। এই ছোটগল্পগুলিতে তিনি বাংলার গ্রামীণ জনজীবনের এক আবেগময় ও শ্লেষাত্মক চিত্র এঁকেছিলেন।

• Preprocess the csv files with the following code, using regular expressions to get rid of punctuation. Remember to decode to utf-8 first, since the input files are unicode.

from glob import glob
import re

words = []
for f in glob('bangla/wiki/*.csv'):
    words += re.sub(r'[\r\n—?,;।!‘"’\.:…0-9]', ' ', open(f, 'rb').read().decode('utf8').strip()).split(' ')
words = list(filter(lambda x: not x in ['', '-'], words))
print(len(words))
# 13964346
words[:25]
#['রবীন্দ্রনাথ',
# 'ঠাকুর',
# 'রবীন্দ্রনাথ',
# 'ঠাকুর',
# '৭ই',
# 'মে',
# '১৮৬১',
# '৭ই',
# 'আগস্ট',
# '১৯৪১',
# '২৫',
# 'বৈশাখ',
# '১২৬৮',
# '২২',
# 'শ্রাবণ',
# '১৩৪৮',
# 'বঙ্গাব্দ',
# 'ছিলেন',
# 'অগ্রণী',
# 'বাঙালি',
# 'কবি',
# 'ঔপন্যাসিক',
# 'সংগীতস্রষ্টা',
# 'নাট্যকার',
# 'চিত্রকর']


• Create indices for unique words in the dataset.
import collections

vocabulary_size = 25000

def build_dataset(words):
    count = [['UNK', -1]]
    count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
    dictionary = dict()
    for word, _ in count:
        dictionary[word] = len(dictionary)
    data = list()
    unk_count = 0
    for word in words:
        if word in dictionary:
            index = dictionary[word]
        else:
            index = 0  # dictionary['UNK']
            unk_count = unk_count + 1
        data.append(index)
    count[0][1] = unk_count
    reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys()))
    return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
# Most common words (+UNK) [['UNK', 1961151], ('এবং', 196916), ('ও', 180042), ('হয়', 160533), ('করে', 131206)]
print('Sample data', data[:10])
# Sample data [1733, 1868, 1733, 1868, 5769, 287, 6855, 5769, 400, 2570]
del words  # Hint to reduce memory.

• Generate batches to be trained with the word2vec skip-gram model.
• The target label should be at the center of the buffer each time. That is, given a focus word, our goal will be to learn the most probable context words.
• The input and the target vector will depend on num_skips and skip_window.

import random

data_index = 0

def generate_batch(batch_size, num_skips, skip_window):
    global data_index
    assert batch_size % num_skips == 0
    assert num_skips <= 2 * skip_window
    batch = np.ndarray(shape=(batch_size), dtype=np.int32)
    labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
    span = 2 * skip_window + 1  # [ skip_window target skip_window ]
    buffer = collections.deque(maxlen=span)
    for _ in range(span):
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    for i in range(batch_size // num_skips):
        target = skip_window  # target label at the center of the buffer
        targets_to_avoid = [skip_window]
        for j in range(num_skips):
            while target in targets_to_avoid:
                target = random.randint(0, span - 1)
            targets_to_avoid.append(target)
            batch[i * num_skips + j] = buffer[skip_window]
            labels[i * num_skips + j, 0] = buffer[target]
        buffer.append(data[data_index])
        data_index = (data_index + 1) % len(data)
    return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])
# data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে', '১৮৬১', '৭ই']

for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
# data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে',  '১৮৬১', '৭ই']
# with num_skips = 2 and skip_window = 1:
# batch: ['ঠাকুর', 'ঠাকুর', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর',  '৭ই', '৭ই']
# labels: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', '৭ই', 'রবীন্দ্রনাথ', 'ঠাকুর', 'মে']
# with num_skips = 4 and skip_window = 2:
# batch: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর']
# labels: ['রবীন্দ্রনাথ', '৭ই', 'ঠাকুর', 'ঠাকুর', 'মে', 'ঠাকুর', 'রবীন্দ্রনাথ', '৭ই']

• Pick a random validation set to sample nearest neighbors.
• Limit the validation samples to the words that have a low numeric ID, which by construction are also the most frequent.
• Look up embeddings for inputs and compute the softmax loss, using a sample of the negative labels each time (this is known as negative sampling, which makes the computation efficient, since the number of labels is often too large).
• The optimizer will optimize the softmax_weights and the embeddings.
This is because the embeddings are defined as a variable quantity and the optimizer’s minimize method will by default modify all variable quantities that contribute to the tensor it is passed.
• Compute the similarity between minibatch examples and all embeddings.

import math

batch_size = 128
embedding_size = 128  # Dimension of the embedding vector.
skip_window = 1       # How many words to consider left and right.
num_skips = 2         # #times to reuse an input to generate a label.
valid_size = 16       # Random set of words to evaluate similarity on.
valid_window = 100    # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64      # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
    # Input data.
    train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
    train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
    valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
    # Variables.
    embeddings = tf.Variable(
        tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
    softmax_weights = tf.Variable(
        tf.truncated_normal([vocabulary_size, embedding_size],
                            stddev=1.0 / math.sqrt(embedding_size)))
    softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
    # Model: look up embeddings for inputs and compute the sampled softmax loss.
    embed = tf.nn.embedding_lookup(embeddings, train_dataset)
    loss = tf.reduce_mean(
        tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                   inputs=embed, labels=train_labels,
                                   num_sampled=num_sampled, num_classes=vocabulary_size))
    # Optimizer (this line was omitted in the snippet; AdaGrad, as in the standard
    # word2vec example, is an assumption): minimize() updates both the softmax
    # weights and the embeddings, since both are trainable variables.
    optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
    # Compute the similarity between minibatch examples and all embeddings,
    # using the cosine distance:
    norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
    normalized_embeddings = embeddings / norm
    valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
    similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))

• Train the word2vec model with the batches constructed, for 100k steps.

num_steps = 100001

with tf.Session(graph=graph) as session:
    tf.global_variables_initializer().run()
    print('Initialized')
    average_loss = 0
    for step in range(num_steps):
        batch_data, batch_labels = generate_batch(batch_size, num_skips, skip_window)
        feed_dict = {train_dataset: batch_data, train_labels: batch_labels}
        _, l = session.run([optimizer, loss], feed_dict=feed_dict)
        average_loss += l
        if step % 2000 == 0:
            if step > 0:
                average_loss = average_loss / 2000
            # The average loss is an estimate of the loss over the last 2000 batches.
            print('Average loss at step %d: %f' % (step, average_loss))
            average_loss = 0
        # Note that this is expensive (~20% slowdown if computed every 500 steps).
        if step % 10000 == 0:
            sim = similarity.eval()
            for i in range(valid_size):
                valid_word = reverse_dictionary[valid_examples[i]]
                top_k = 8  # number of nearest neighbors
                nearest = (-sim[i, :]).argsort()[1:top_k+1]
                log = 'Nearest to %s:' % valid_word
                for k in range(top_k):
                    close_word = reverse_dictionary[nearest[k]]
                    log = '%s %s,' % (log, close_word)
                print(log)
    final_embeddings = normalized_embeddings.eval()

• The following shows how the loss function decreases with the increase in training steps.
• During the training process, the words that become semantically near come closer in the embedding space.
• Use t-SNE plot to map the following words from 128-dimensional embedding space to 2 dimensional manifold and visualize.

words = ['রাজা', 'রাণী', 'ভারত','বাংলাদেশ','দিল্লী','কলকাতা','ঢাকা',
'পুরুষ','নারী','দুঃখ','লেখক','কবি','কবিতা','দেশ',
'বিদেশ','লাভ','মানুষ', 'এবং', 'ও', 'গান', 'সঙ্গীত', 'বাংলা',
'ইংরেজি', 'ভাষা', 'কাজ', 'অনেক', 'জেলার', 'বাংলাদেশের',
'এক', 'দুই', 'তিন', 'চার', 'পাঁচ', 'দশ', '১', '৫', '২০',
'নবম', 'ভাষার', '১২', 'হিসাবে', 'যদি', 'পান', 'শহরের', 'দল',
'যদিও', 'বলেন', 'রান', 'করেছে', 'করে', 'এই', 'করেন', 'তিনি',
'একটি', 'থেকে', 'করা', 'সালে', 'এর', 'যেমন', 'সব',  'তার',
'খেলা',  'অংশ', 'উপর', 'পরে', 'ফলে',  'ভূমিকা', 'গঠন',
'তা', 'দেন', 'জীবন', 'যেখানে', 'খান', 'এতে',  'ঘটে', 'আগে',
'ধরনের', 'নেন', 'করতেন', 'তাকে', 'আর', 'যার', 'দেখা',
'বছরের', 'উপজেলা', 'থাকেন', 'রাজনৈতিক', 'মূলত', 'এমন',
'কিলোমিটার', 'পরিচালনা', '২০১১', 'তারা', 'তিনি', 'যিনি', 'আমি',
'তুমি', 'আপনি', 'লেখিকা', 'সুখ', 'বেদনা', 'মাস', 'নীল', 'লাল',
'সবুজ', 'সাদা', 'আছে', 'নেই', 'ছুটি', 'ঠাকুর',
'দান', 'মণি', 'করুণা', 'মাইল', 'হিন্দু', 'মুসলমান','কথা', 'বলা',
'সেখানে', 'তখন', 'বাইরে', 'ভিতরে', 'ভগবান' ]
from sklearn.manifold import TSNE

indices = []
for word in words:
    indices.append(dictionary[word])

# tsne and plot were not defined in the snippet; minimal versions below
# (the t-SNE hyperparameters are assumptions).
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)
two_d_embeddings = tsne.fit_transform(final_embeddings[indices, :])

def plot(embeddings, labels):
    plt.figure(figsize=(15, 15))
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        plt.scatter(x, y)
        plt.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points')
    plt.show()

plot(two_d_embeddings, words)

• The following figure shows how the words similar in meaning are mapped to embedding vectors that are close to each other.
• Also note the arithmetic property of the word embeddings: e.g., the words ‘রাজা’ and ‘রাণী’ are separated by approximately the same distance and direction as the words ‘লেখক’ and ‘লেখিকা’, reflecting that the nature of the semantic relatedness in terms of gender is the same.
• The following animation shows how the embedding is learnt to preserve the semantic similarity in the 2D-manifold more and more as training proceeds.

## Generating song-like texts with LSTM from Tagore’s Bangla songs

### Text generation with Character LSTM

• Let’s import the required libraries first.
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM
import numpy as np
import io, re

• Read the input file, containing a few selected songs of Tagore in Bangla.
raw_text = open('rabindrasangeet.txt','rb').read().decode('utf8')
print(raw_text[0:1000])

পূজা

অগ্নিবীণা বাজাও তুমি

অগ্নিবীণা বাজাও তুমি কেমন ক’রে !
আকাশ কাঁপে তারার আলোর গানের ঘোরে ।।
তেমনি ক’রে আপন হাতে ছুঁলে আমার বেদনাতে,
নূতন সৃষ্টি জাগল বুঝি জীবন-‘পরে ।।
বাজে ব’লেই বাজাও তুমি সেই গরবে,
ওগো প্রভু, আমার প্রাণে সকল সবে ।
বিষম তোমার বহ্নিঘাতে বারে বারে আমার রাতে
জ্বালিয়ে দিলে নূতন তারা ব্যথায় ভ’রে ।।

অচেনাকে ভয় কী
অচেনাকে ভয় কী আমার ওরে?
অচেনাকেই চিনে চিনে উঠবে জীবন ভরে ।।
জানি জানি আমার চেনা কোনো কালেই ফুরাবে না,
চিহ্নহারা পথে আমায় টানবে অচিন ডোরে ।।
ছিল আমার মা অচেনা, নিল আমায় কোলে ।
সকল প্রেমই অচেনা গো, তাই তো হৃদয় দোলে ।
অচেনা এই ভুবন-মাঝে কত সুরেই হৃদয় বাজে-
অচেনা এই জীবন আমার, বেড়াই তারি ঘোরে ।।অন্তর মম
অন্তর মম বিকশিত করো অন্তরতর হে-
নির্মল করো, উজ্জ্বল করো, সুন্দর করো হে ।।
জাগ্রত করো, উদ্যত করো, নির্ভয় করো হে ।।
মঙ্গল করো, নিরলস নিঃসংশয় করো হে ।।
যুক্ত করো হে সবার সঙ্গে, মুক্ত করো হে বন্ধ ।
সঞ্চার করো সকল কর্মে শান্ত তোমার ছন্দ ।
চরণপদ্মে মম চিত নিস্পন্দিত করো হে ।
নন্দিত করো, নন্দিত করো, নন্দিত করো হে ।।

অন্তরে জাগিছ অন্তর্যামী
অন্তরে জাগিছ অন্তর্যামী ।

• Here we shall be using a many-to-many RNN as shown in the next figure.
• Pre-process the text and create character indices to be used as the input in the model.

processed_text = raw_text.lower()
print('corpus length:', len(processed_text))
# corpus length: 207117
chars = sorted(list(set(processed_text)))
print('total chars:', len(chars))
# total chars: 89
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))

• Cut the text into semi-redundant sequences of maxlen characters.

def is_conjunction(c):
    # True if the character is a Bengali combining sign / symbol that should not
    # start a training window. The comparison operators in the original post were
    # eaten by the HTML rendering; the ranges below are a reconstruction over the
    # Bengali Unicode block using the constants that survived.
    h = ord(c)  # print(hex(ord(c)))
    return (h >= 0x980 and h <= 0x983) or (h >= 0x9bc and h <= 0x9f2)

maxlen = 40
step = 2
sentences = []
next_chars = []
i = 0
while i < len(processed_text) - maxlen:
    if is_conjunction(processed_text[i]):
        i += 1
        continue
    sentences.append(processed_text[i: i + maxlen])
    next_chars.append(processed_text[i + maxlen])
    i += step
print('nb sequences:', len(sentences))
# nb sequences: 89334

• Create one-hot-encodings.

x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
    for t, char in enumerate(sentence):
        x[i, t, char_indices[char]] = 1
    y[i, char_indices[next_chars[i]]] = 1

• Build a model, a single LSTM.

model = Sequential()
# The layer definitions were omitted in the snippet; they are reconstructed here
# from the model summary below, and the activations and optimizer are assumptions.
model.add(LSTM(256, input_shape=(maxlen, len(chars))))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
optimizer = 'rmsprop'
model.compile(loss='categorical_crossentropy', optimizer=optimizer)

• The following figure shows what the model architecture looks like:
• Print the model summary.

model.summary()

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm (LSTM)                  (None, 256)               354304
_________________________________________________________________
dense (Dense)                (None, 128)               32896
_________________________________________________________________
dense_1 (Dense)              (None, 89)                11481
=================================================================
Total params: 398,681 Trainable params: 398,681 Non-trainable params: 0
_________________________________________________________________

• Use the following helper function to sample an index from a probability array.

def sample(preds, temperature=1.0):
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds) / temperature
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    probas = np.random.multinomial(1, preds, 1)
    return np.argmax(probas)

• Fit the model and register a callback to print the text generated by the model at the end of each epoch.

print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])
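The on_epoch_end callback registered above is not shown in the post. Below is a minimal sketch of the generation loop such a callback would drive; the helper generate_text and its injected predict_fn argument are mine (so the sketch stays self-contained), sample is repeated from above with a small epsilon added to avoid log(0), and in the actual callback predict_fn would be the trained model's predict with a seed window drawn from the corpus:

```python
import numpy as np

def sample(preds, temperature=1.0):
    # sample an index from a probability array, reweighted by temperature
    preds = np.asarray(preds).astype('float64')
    preds = np.log(preds + 1e-12) / temperature  # epsilon avoids log(0)
    exp_preds = np.exp(preds)
    preds = exp_preds / np.sum(exp_preds)
    return np.argmax(np.random.multinomial(1, preds, 1))

def generate_text(predict_fn, seed, char_indices, indices_char,
                  maxlen=40, length=200, temperature=0.5):
    # Repeatedly one-hot encode the current window, ask the model for the
    # next-character distribution, sample from it and slide the window.
    generated = seed
    sentence = seed
    for _ in range(length):
        x_pred = np.zeros((1, maxlen, len(char_indices)))
        for t, char in enumerate(sentence):
            x_pred[0, t, char_indices[char]] = 1.0
        preds = predict_fn(x_pred)[0]
        next_char = indices_char[sample(preds, temperature)]
        generated += next_char
        sentence = sentence[1:] + next_char
    return generated
```

At the end of each epoch, the callback would call this with a fresh maxlen-character seed for a few temperatures (e.g., 0.2, 0.5, 1.0) and print the results.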

• The following animation shows how the model generates song-like texts with given seed texts, for different values of the temperature parameter.

### Text Generation with Word LSTM

• Pre-process the input text, split by punctuation characters and create word indices to be used as the input in the model.
processed_text = raw_text.lower()
from string import punctuation
r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
words = r.split(processed_text)
print(len(words))
# 39481
words[:16]
# ['পূজা',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'কেমন',
# 'ক’রে',
# 'আকাশ',
# 'কাঁপে',
# 'তারার',
# 'আলোর',
# 'গানের',
# 'ঘোরে',
# '।।']

unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
index_unique_word = dict((i, c) for i, c in enumerate(unique_words))

• Create a word-window of length 5 to predict the next word.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[1])
# ['অগ্নিবীণা', 'বাজাও', 'তুমি', 'অগ্নিবীণা', 'বাজাও']
print(next_words[1])
# তুমি
print(len(unique_words))
# 7847

• Create OHE for the input and output words, as done for the character-RNN. Fit the model on the preprocessed data.
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(X, Y,
batch_size=128,
epochs=60,
callbacks=[print_callback])
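The one-hot encoding step referenced in the bullet above is not shown in the post; a sketch under the same conventions as the character model (the helper name one_hot_windows is mine, and the X and Y it returns match the names used in the fit call) could be:

```python
import numpy as np

def one_hot_windows(prev_words, next_words, unique_word_index):
    # X[i, t, w] = 1 iff word w sits at position t of the i-th window;
    # Y[i, w] = 1 iff word w is the next word after the i-th window.
    n = len(prev_words)
    word_length = len(prev_words[0])
    vocab = len(unique_word_index)
    X = np.zeros((n, word_length, vocab), dtype=bool)
    Y = np.zeros((n, vocab), dtype=bool)
    for i, window in enumerate(prev_words):
        for t, word in enumerate(window):
            X[i, t, unique_word_index[word]] = 1
        Y[i, unique_word_index[next_words[i]]] = 1
    return X, Y
```

For the Tagore corpus this would be called as `X, Y = one_hot_windows(prev_words, next_words, unique_word_index)` before fitting.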

• The following animation shows the song-like text generated by the word-LSTM at the end of an epoch.

## Bangla Sentiment Analysis using LSTM with Daily Astrological Prediction Dataset

• Let’s first create a sentiment analysis dataset by crawling the daily astrological predictions (রাশিফল) page of the online edition of আনন্দবাজার পত্রিকা (e.g., for the year 2013), a leading Bangla newspaper, and then manually labeling the sentiment of each prediction corresponding to each moon-sign.
• Read the csv dataset, the first few lines look like the following.

import pandas as pd
pd.set_option('display.max_colwidth', 135)
# read the crawled dataset into df (the filename here is a placeholder)
df = pd.read_csv('rashifal_2013.csv')
df.head()

• Transform each text into a sequence of integers.

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=2000, split=' ')
tokenizer.fit_on_texts(df['আপনার আজকের দিনটি'].values)
X = tokenizer.texts_to_sequences(df['আপনার আজকের দিনটি'].values)
X = pad_sequences(X)  # zero-pad the sequences to equal length, as shown below
X
#array([[   0,    0,    0, ...,   26,  375,    3],
#       [   0,    0,    0, ...,   54,    8,    1],
#       [   0,    0,    0, ...,  108,   42,   43],
#       ...,
#       [   0,    0,    0, ..., 1336,  302,   82],
#       [   0,    0,    0, ..., 1337,  489,  218],
#       [   0,    0,    0, ...,    2,  316,   87]])

• Here we shall use a many-to-one RNN for sentiment analysis as shown below.
• Build an LSTM model that takes a sentence as input and outputs the sentiment label.
from tensorflow.keras.layers import Embedding, SpatialDropout1D

model = Sequential()
# The layer definitions were omitted in the snippet; they are reconstructed here
# from the model summary below, and the dropout rates are assumptions.
model.add(Embedding(2000, 128, input_length=X.shape[1]))
model.add(SpatialDropout1D(0.4))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
embedding_10 (Embedding)     (None, 12, 128)           256000
_________________________________________________________________
spatial_dropout1d_10 (Spatia (None, 12, 128)           0
_________________________________________________________________
lstm_10 (LSTM)               (None, 128)               131584
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 258
=================================================================
Total params: 387,842
Trainable params: 387,842
Non-trainable params: 0
_________________________________________________________________
None

• Divide the dataset into train and validation (test) dataset and train the LSTM model on the dataset.

from sklearn.model_selection import train_test_split

Y = pd.get_dummies(df['sentiment']).values
X_train, X_test, Y_train, Y_test, _, indices = train_test_split(
    X, Y, np.arange(len(X)), test_size=0.33, random_state=5)
model.fit(X_train, Y_train, epochs=5, batch_size=32, verbose=2)

#Epoch 1/5  - 3s - loss: 0.6748 - acc: 0.5522
#Epoch 2/5  - 1s - loss: 0.5358 - acc: 0.7925
#Epoch 3/5  - 1s - loss: 0.2368 - acc: 0.9418
#Epoch 4/5  - 1s - loss: 0.1011 - acc: 0.9761
#Epoch 5/5  - 1s - loss: 0.0578 - acc: 0.9836

• Predict the sentiment labels of the (held out) test dataset.

result = model.predict(X[indices], batch_size=1, verbose=2)
df1 = df.iloc[indices]
df1['neg_prob'] = result[:, 0]
df1['pos_prob'] = result[:, 1]
df1['pred'] = np.array(['negative', 'positive'])[np.argmax(result, axis=1)]

• Finally, compute the accuracy of the model for the positive and negative ground-truth sentiment corresponding to daily astrological predictions.

df2 = df1[df1.sentiment == 'positive']
print('positive accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#positive accuracy:0.9177215189873418
df2 = df1[df1.sentiment == 'negative']
print('negative accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#negative accuracy:0.9352941176470588


## Building a very simple Bangla Chatbot with RASA NLU

• The following figure shows how to design a very simple Bangla chatbot to order food from restaurants using RASA NLU.
• We need to design the intents, entities and slots to extract the entities properly and then design stories to define how the chatbot will respond to user inputs (core / dialog).
• The following figure shows how the nlu, domain and stories files are written for the simple chatbot.
• A sequence-to-sequence deep learning model is trained under the hood for intent classification. The next code block shows how the model can be trained.
import rasa
model_path = rasa.train('domain.yml', 'config.yml', ['data/'], 'models/')
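As an illustration only, a tiny slice of such training data might look like the following; the intent, entity, and action names here are made up for this sketch and are not from the original chatbot:

```yaml
# nlu.yml — sample training examples (hypothetical)
nlu:
- intent: order_food
  examples: |
    - আমি [বিরিয়ানি](dish) অর্ডার করতে চাই
    - একটা [পিৎজা](dish) দিন

# stories.yml — a sample dialogue path (hypothetical)
stories:
- story: order food happy path
  steps:
  - intent: order_food
  - action: utter_confirm_order
```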

• The following gif demonstrates how the chatbot responds to user inputs.

References

# Travelling Salesman Problem (TSP) with Python

In this blog we shall discuss the Travelling Salesman Problem (TSP), a very famous NP-hard problem, and will take a few attempts to solve it (either by considering special cases, such as Bitonic TSP, and solving them efficiently; by using exact algorithms with improved runtime, e.g., dynamic programming; or by using approximation algorithms, e.g., for Metric TSP, and heuristics that obtain not necessarily optimal but good enough solutions, e.g., Simulated Annealing and Genetic Algorithms), and work on the corresponding python implementations. A few of the problems discussed here appeared as programming assignments in the Coursera course Advanced Algorithms and Complexity, and some of the problem statements are taken from the course.

# Improving the runtime of the Travelling Salesman Problem with Dynamic Programming

In this problem we shall deal with a classical NP-complete problem called the Traveling Salesman Problem. Given a graph with weighted edges, we need to find the shortest cycle visiting each vertex exactly once. Vertices correspond to cities. Edge weights correspond to the cost (e.g., time) to get from one vertex to another. Some vertices may not be connected by an edge in the general case.
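Before improving the runtime, it helps to see the naive baseline. The sketch below (mine, not from the course) checks all (n − 1)! cycles through vertex 0; it is only feasible for very small n, which is what motivates the dynamic programming approach:

```python
from itertools import permutations

def tsp_brute_force(G):
    # Try every cycle that starts and ends at vertex 0 and return the
    # cheapest one as (cost, cycle); G is the adjacency (weight) matrix.
    n = len(G)
    best = (float('inf'), None)
    for perm in permutations(range(1, n)):
        cycle = (0,) + perm
        cost = sum(G[cycle[i]][cycle[(i + 1) % n]] for i in range(n))
        best = min(best, (cost, cycle))
    return best

print(tsp_brute_force([[0, 1, 2], [1, 0, 1], [2, 1, 0]]))  # → (4, (0, 1, 2))
```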

• Here we shall use dynamic programming to solve TSP: instead of solving one problem we will solve a collection of (overlapping) subproblems.
• A subproblem refers to a partial solution
• A reasonable partial solution in case of TSP is the initial part of a cycle
• To continue building a cycle, we need to know the last vertex as well as the set of already visited vertices
• It will be convenient to assume that vertices are integers from 1 to n and that the salesman starts his trip in (and also returns back to) vertex 1.
• The following figure shows the Dynamic programming subproblems, the recurrence relation and the algorithm for TSP with DP.
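Since the figure is not reproduced here, the subproblem and recurrence can be stated explicitly. Let C(S, i) be the length of the shortest path that starts at vertex 1, visits every vertex of S exactly once, and ends at vertex i ∈ S. Then

C({1}, 1) = 0
C(S, i) = min over j ∈ S − {i} of ( C(S − {i}, j) + w(j, i) ), for i ≠ 1

and the length of an optimal cycle is min over i of ( C({1, . . . , n}, i) + w(i, 1) ).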

Implementation tips

• In order to iterate through all subsets of {1, . . . , n}, it will be helpful to notice that there is a natural one-to-one correspondence between integers in the range from 0 to 2^n − 1 and subsets of {0, . . . , n − 1}: k ↔ {i : i-th bit of k is 1}.
• For example, k = 1 (binary 001) corresponds to the set {0}, while k = 5 (binary 101) corresponds to the set {0, 2}.
• In order to find the integer corresponding to S − {j} (for j ∈ S), we need to flip the j-th bit of k (from 1 to 0). For this, in turn, we can compute a bitwise XOR of k and 2^j (which has 1 only in the j-th position).
• In order to compute the optimal path along with the cost, we need to maintain back-pointers to store the path.
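These bit tricks can be checked in a few lines (the helper subset_of is mine, for illustration):

```python
def subset_of(k, n):
    # decode the integer k into the subset of {0, ..., n-1} it represents
    return {i for i in range(n) if (k >> i) & 1}

print(subset_of(1, 3))   # k = 1 (binary 001) corresponds to {0}
print(subset_of(5, 3))   # k = 5 (binary 101) corresponds to {0, 2}
# S - {j} for j in S: flip the j-th bit of k with a bitwise XOR
print(subset_of(5 ^ (1 << 2), 3))  # removing 2 from {0, 2} leaves {0}
```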

The following python code shows an implementation of the above algorithm.

import numpy as np
from itertools import combinations

def TSP(G):
    n = len(G)
    C = [[np.inf for _ in range(n)] for __ in range(1 << n)]
    C[1][0] = 0  # bitmask 1 <-> {0}: the empty path starting at vertex 0
    for size in range(1, n):
        for S in combinations(range(1, n), size):
            S = (0,) + S
            k = sum([1 << i for i in S])
            for i in S:
                if i == 0: continue
                for j in S:
                    if j == i: continue
                    cur_index = k ^ (1 << i)  # bitmask of S - {i}
                    C[k][i] = min(C[k][i], C[cur_index][j] + G[j][i])  # C[S-{i}][j]
    all_index = (1 << n) - 1
    # close the cycle by returning to vertex 0
    return min([(C[all_index][i] + G[i][0], i) for i in range(n)])

• The following animation shows how the least cost solution cycle is computed with the DP for a graph with 4 vertices. Notice that in order to represent C(S,i) from the algorithm, the vertices that belong to the set S are colored with red circles, the vertex i where the path that traverses through all the nodes in S ends at is marked with a red double-circle.
• The next animation also shows how the DP table gets updated. The DP table for a graph with 4 nodes will be of size 2⁴ × 4, since there are 2⁴ = 16 subsets of the vertex set V = {0,1,2,3}, and a path going through a subset of the vertices in V may end at any of the 4 vertices.
• The transposed DP table is shown in the next animation, here the columns correspond to the subset of the vertices and rows correspond to the vertex the TSP ends at.
• The following animation shows how the least cost solution cycle is computed with the DP for a graph with 5 nodes.

The following animation / figure shows how the TSP optimal path is computed for an increasing number of nodes (where the weights of the input graphs are randomly generated), along with the exponential increase in the time taken.

# Solving TSP with Integer Linear Program

Solving with the mip package using the following python code produces the output shown in the following animation, for a graph with randomly generated edge-weights.


from mip import Model, xsum, minimize, BINARY
from itertools import product
from time import time

def TSP_ILP(G):
    start = time()
    V1 = range(len(G))
    n, V = len(G), set(V1)
    model = Model()
    # binary variables indicating if arc (i,j) is used on the route or not
    x = [[model.add_var(var_type=BINARY) for j in V] for i in V]
    # continuous variables to prevent subtours: each city will have a
    # different sequential id in the planned route, except the 1st one
    y = [model.add_var() for i in V]
    # objective function: minimize the distance
    model.objective = minimize(xsum(G[i][j]*x[i][j] for i in V for j in V))
    # constraint: leave each city only once
    for i in V:
        model += xsum(x[i][j] for j in V - {i}) == 1
    # constraint: enter each city only once
    for i in V:
        model += xsum(x[j][i] for j in V - {i}) == 1
    # subtour elimination
    for (i, j) in product(V - {0}, V - {0}):
        if i != j:
            model += y[i] - (n+1)*x[i][j] >= y[j] - n
    # optimizing
    model.optimize()
    # checking if a solution was found
    if model.num_solutions:
        print('Total distance {}'.format(model.objective_value))
        nc = 0  # cycle starts from vertex 0
        cycle = [nc]
        while True:
            nc = [i for i in V if x[nc][i].x >= 0.99][0]
            cycle.append(nc)
            if nc == 0:
                break
    return (model.objective_value, cycle)


The constraint to prevent subtours from appearing in the solution is necessary: if we run without it, we get a solution with subtours instead of a single cycle going through all the nodes, as shown below.

Comparing with the Dynamic programming based solution, we can see that the ILP solution is much more efficient for higher values of n.

# Bitonic TSP

The following python code snippet implements the above DP algorithm.
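The figure with the recurrence is not reproduced here, so here it is in the conventions of the code below (points sorted by x-coordinate): D[i, j] with i < j denotes the minimum total length of two disjoint x-monotone paths that start at points i and j, together visit all of the points i, i+1, . . . , n−1, and both end at point n−1. Since point i+1 must extend the path starting at i whenever j > i+1:

D[n−2, n−1] = dist(n−2, n−1)
D[i, j] = D[i+1, j] + dist(i, i+1), for j > i+1
D[i, i+1] = min over k > i+1 of ( D[i+1, k] + dist(i, k) )

and the optimal bitonic tour has length D[0, 1] + dist(0, 1).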


import numpy as np

def dist(P, i, j):
    # Euclidean distance between the 2D points P[i] and P[j]
    return np.sqrt((P[i][0]-P[j][0])**2 + (P[i][1]-P[j][1])**2)

def BTSP(P):
    n = len(P)
    D = np.ones((n, n))*np.inf
    path = np.ones((n, n), dtype=int)*(-1)
    D[n-2, n-1] = dist(P, n-2, n-1)
    path[n-2, n-1] = n-1
    for i in range(n-3, -1, -1):
        m = np.inf
        for k in range(i+2, n):
            if m > D[i+1, k] + dist(P, i, k):
                m, mk = D[i+1, k] + dist(P, i, k), k
        D[i, i+1] = m
        path[i, i+1] = mk
        for j in range(i+2, n):
            D[i, j] = D[i+1, j] + dist(P, i, i+1)
            path[i, j] = i+1
    D[0, 0] = D[0, 1] + dist(P, 0, 1)
    path[0, 0] = 1
    return D, path

def get_tsp_path(path, i, j, n):
    if n < 0:
        return []
    if i <= j:
        k = path[i, j]
        return [k] + get_tsp_path(path, k, j, n-1)
    else:
        k = path[j, i]
        return get_tsp_path(path, i, k, n-1) + [k]


The following animation shows how the DP table is computed and the optimal path for Bitonic TSP is constructed. It also shows the final optimal path.

# 2-OPT Approximation Algorithm for Metric TSP

The next code snippet implements the above 2-OPT approximation algorithm: compute a minimum spanning tree of the graph and output the preorder DFS traversal of the tree as the tour. For a metric graph the tour obtained this way is at most twice as long as the optimal one.


import queue
import numpy as np

def TSP_2OPT(adj):
    n = len(adj)
    # Step 1: compute an MST of the graph with Prim's algorithm
    inf = np.inf
    c = [inf]*n
    s = 0
    c[s] = 0
    visited = [False]*n
    parent = [None]*n
    h = queue.PriorityQueue()
    for v in range(n):
        h.put((c[v], v))
    while not h.empty():
        w, u = h.get()
        if visited[u]: continue
        visited[u] = True
        for v in range(n):
            if v == u: continue
            if (not visited[v]) and (c[v] > adj[u][v]):
                c[v] = adj[u][v]
                parent[v] = u
                h.put((c[v], v))
    # Step 2: build the adjacency lists of the MST from the parent pointers
    tree = [[] for _ in range(n)]
    for i in range(n):
        if parent[i] != None:
            tree[parent[i]].append(i)
    # Step 3: the preorder DFS traversal of the MST gives the approximate tour
    visited = [False]*n
    stack = [s]
    visited[s] = True
    path = []
    while len(stack) > 0:
        u = stack.pop(-1)
        path.append(u)
        for v in tree[u]:
            if not visited[v]:
                stack.append(v)
                visited[v] = True
    return path


The following animation shows the TSP path computed with the above approximation algorithm and compares with the OPT path computed using ILP for 20 points on 2D plane. The MST is computed with Prim’s algorithm.

# TSP with Simulated Annealing

The following python code snippet shows how to implement the Simulated Annealing to solve TSP, here G represents the adjacency matrix of the input graph.


import numpy as np

def TSP_SA(G):
    s = list(range(len(G)))
    c = cost(G, s)
    ntrial = 1
    T = 30
    alpha = 0.99
    while ntrial <= 1000:
        n = np.random.randint(0, len(G))
        while True:
            m = np.random.randint(0, len(G))
            if n != m:
                break
        s1 = swap(s, m, n)
        c1 = cost(G, s1)
        if c1 < c:
            s, c = s1, c1
        else:
            # accept a worse solution with probability exp(-(c1-c)/T)
            if np.random.rand() < np.exp(-(c1 - c)/T):
                s, c = s1, c1
        T = alpha*T
        ntrial += 1
    return s, c

def swap(s, m, n):
    # reverse the portion of the path between positions m and n
    i, j = min(m, n), max(m, n)
    s1 = s.copy()
    while i < j:
        s1[i], s1[j] = s1[j], s1[i]
        i += 1
        j -= 1
    return s1

def cost(G, s):
    l = 0
    for i in range(len(s)-1):
        l += G[s[i]][s[i+1]]
    l += G[s[len(s)-1]][s[0]]
    return l


The following animations show how the algorithm works:

The following animation shows the TSP path computed with SA for 100 points in 2D.

# TSP with Genetic Algorithm

Here in the following implementation of the above algorithm we shall have the following assumptions:

• We shall assume the crossover rate is 1.0, i.e., all individuals in a population participate in crossover. The mutation probability to be used is 0.1.
• With each crossover operation between two parent chromosomes, a couple of children are generated. We can’t just swap portions of the parents’ chromosomes; we need to be careful to make sure that each offspring represents a valid TSP path.
• Mutation is similar to swap operation implemented earlier.
• For each generation we shall keep a constant k=20 (or 30) chromosomes (representing candidate solutions for TSP).
• The fitness function will be the cost of the TSP path represented by each chromosome. Hence, we want to minimize the value of the fitness function, i.e., the lower the value of a chromosome, the fitter it is to survive.
• We shall use rank selection, i.e., after crossover and mutation, only the top k fittest offspring (i.e., with least fitness function value) will survive for the next generation.
• The following python code shows the implementation of the above algorithm with the above assumptions.

import numpy as np

def do_crossover(s1, s2, m):
    # child c1: keep the relative order of s2 for the head, then append the
    # tail s1[m:], so that the result is still a valid permutation
    s1, s2 = s1.copy(), s2.copy()
    c1 = s2.copy()
    for i in range(m, len(s1)): c1.remove(s1[i])
    for i in range(m, len(s1)): c1.append(s1[i])
    c2 = s1.copy()
    for i in range(m, len(s2)): c2.remove(s2[i])
    for i in range(m, len(s2)): c2.append(s2[i])
    return (c1, c2)

def do_mutation(s, m, n):
    # reverse the segment between positions m and n (same as the SA swap)
    i, j = min(m, n), max(m, n)
    s1 = s.copy()
    while i < j:
        s1[i], s1[j] = s1[j], s1[i]
        i += 1
        j -= 1
    return s1

def compute_fitness(G, s):
    # fitness = total cost of the TSP cycle represented by chromosome s
    l = 0
    for i in range(len(s)-1):
        l += G[s[i]][s[i+1]]
    l += G[s[len(s)-1]][s[0]]
    return l

def get_elite(G, gen, k):
    # rank selection: keep the k fittest (lowest-cost) chromosomes
    gen = sorted(gen, key=lambda s: compute_fitness(G, s))
    return gen[:k]

def TSP_GA(G, k=20, ntrial=200):
    n_p = k
    mutation_prob = 0.1
    # initial population: random permutations of the vertices
    gen = []
    path = list(range(len(G)))
    while len(gen) < n_p:
        path1 = path.copy()
        np.random.shuffle(path1)
        if not path1 in gen:
            gen.append(path1)

    for trial in range(ntrial):
        gen = get_elite(G, gen, k)
        gen_costs = [(round(compute_fitness(G, s), 3), s) for s in gen]
        next_gen = []
        for i in range(len(gen)):
            for j in range(i+1, len(gen)):
                c1, c2 = do_crossover(gen[i], gen[j],
                                      np.random.randint(0, len(gen[i])))
                next_gen.append(c1)
                next_gen.append(c2)
            if np.random.rand() < mutation_prob:
                m = np.random.randint(0, len(gen[i]))
                while True:
                    n = np.random.randint(0, len(gen[i]))
                    if m != n:
                        break
                c = do_mutation(gen[i], m, n)
                next_gen.append(c)
        gen = next_gen
    return get_elite(G, gen, 1)[0]  # fittest chromosome = best TSP path found


The following animation shows the TSP path computed with GA for 100 points in 2D.

# Coping with a few NP-Hard Problems with Python

Solving classic NP-hard problems such as 3-Coloring and Hamiltonian path with SAT solvers

In this blog we shall continue our discussion on a few NP-complete / NP-hard problems and will attempt to solve them (e.g., by encoding the problem as a satisfiability problem and solving it with a SAT-solver), working on the corresponding python implementations. The problems discussed here appeared as programming assignments in the Coursera course Advanced Algorithms and Complexity. The problem statements are taken from the course itself.

# Coloring a Graph with 3 colors using a SAT solver

Given a graph, we need to color its vertices with 3 different colors, so that any two vertices connected by an edge are of different colors. Graph coloring is an NP-complete problem, so we don’t currently know an efficient solution to it; instead, we reduce it to an instance of the SAT problem which, although also NP-complete, can often be solved efficiently in practice using special programs called SAT-solvers.

We can reduce the real-world problem of assigning frequencies to the transmitting towers of the cells in a GSM network to the problem of properly coloring a graph with 3 colors. Colors correspond to frequencies, vertices correspond to cells, and edges connect neighboring cells, as shown below.

• We need to output a boolean formula in the conjunctive normal form (CNF) in a specific format. If it is possible to color the vertices of the input graph in 3 colors such that any two vertices connected by an edge are of different colors, the formula must be satisfiable. Otherwise, the formula must be unsatisfiable.
• We shall use pySAT SAT-solver to solve the clauses generated from the graph.
• In particular, if there are n nodes and m edges in the graph that is to be colored with 3 colors, we need to generate 3*n variables and 3*m + n clauses.
• The i-th vertex will correspond to 3 variables, with id i, n+i and 2*n+i. They will represent whether the node is to be colored by red, green or blue.
• Any two vertices forming an edge must not have the same color.
• Each vertex must be colored by one from the 3 colors.

The following python code snippet shows how the encoding needs to be implemented to output the desired clauses for the CNF, which is subsequently solved by the SAT-solver, and how the solution assignments are used to color the vertices of the graph. Again, there are the following 3 basic steps:

• encode the input graph into SAT formula (reduce the 3-coloring problem to a satisfiability problem)
• solve with a SAT-solver and obtain the solution in terms of variable assignments
• Decode the solution — use the solution assignments by SAT-solver to color the vertices of the graph

import numpy as np
from pysat.solvers import Glucose3

def get_colors(assignments):
    all_colors = np.array(['red', 'green', 'blue'])
    colors = {}
    for v in range(n):
        colors[v+1] = all_colors[[assignments[v]>0, \
                 assignments[n+v]>0, assignments[2*n+v]>0]]
    return colors

def print_clauses(clauses):
    for c in clauses:
        vars = []
        for v in c:
            vars.append('{}x{}'.format('¬' if v < 0 else '', abs(v)))
        print('(' + ' OR '.join(vars) + ')')

def print_SAT_solution(assignments):
    sol = ''
    for x in assignments:
        sol += 'x{}={} '.format(abs(x), x>0)
    print(sol)

def solve_graph_coloring():
    # encode the input graph into a SAT formula
    n_clauses = 3*m + n
    n_vars = 3*n
    clauses = []
    for u, v in edges:
        clauses.append((-u, -v))               # not both red
        clauses.append((-(n+u), -(n+v)))       # not both green
        clauses.append((-(2*n+u), -(2*n+v)))   # not both blue
    for v in range(1, n+1):
        clauses.append((v, n+v, 2*n+v))        # at least one color
    print_clauses(clauses)
    # solve SAT and obtain the solution in terms of variable assignments
    g = Glucose3()
    for c in clauses:
        g.add_clause(c)
    status = g.solve()
    assignments = g.get_model()
    print(status)
    print_SAT_solution(assignments)
    # use the solution assignments by SAT to color the graph
    colors = get_colors(assignments)
    print(colors)
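The clause-counting claim above (3*n variables, 3*m + n clauses) can be checked with a standalone sketch of the encoder that only generates the CNF, without invoking a SAT-solver (function names here are illustrative):

```python
# Generate the 3-coloring CNF described above: for each edge, three
# "not both the same color" clauses; for each vertex, one
# "at least one color" clause. Vertices are numbered 1..n.
def color_clauses(n, edges):
    clauses = []
    for u, v in edges:
        for c in range(3):                  # forbid a shared color on an edge
            clauses.append((-(c*n + u), -(c*n + v)))
    for v in range(1, n + 1):               # every vertex gets at least one color
        clauses.append((v, n + v, 2*n + v))
    return clauses

triangle = [(1, 2), (2, 3), (1, 3)]
clauses = color_clauses(3, triangle)
print(len(clauses))  # 3*3 + 3 = 12
```

For the triangle (n = 3, m = 3) this yields 12 clauses over 9 variables, matching the 3*m + n and 3*n counts stated earlier.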


The following animations show the input graphs, the corresponding variables and clauses generated for the CNF to be solved with the SAT-solver, the solution obtained in terms of truth-assignments of the variables, and then how those assignments are used to solve the graph-coloring problem, i.e., to color the graph with 3 colors. Each row in the right subgraph represents a clause in the CNF.

The last graph is the Petersen graph, which has chromatic number 3, so it can be colored with 3 colors, and a 3-coloring solution is obtained from the SAT-solver’s output assignments.

Now, the following example shows an input graph which is NOT 3-colorable, with the CNF formed by the clauses generated to be solved by the SAT-solver being unsatisfiable / inconsistent.

# Solving the Hamiltonian Path problem with SAT-solver

In this problem, we shall learn how to solve the classic Hamiltonian Path problem, by designing and implementing an efficient algorithm to reduce it to SAT.

The following python snippet shows how:

1. the Hamiltonian Path problem is reduced to SAT,

2. the resulting CNF formula is solved by the pysat SAT-solver, and

3. the solution is interpreted to construct the Hamiltonian path for the given input graphs.


def get_hamiltonian_path(assignments):
    path = [None]*n
    for i in range(n):
        for j in range(n):
            if assignments[i*n+j] > 0: # True: position i holds vertex j+1
                path[i] = j+1
    return path

def reduce_Hamiltonian_Path_to_SAT_and_solve(edges):

    def index(i, j):  # variable for "position i holds vertex j"
        return n*i + j + 1

    m = len(edges)
    n_clauses = 2*n + (2*n*n-n-m)*(n-1)
    n_vars = n*n
    clauses = []

    for j in range(n):          # each vertex appears on the path
        clause = []
        for i in range(n):
            clause.append(index(i, j))
        clauses.append(clause)
    for i in range(n):          # each position is occupied
        clause = []
        for j in range(n):
            clause.append(index(i, j))
        clauses.append(clause)
    for j in range(n):          # no vertex occupies two positions
        for i in range(n):
            for k in range(i+1, n):
                clauses.append((-index(i, j), -index(k, j)))
    for i in range(n):          # no position holds two vertices
        for j in range(n):
            for k in range(j+1, n):
                clauses.append((-index(i, j), -index(i, k)))
    for k in range(n-1):        # consecutive positions must be joined by an edge
        for i in range(n):
            for j in range(n):
                if i == j: continue
                if not [i+1, j+1] in edges:
                    clauses.append((-index(k, i), -index(k+1, j)))
    print_clauses(clauses)
    g = Glucose3()
    for c in clauses:
        g.add_clause(c)
    status = g.solve()
    assignments = g.get_model()
    print(status)
    print_SAT_solution(assignments)
    path = get_hamiltonian_path(assignments)
    print(path)
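For small graphs, the answer the SAT reduction produces can be cross-checked with a brute-force search over vertex permutations (exponential, so only useful as a test oracle — this is a standalone sketch, not part of the course code):

```python
# Brute-force Hamiltonian path: try every ordering of the vertices and
# keep one whose consecutive pairs are all edges of the graph.
from itertools import permutations

def hamiltonian_path_bruteforce(n, edges):
    es = {frozenset(e) for e in edges}   # undirected edge set
    for p in permutations(range(1, n + 1)):
        if all(frozenset((p[i], p[i+1])) in es for i in range(n - 1)):
            return list(p)
    return None

# the line graph 1-2-3-4 has the obvious Hamiltonian path
print(hamiltonian_path_bruteforce(4, [(1, 2), (2, 3), (3, 4)]))  # [1, 2, 3, 4]
```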


The following figure shows the output Hamiltonian Path obtained for the line input graph using the solution found by the SAT-solver.

The following figure shows the Hamiltonian Path obtained with the SAT-solver for the Petersen graph as input, which indeed has a Hamiltonian Path.

To be continued…

# Graph Algorithms with Python

In this blog we shall discuss about a few popular graph algorithms and their python implementations. The problems discussed here appeared as programming assignments in the coursera course Algorithms on Graphs and on Rosalind. The problem statements are taken from the course itself.

The basic building blocks of graph algorithms such as computing the number of connected components, checking whether there is a path between the given two vertices, checking whether there is a cycle, etc are used practically in many applications working with graphs: for example, finding shortest paths on maps, analyzing social networks, analyzing biological data.

For all the problems it can be assumed that the given input graph is simple, i.e., it does not contain self-loops (edges going from a vertex to itself) and parallel edges.

# Checking if two vertices are Reachable

Given an undirected graph G=(V,E) and two distinct vertices 𝑢 and 𝑣, check if there is a path between 𝑢 and 𝑣.

## Steps

1. Starting from the node u, we can simply use breadth first search (bfs) or depth-first search (dfs) to explore the nodes reachable from u.
2. As soon as we find v, we can return that the nodes are reachable from one another.
3. If v is not among the nodes explored, we can conclude that v is not reachable from u.
4. The following implementation (demonstrated using the following animations) uses iterative dfs (with stack) to check if two nodes (initially colored pink) are reachable.
5. We can optionally implement coloring of the nodes w.r.t. the following convention: initially the nodes are all white, when they are visited (pushed onto stack) they are marked as gray and finally when all the adjacent (children) nodes for a given node are visited, the node can be marked black.
6. We can store the parent of each node in an array and we can extract the path between two given nodes using the parent array, if they are reachable.

A very basic python implementation of the iterative dfs is shown below (here adj represents the adjacency list representation of the input graph):

def reach(adj, x, y):
    n = len(adj)
    visited = [False]*n
    stack = [x]
    visited[x] = True
    while len(stack) > 0:
        u = stack.pop(-1)
        for v in adj[u]:
            if not visited[v]:
                stack.append(v)
                visited[v] = True
                if v == y:
                    return 1
    return 0


The following animations demonstrate how the algorithm works; the stack is also shown at different points in time during the execution. Finally, the path between the nodes is shown, if they are reachable.

The same algorithm can be used for finding an Exit from a Maze (check whether there is a path from a given cell to a given exit).

# Find Connected Components in an Undirected Graph

Given an undirected graph with 𝑛 vertices and 𝑚 edges, compute the number of connected components in it.

## Steps

1. The following simple modification in dfs can be used to find the number of connected components in an undirected graph, as shown in the following figure.
2. From each node we need to find all the nodes yet to be explored.
3. We can find the nodes in a given component by finding all the nodes reachable from a given node.
4. The same iterative dfs implementation was used (demonstrated with the animations below).
5. The nodes in a component found were colored using the same color.

Python Code

def number_of_components(adj):
    n = len(adj)
    visited = [False]*n
    result = 0
    for x in range(n):
        if not visited[x]:
            result += 1
            # iterative dfs marks everything reachable from x
            stack = [x]
            visited[x] = True
            while len(stack) > 0:
                u = stack.pop(-1)
                for v in adj[u]:
                    if not visited[v]:
                        stack.append(v)
                        visited[v] = True
    return result


The same algorithm can be used to decide that there are no dead zones in a maze, that is, that at least one exit is reachable from each cell.

# Find Euler Tour / Circuit in a Directed Graph

Given a directed graph that contains an Eulerian tour, where the graph is given in the form of an adjacency list, compute the Euler tour.

## Steps

1. While solving the famous Königsberg Bridge Problem, Euler proved that an Euler circuit exists in a connected undirected graph iff all its nodes have even degree.
2. For an Euler tour (path) to exist in an undirected graph with odd-degree nodes, there must be exactly 2 nodes with odd degree — the tour starts at one such node and ends at the other.
3. The tour must visit all the edges in the graph exactly once.
4. The degree test can be extended to directed graphs (with in-degrees and out-degrees) and used to determine the existence of an Euler tour / circuit in a digraph.
5. Fleury’s algorithm iteratively removes edges from the graph (choosing, whenever possible, an edge that does not burn a bridge) and adds them to the tour.
6. DFS can be modified to obtain the Euler tour (circuit) in a digraph.
7. The following figure shows these couple of algorithms for finding the Euler tour / circuit in a graph if one exists (note that in the figure, path is meant to indicate tour). We shall use DFS to find the Euler tour.

The following code snippets represent the functions used to find Euler tour in a graph. Here, variables n and m represent the number of vertices and edges of the input DiGraph, respectively, whereas adj represents the corresponding adjacency list.

Python Code

def count_in_out_degrees(adj):
    in_deg, out_deg = [0]*n, [0]*n
    for u in range(n):
        for v in adj[u]:
            out_deg[u] += 1
            in_deg[v] += 1
    return in_deg, out_deg

def get_start_if_Euler_tour_present(in_deg, out_deg):
    start, end, tour = None, None, True
    for i in range(len(in_deg)):
        d = out_deg[i] - in_deg[i]
        if abs(d) > 1:
            tour = False
            break
        elif d == 1:
            start = i
        elif d == -1:
            end = i
    tour = tour and ((start != None and end != None) or \
                     (start == None and end == None))
    if tour and start == None: # a circuit: can start anywhere
        start = 0
    return (tour, start)

def dfs(adj, v, out_deg, tour):
    # greedily follow unused out-edges; prepend v once it has none left
    while out_deg[v] > 0:
        out_deg[v] -= 1
        dfs(adj, adj[v][out_deg[v]], out_deg, tour)
    tour[:] = [v] + tour

def compute_Euler_tour(adj):
    in_deg, out_deg = count_in_out_degrees(adj)
    tour_present, start = get_start_if_Euler_tour_present(in_deg, \
                                                          out_deg)
    if not tour_present:
        return None
    tour = []
    dfs(adj, start, out_deg, tour)
    # a valid tour uses all m edges, i.e., lists m+1 vertices
    return tour if len(tour) == m+1 else None
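A compact standalone version of this DFS-based construction (Hierholzer’s idea), checked on the directed triangle 0→1→2→0, illustrates the edge-count test at the end:

```python
# Euler circuit by DFS: consume unused out-edges greedily, prepend each
# vertex to the tour once all of its out-edges have been used.
def euler_tour(adj):
    out_deg = [len(a) for a in adj]
    tour = []
    def dfs(v):
        while out_deg[v] > 0:
            out_deg[v] -= 1
            dfs(adj[v][out_deg[v]])
        tour.insert(0, v)
    dfs(0)  # assumes a circuit exists, so any start vertex works
    m = sum(len(a) for a in adj)
    return tour if len(tour) == m + 1 else None

print(euler_tour([[1], [2], [0]]))  # [0, 1, 2, 0]
```

If the graph has unreachable edges (no single circuit covers everything), the tour comes out shorter than m+1 vertices and the function returns None.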


The following animations show how the algorithm works.

# Cycle Detection in a Directed Graph

Check whether a given directed graph with 𝑛 vertices and 𝑚 edges contains a cycle.

The following figure shows the classification of the edges encountered in DFS:

It can be shown that whether a directed graph is acyclic (a DAG) or not (i.e., whether it contains a cycle) can be checked by looking for a back-edge during DFS traversal.

## Steps

1. Use the recursive DFS implementation (pseudo-code shown in the below figure)
2. Track whether a node to be visited is already on the stack: if it is, the edge to it forms a back edge.
3. Use parents array to obtain the directed cycle, if found.

Python code

def cyclic(adj):
    n = len(adj)
    visited = [False]*n
    parents = [None]*n
    on_stack = [False]*n
    cycle = []

    def dfs(u):
        visited[u] = True
        on_stack[u] = True
        for v in adj[u]:
            if not visited[v]:
                parents[v] = u
                dfs(v)
            elif on_stack[v]: # back edge: v is an ancestor of u
                x = u
                while x != v:
                    cycle.append(x)
                    x = parents[x]
                cycle[:] = [v] + cycle
                #print(cycle)
        on_stack[u] = False

    for v in range(n):
        if not visited[v]:
            dfs(v)

    return int(len(cycle) > 0)


The above algorithm can be used to check consistency in a curriculum (e.g., there is no cyclic dependency in the prerequisites for each course listed).

# Topologically Order a Directed Graph

Compute a topological ordering of a given directed acyclic graph (DAG) with 𝑛 vertices and 𝑚 edges.

The following idea can be used to obtain such an ordering in a DiGraph G:

• Find sink.
• Put at end of order.
• Remove from graph.
• Repeat.
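The four-step idea can be sketched directly as a standalone (quadratic) implementation before turning to the efficient dfs version — the function name here is illustrative:

```python
# Topological order by repeated sink removal: a DAG always has a sink
# (a vertex with no outgoing edges); put it at the end of the order,
# delete it from the graph, and repeat.
def toposort_by_sink_removal(adj):
    n = len(adj)
    remaining = set(range(n))
    out = {u: set(adj[u]) for u in range(n)}
    order = []
    while remaining:
        sink = next(u for u in remaining if not out[u])
        order.insert(0, sink)          # sinks accumulate at the end
        remaining.remove(sink)
        for u in remaining:
            out[u].discard(sink)       # delete the sink from the graph
    return order

# DAG: 0→1, 0→2, 1→3, 2→3
print(toposort_by_sink_removal([[1, 2], [3], [3], []]))
```

Every edge u→v ends up with u before v in the output, which is exactly the topological-ordering property.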

It can be implemented efficiently with dfs by keeping track of the time of pre and post visiting the nodes.

## Steps

1. Use the recursive implementation of dfs.
2. When visiting a node with a recursive call, record the pre and post visit times, at the beginning and the end (once all the children of the given node already visited) of the recursive call, respectively.
3. Reorder the nodes by their post-visit times in descending order, as shown in the following figure.

Python code


def toposort(adj):
    n = len(adj)
    order = []
    visited = [False]*n
    previsit = [0]*n
    postvisit = [0]*n
    clock = [0]  # mutable clock so the nested dfs can update it

    def dfs(u):
        visited[u] = True
        previsit[u] = clock[0]
        clock[0] += 1
        for v in adj[u]:
            if not visited[v]:
                dfs(v)
        postvisit[u] = clock[0]
        clock[0] += 1

    for v in range(n):
        if not visited[v]:
            dfs(v)

    # sort the vertices in decreasing order of post-visit time
    order = [x for _, x in sorted(zip(postvisit, range(n)), \
             key=lambda pair: pair, reverse=True)]

    return order


The above algorithm can be used to determine the order of the courses in a curriculum, taking care of the pre-requisites dependencies.

# KosaRaju’s Algorithm to find the Strongly Connected Components (SCCs) in a Digraph

Compute the number of strongly connected components of a given directed graph with 𝑛 vertices and 𝑚 edges.

Note the following:

• dfs can be used to find SCCs in a Digraph.
• We need to make sure that the dfs traversal does not leave such a component via an outgoing edge.
• The sink component is one that does not have an outgoing edge. We need to find a sink component first.
• The vertex with the largest post-visit time lies in a source component of the dfs.
• The reverse (or transpose) graph of G has same SCC as the original graph.
• Source components of the transpose of G are sink components of G.

With all the above information, the following algorithm can be implemented to obtain the SCCs in a digraph G.

The runtime of the algorithm is again O(|V |+|E|). Alternatively, the algorithm can be represented as follows:

Python Code


def number_of_strongly_connected_components(adj):
    n = len(adj)
    result = 0
    visited = [False]*n
    previsit = [0]*n
    postvisit = [0]*n
    clock = [0]

    # build the reverse (transpose) graph
    new_adj = [[] for _ in range(n)]
    for u in range(n):
        for v in adj[u]:
            new_adj[v].append(u)

    def dfs(u, adj):
        visited[u] = True
        previsit[u] = clock[0]
        clock[0] += 1
        for v in adj[u]:
            if not visited[v]:
                dfs(v, adj)
        postvisit[u] = clock[0]
        clock[0] += 1

    # first pass: dfs on the reverse graph to compute post-visit times
    for v in range(n):
        if not visited[v]:
            dfs(v, new_adj)
    post_v = [x for _, x in sorted(zip(postvisit, range(n)), \
              key=lambda pair: pair, reverse=True)]
    # second pass: dfs on the original graph in decreasing post-visit
    # order; each outer dfs call discovers exactly one SCC
    visited = [False]*n
    for v in post_v:
        if not visited[v]:
            dfs(v, adj)
            result += 1
    return result


The following animations show how the algorithm works:

# Shortest Path in an Undirected Graph with BFS

Given an undirected graph with 𝑛 vertices and 𝑚 edges and two vertices 𝑢 and 𝑣, compute the length of a shortest path between 𝑢 and 𝑣 (that is, the minimum number of edges in a path from 𝑢 to 𝑣).

• The following figure shows the algorithm for bfs.
• It uses a queue (FIFO) instead of a stack (LIFO) to store the nodes to be explored.
• The traversal is also called level-order traversal, since the nodes reachable with smaller number of hops (shorter distances) from the source / start node are visited earlier than the higher-distance nodes.
• The running time of bfs is again O(|E| + |V |).

Python code


def distance(adj, s, t):
    if s == t:
        return 0
    n = len(adj)
    inf = 10**9 #float('Inf')
    d = [inf]*n
    queue = [s]
    d[s] = 0
    while len(queue) > 0:
        u = queue.pop(0)
        for v in adj[u]:
            if d[v] == inf:
                queue.append(v)
                d[v] = d[u] + 1
                if v == t:
                    return d[t]
    return -1


The following animations demonstrate how the algorithm works. The queue used to store the vertices to be traversed is also shown.

The above algorithm can be used to compute the minimum number of flight segments to get from one city to another one.

# Checking if a Graph is Bipartite (2-Colorable)

Given an undirected graph with 𝑛 vertices and 𝑚 edges, check whether it is bipartite.

## Steps

1. Note that a graph is bipartite iff its vertices can be colored using 2 colors (so that no two adjacent vertices have the same color).
2. Use bfs to traverse the graph from a starting node and color the nodes in alternate levels (measured by distance from the source) with red (even levels) and blue (odd levels).
3. If at any point two adjacent vertices are found to be colored with the same color, then the graph is not 2-colorable (hence not bipartite).
4. If no such case occurs, then the graph is bipartite.
5. For a graph with multiple components, run bfs on each of them.
6. Also, note that a graph is bipartite iff it contains no odd-length cycle.

Python code


def bipartite(adj):
    n = len(adj)
    color = [None]*n
    for vertex in range(n):  # handle every connected component
        if not color[vertex]:
            queue = [vertex]
            color[vertex] = 'red'
            while len(queue) > 0:
                u = queue.pop(0)
                for v in adj[u]:
                    if color[v] == color[u]:
                        return 0
                    if not color[v]:
                        queue.append(v)
                        color[v] = 'red' if color[u] == 'blue' else 'blue'
    return 1


The following animations demonstrate how the algorithm works.

Notice that the last graph contained a cycle of length 4 (an even-length cycle), and it was bipartite. The graph prior to it contained a triangle (a cycle of odd length 3), so it was not bipartite.

# Shortest Path in a Weighted Graph with Dijkstra

Given a directed graph with positive edge weights and with 𝑛 vertices and 𝑚 edges as well as two vertices 𝑢 and 𝑣, compute the weight of a shortest path between 𝑢 and 𝑣 (that is, the minimum total weight of a path from 𝑢 to 𝑣).

• Optimal substructure property : any subpath of an optimal path is also optimal.
• Initially, we only know the distance to source node S and relax all the edges from S.
• Maintain a set R of vertices for which dist is already set correctly (known region).
• On each iteration take a vertex outside of R with the minimum dist-value, add it to R, and relax all its outgoing edges.
• The next figure shows the pseudocode of the algorithm.
• The algorithm works for any graph with non-negative edge weights.
• The algorithm does not work if the input graph has negative edge weights (its correctness relies on the assumption that the dist-value of a vertex in the known region cannot be reduced any further; that assumption breaks down in the presence of negative edges).

Python code


import queue

def distance(adj, cost, s, t):
    n = len(adj)
    inf = 10**19
    d = [inf]*n
    d[s] = 0
    visited = [False]*n
    h = queue.PriorityQueue()
    for v in range(n):
        h.put((d[v], v))
    while not h.empty():
        _, u = h.get()
        if visited[u]: continue
        visited[u] = True
        # relax all the outgoing edges of u
        for i, v in enumerate(adj[u]):
            if d[v] > d[u] + cost[u][i]:
                d[v] = d[u] + cost[u][i]
                h.put((d[v], v))
    return d[t] if d[t] != inf else -1


The following animations show how the algorithm works.

The algorithm works for the undirected graphs as well. The following animations show how the shortest path can be found on undirected graphs.

# Detecting Negative Cycle in a Directed Graph with Bellman-Ford

Given a directed graph with possibly negative edge weights and with 𝑛 vertices and 𝑚 edges, check whether it contains a cycle of negative weight. Also, given a vertex 𝑠, compute the length of shortest paths from 𝑠 to all other vertices of the graph.

• For a Digraph with n nodes (without a negative cycle), the shortest path length in between two nodes (e.g., the source node and any other node) can be at most n-1.
• Relax edges while dist changes (at most n-1 times, most of the times the distances will stop changing much before that).
• This algorithm works even for negative edge weights.
• The following figure shows the pseudocode of the algorithm.

The above algorithm can also be used to detect a negative cycle in the input graph.

• Run n = |V| iterations of the Bellman–Ford algorithm (i.e., run the edge-relaxation once more, for the n-th time).
• If there exists a vertex whose distance still decreases, then there exists a negative-weight cycle in the graph, and that vertex is reachable from the cycle.
• Save a node v relaxed on the last iteration — v is reachable from a negative cycle.
• Starting from x ← v, follow the links x ← prev[x] |V| times — x will then definitely lie on the cycle.
• Save y ← x and follow x ← prev[x] until x = y again; the nodes visited form the cycle.
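The recovery steps can be sketched as a standalone function (hedged: distances are initialised to 0, which acts as an implicit super-source so that cycles anywhere in the graph are detected, not only those reachable from a fixed start vertex):

```python
# Find a negative cycle: n rounds of relaxation with prev[] recorded;
# take a node v relaxed on the last round, walk prev[] n times to land
# on the cycle, then follow prev[] until the start node repeats.
def find_negative_cycle(n, edges):  # edges: list of (u, v, w)
    d = [0]*n                       # 0-init: implicit super-source
    prev = [None]*n
    v = None
    for _ in range(n):
        v = None
        for (a, b, w) in edges:
            if d[b] > d[a] + w:
                d[b] = d[a] + w
                prev[b] = a
                v = b
    if v is None:                   # nothing relaxed in round n: no cycle
        return None
    for _ in range(n):              # step back n times: surely on the cycle
        v = prev[v]
    cycle, x = [v], prev[v]
    while x != v:
        cycle.append(x)
        x = prev[x]
    return cycle[::-1]              # listed in forward edge direction

# 0→1→2→0 with total weight 1 + 1 - 3 = -1
print(find_negative_cycle(3, [(0, 1, 1), (1, 2, 1), (2, 0, -3)]))  # [1, 2, 0]
```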

The above algorithm can be used to detect infinite arbitrage, with the following steps:

• Do |V| iterations of Bellman–Ford and save all nodes relaxed on the V-th iteration into a set A
• Put all nodes from A in queue Q
• Do breadth-first search with queue Q and find all nodes reachable from A
• All those nodes and only those can have infinite arbitrage

Python code


def negative_cycle(adj, cost, s=0):
    n = len(adj)
    inf = 10**19 # float('Inf')
    d = [inf]*n
    d[s] = 0
    # relax all the edges n-1 times
    for k in range(n-1):
        for u in range(n):
            for i, v in enumerate(adj[u]):
                if d[u] != inf and d[v] > d[u] + cost[u][i]:
                    d[v] = d[u] + cost[u][i]
    # n-th pass: any further relaxation implies a negative cycle
    for u in range(n):
        for i, v in enumerate(adj[u]):
            if d[u] != inf and d[v] > d[u] + cost[u][i]:
                return 1
    return 0


The following animations demonstrate how the algorithm works.

The last animation shows how the shortest paths can be computed with the algorithm in the presence of negative edge weights.

# Find Minimum Spanning Tree in a Graph with Prim’s Greedy Algorithm

Given 𝑛 points on a plane, connect them with segments of minimum total length such that there is a path between any two points.

## Steps

• Let’s construct the following undirected graph G=(V, E): each node (in V) in the graph corresponds to a point in the plane and the weight of an edge (in E) in between any two nodes corresponds to the distance between the nodes.
• The length of a segment with endpoints (𝑥1, 𝑦1) and (𝑥2, 𝑦2) is equal to the Euclidean distance √((𝑥1 − 𝑥2)² + (𝑦1 − 𝑦2)²).
• Then the problem boils down to finding a Minimum Spanning Tree in the graph G.
• The following figure defines the MST problem and two popular greedy algorithms Prim and Kruskal to solve the problem efficiently.

We shall use Prim’s algorithm to find the MST in the graph. Here are the algorithm steps:

• It starts with an initial vertex and is grown to a spanning tree X.
• X is always a subtree, grows by one edge at each iteration
• Add a lightest edge between a vertex of the tree and a vertex not in the tree
• Very similar to Dijkstra’s algorithm
• The pseudocode is shown in the following figure.

Python code


import math
import queue

def minimum_distance(x, y):
    result = 0.
    inf = 10**19
    n = len(x)
    adj = [[0 for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(i+1, n):
            adj[i][j] = adj[j][i] = math.sqrt((x[i]-x[j])**2 + \
                                              (y[i]-y[j])**2)
    c = [inf]*n
    s = 0
    c[s] = 0
    visited = [False]*n
    parent = [None]*n
    h = queue.PriorityQueue()
    for v in range(n):
        h.put((c[v], v))
    while not h.empty():
        w, u = h.get()
        if visited[u]: continue
        visited[u] = True
        # update the attachment costs of the non-tree vertices
        for v in range(n):
            if v == u: continue
            if (not visited[v]) and (c[v] > adj[u][v]):
                c[v] = adj[u][v]
                parent[v] = u
                h.put((c[v], v))
    spanning_tree = []
    for i in range(n):
        if parent[i] != None:
            spanning_tree.append((i, parent[i]))
            result += adj[i][parent[i]]
    #print(spanning_tree)
    return result
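A quick check of the idea on the unit square: the MST of the four corner points consists of three unit-length segments, so its total length is 3.0. This standalone sketch uses a lazy Prim with heapq rather than the PriorityQueue version above:

```python
# Lazy Prim: pop the cheapest crossing edge; skip vertices already in
# the tree; add the edge weight when a new vertex is attached.
import math, heapq

def mst_cost(x, y):
    n = len(x)
    dist = lambda i, j: math.sqrt((x[i]-x[j])**2 + (y[i]-y[j])**2)
    visited, h, cost = [False]*n, [(0.0, 0)], 0.0
    while h:
        w, u = heapq.heappop(h)
        if visited[u]:
            continue
        visited[u] = True
        cost += w
        for v in range(n):
            if not visited[v]:
                heapq.heappush(h, (dist(u, v), v))
    return cost

print(mst_cost([0, 1, 0, 1], [0, 0, 1, 1]))  # 3.0
```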


The following animations demonstrate how the algorithm works for a few different set of input points in 2-D plane.

The following two animations show the algorithm steps on the input points and the corresponding graph formed, respectively.

The following animation shows how Prim’s algorithm outputs an MST for 36 input points in a 6×6 grid.

As expected, the total cost of the MST formed is 35 for the 36 points, since a spanning tree with n nodes has n−1 edges, each of unit length here.

The above algorithm can be used to build roads between some pairs of the given cities such that there is a path between any two cities and the total length of the roads is minimized.

# Finding Minimum Spanning Tree and Hierarchical Clustering with Kruskal

Given 𝑛 points on a plane and an integer 𝑘, compute the largest possible value of 𝑑 such that the given points can be partitioned into 𝑘 non-empty subsets in such a way that the distance between any two points from different subsets is at least 𝑑.

## Steps

• We shall use Kruskal’s algorithm to solve the above problem. Each point can be thought of a node in the graph, as earlier, with the edge weights between the nodes being equal to the Euclidean distance between them.
• Run iterations of the Kruskal’s algorithm and merge the components till there are exactly k (< n) components left.
• These k components will be the k desired clusters of the points and we can compute d to be the largest distance in between two points belonging to different clusters.

Here are the steps for the Kruskal’s algorithm to compute MST:

• Start with each node in the graph as a single-node tree in a forest X.
• Repeatedly add to X the next lightest edge e that doesn’t produce a cycle.
• At any point of time, the set X is a forest (i.e., a collection of trees)
• The next edge e connects two different trees — say, T1 and T2
• The edge e is the lightest between T1 and V ∖ T1, hence adding e is safe
• Use disjoint sets data structure for implementation
• Initially, each vertex lies in a separate set
• Each set is the set of vertices of a connected component
• To check whether the current edge {u, v} produces a cycle, we check whether u and v belong to the same set (find).
• Merge two sets using union operation.
• The following figure shows a few algorithms to implement union-find abstractions for disjoint sets.
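The union-find abstraction with the weighted-union and path-compression heuristics can be sketched as follows (a standalone illustration; the clustering code in this post uses a simpler quadratic quick-find instead):

```python
# Disjoint-set (union-find) with union by rank and path compression:
# near-constant amortized time per find/union.
class DisjointSet:
    def __init__(self, n):
        self.parent = list(range(n))
        self.rank = [0]*n

    def find(self, i):
        if self.parent[i] != i:
            self.parent[i] = self.find(self.parent[i])  # path compression
        return self.parent[i]

    def union(self, i, j):
        ri, rj = self.find(i), self.find(j)
        if ri == rj:
            return False            # same set: this edge would close a cycle
        if self.rank[ri] < self.rank[rj]:
            ri, rj = rj, ri
        self.parent[rj] = ri        # attach the lower-rank root under the higher
        if self.rank[ri] == self.rank[rj]:
            self.rank[ri] += 1
        return True

ds = DisjointSet(4)
ds.union(0, 1); ds.union(2, 3)
print(ds.find(0) == ds.find(1), ds.find(0) == ds.find(2))  # True False
```

In Kruskal, `union` returning False is exactly the cycle test: the edge is skipped, otherwise it joins two trees of the forest.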

Python code


import math

def clustering(x, y, k):
    inf = float('Inf') #10**19
    n = len(x)
    # all pairwise edges sorted by Euclidean distance (Kruskal order)
    edges = []
    for i in range(n):
        for j in range(i+1, n):
            edges.append((math.sqrt((x[i]-x[j])**2 + \
                                    (y[i]-y[j])**2), i, j))
    edges.sort()
    indices = {i:i for i in range(n)}
    e = 0
    # merge components with the lightest edges until only k are left
    while len(set(indices.values())) > k:
        _, u, v = edges[e]
        e += 1
        iu, iv = indices[u], indices[v]
        # implemented quick-find here
        # To-do: implement weighted union with path-compression heuristic
        if iu != iv:
            for j in range(n):
                if indices[j] == max(iu, iv):
                    indices[j] = min(iu, iv)
    clusters = {}
    for i in range(n):
        ci = indices[i]
        clusters[ci] = clusters.get(ci, []) + [i]
    #print(clusters)
    # d = the minimum distance between two points in different clusters
    d = inf
    for i in list(clusters.keys()):
        for j in list(clusters.keys()):
            if i == j: continue
            for vi in clusters[i]:
                for vj in clusters[j]:
                    d = min(d, math.sqrt((x[vi]-x[vj])**2 + \
                                         (y[vi]-y[vj])**2))
    return d


The following animations demonstrate how the algorithm works. Next couple of animations show how 8 points in a plane are (hierarchically) clustered using Kruskal’s algorithm and finally the MST is computed. The first animation shows how the points are clustered and the next one shows how Kruskal works on the corresponding graph created from the points.

The next animation again shows how a set of points in 2-D are clustered using Kruskal and MST is computed.

Clustering is a fundamental problem in data mining. The goal is to partition a given set of objects into subsets (or clusters) in such a way that any two objects from the same subset are close (or similar) to each other, while any two objects from different subsets are far apart.

Now we shall use Kruskal’s algorithm to cluster a small real-world dataset named the Iris dataset (can be downloaded from the UCI Machine learning repository), often used as a test dataset in Machine learning.

The next 3 animations show how Kruskal can be used to cluster the Iris dataset. The first few rows of the dataset are shown below:

Let’s use the first two features (SepalLength and SepalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Now, let’s use the second and third feature variables (SepalWidth and PetalLength), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Finally, let’s use the third and fourth feature variables (PetalLength and PetalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

# Friend Suggestion in Social Networks — Bidirectional Dijkstra

Compute the distance between several pairs of nodes in the network.

## Steps

• Build reverse graph GR
• Start Dijkstra from s in G and from t in GR
• Alternate between Dijkstra steps in G and in GR
• Stop when some vertex v is processed both in G and in GR
• Compute the shortest path between s and t

Meet-in-the-middle

• More general idea, not just for graphs
• Instead of searching for all possible objects, search for first halves and for second halves separately
• Then find compatible halves
• Typically roughly O(√N) instead of O(N)

Friends suggestions in social networks

• Find the shortest path from Michael to Bob via friends connections
• For the two “farthest” people, Dijkstra has to look through 2 billion people
• If we only consider friends of friends of friends for both Michael and Bob, we will find a connection
• Roughly 1M friends of friends of friends
• 1M + 1M = 2M people – 1000 times less
• Dijkstra goes in circles
• Bidirectional search idea can reduce the search space
• Bidirectional Dijkstra can be 1000s times faster than Dijkstra for social networks
• The following figure shows the algorithm

The following python function implements bidirectional Dijkstra; the arguments adj and adjR represent the adjacency lists of the original and the reverse graph, respectively.

Python code



import queue

inf = 10**19

def process(u, adj, cost, h, d, prev, visited, visited_any):
    # relax all the outgoing edges of u
    for i, v in enumerate(adj[u]):
        if d[v] > d[u] + cost[u][i]:
            d[v] = d[u] + cost[u][i]
            h.put((d[v], v))
            prev[v] = u
    visited[u] = True
    visited_any.add(u)

def shortest_path(s, dist, prev, t, distR, prevR, visited_any):
    # the meeting vertex u minimizing dist(s, u) + dist(u, t)
    distance = inf
    ubest = None
    for u in visited_any:
        if dist[u] + distR[u] < distance:
            ubest = u
            distance = dist[u] + distR[u]
    return distance if ubest != None else -1

def bidirectional_dijkstra(adj, cost, adjR, costR, s, t):
    n = len(adj)
    d, dR = [inf]*n, [inf]*n
    d[s] = 0
    dR[t] = 0
    visited, visitedR = [False]*n, [False]*n
    visited_any = set([])
    prev, prevR = [None]*n, [None]*n
    h = queue.PriorityQueue()
    h.put((d[s], s))
    hR = queue.PriorityQueue()
    hR.put((dR[t], t))
    while True:
        if h.empty() or hR.empty():
            return -1
        # one step of the forward search
        _, u = h.get()
        if not visited[u]:
            process(u, adj, cost, h, d, prev, visited, visited_any)
            if visitedR[u]:
                return shortest_path(s, d, prev, t, dR, prevR, visited_any)
        # one step of the backward search (on the reverse graph)
        _, uR = hR.get()
        if not visitedR[uR]:
            process(uR, adjR, costR, hR, dR, prevR, visitedR, visited_any)
            if visited[uR]:
                return shortest_path(s, d, prev, t, dR, prevR, visited_any)


The following animation shows how the algorithm works:

# Computing Distance Faster Using Coordinates with A* search

Compute the distance between several pairs of nodes in the network. The length l of an edge between any two nodes u and v is guaranteed to satisfy 𝑙 ≥ √((𝑥(𝑢) − 𝑥(𝑣))² + (𝑦(𝑢) − 𝑦(𝑣))²).

• Let’s say we are submitting a query to find a shortest path between the nodes (s, t) of a graph.
• A potential function 𝜋(v) for a node v in the graph is an estimate of d(v, t): how far is it from here to t?
• If we have such an estimate, we can often avoid searching in the wrong direction by making the search directed.
• Take some potential function 𝜋 and run Dijkstra’s algorithm with the modified edge weights ℓ𝜋(u, v) = ℓ(u, v) − 𝜋(u) + 𝜋(v).
• For every edge (u, v), the new length ℓ𝜋(u, v) must be non-negative; such a 𝜋 is called feasible.
• Typically 𝜋(v) is a lower bound on d(v, t).
• A* is a directed search algorithm based on Dijkstra and potential functions.
• Running Dijkstra with the potential 𝜋 to find the shortest path is exactly the A* search algorithm, described in the following figure.
• On a real map, a path from v to t cannot be shorter than the straight line segment from v to t.
• Hence, for each v, we can compute the potential as the Euclidean distance 𝜋(v) = dE(v, t), which gives a lower bound on d(v, t).
• Worst case: 𝜋(v) = 0 for all v, which is the same as Dijkstra.
• Best case: 𝜋(v) = d(v, t) for all v; then ℓ𝜋(u, v) = 0 iff (u, v) is on a shortest path to t, so the search visits only the edges of shortest s − t paths.
• It can be shown that the tighter the lower bounds, the fewer vertices will be scanned.
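Putting the bullet points together, here is a compact A* sketch with the Euclidean potential. The representation (adjacency lists adj, parallel edge-cost lists cost, node coordinates xy) is an illustrative assumption:

```python
import heapq
import math

def astar(adj, cost, xy, s, t):
    # potential: Euclidean distance from v to the target t
    def pi(v):
        (x1, y1), (x2, y2) = xy[v], xy[t]
        return math.hypot(x1 - x2, y1 - y2)
    inf = float('inf')
    d = [inf] * len(adj)   # true (unmodified) distances from s
    d[s] = 0
    visited = [False] * len(adj)
    h = [(pi(s), s)]       # heap priority = d[v] + pi(v)
    while h:
        _, u = heapq.heappop(h)
        if u == t:
            return d[t]
        if visited[u]:
            continue
        visited[u] = True
        for i, v in enumerate(adj[u]):
            if d[u] + cost[u][i] < d[v]:
                d[v] = d[u] + cost[u][i]
                heapq.heappush(h, (d[v] + pi(v), v))
    return -1              # t unreachable from s
```

Note that adding 𝜋(v) to the priority is equivalent to running Dijkstra on the reweighted edges ℓ𝜋, since the constant −𝜋(s) does not change the order in which nodes are popped.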

The following animations show how the A* search algorithm works on road networks (and the corresponding graph representations). Each node in the graph has the corresponding (distance, potential) pairs.

The following animations show how the shortest path is computed between two nodes for the USA road network for the San Francisco Bay area (the corresponding graph containing 321270 nodes and 800172 edges), using Dijkstra, bidirectional Dijkstra and A* algorithms respectively, the data can be downloaded from 9th DIMACS Implementation Challenge — Shortest Paths. Here, with bidirectional Dijkstra, the nodes from forward and reverse priority queues are popped in the same iteration.

# Implementing a few Advanced Algorithms with Python

In this blog we shall discuss about a few advanced algorithms and their python implementations (from scratch). The problems discussed here appeared as programming assignments in the coursera course Advanced Algorithms and Complexity and on Rosalind. The problem statements are taken from the course itself.

# Ford-Fulkerson / Edmonds-Karp Algorithm for Max-flow

In this problem, we shall apply an algorithm for finding maximum flow in a network to determine how fast people can be evacuated from the given city.
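The Edmonds-Karp variant (BFS for the shortest augmenting path in the residual graph) can be sketched as follows. The dense capacity-matrix representation is an illustrative simplification chosen for brevity:

```python
from collections import deque

def max_flow(n, edges, s, t):
    # build a capacity matrix from (u, v, capacity) triples
    cap = [[0] * n for _ in range(n)]
    for u, v, c in edges:
        cap[u][v] += c
    flow = 0
    while True:
        # BFS for the shortest augmenting path in the residual graph
        parent = [None] * n
        parent[s] = s
        q = deque([s])
        while q and parent[t] is None:
            u = q.popleft()
            for v in range(n):
                if cap[u][v] > 0 and parent[v] is None:
                    parent[v] = u
                    q.append(v)
        if parent[t] is None:   # no augmenting path left: flow is maximum
            return flow
        # find the bottleneck capacity along the path
        v, bottleneck = t, float('inf')
        while v != s:
            u = parent[v]
            bottleneck = min(bottleneck, cap[u][v])
            v = u
        # augment: push bottleneck units of flow, update residual capacities
        v = t
        while v != s:
            u = parent[v]
            cap[u][v] -= bottleneck
            cap[v][u] += bottleneck
            v = u
        flow += bottleneck
```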

The following animations show how the algorithm works. The red nodes represent the source and the sink in the network.

# Maximum Matching in a Bipartite Graph

In this problem, we shall apply an algorithm for finding maximum matching in a bipartite graph to assign airline crews to flights in the most efficient way. Later we shall see how an Integer Linear Program (ILP) can be formulated to solve the bipartite matching problem and then solved with a MIP solver.

# With Max-flow

The following animations show how the algorithm finds the maximum matching for a few bipartite graphs (the blue and green nodes represent the two disjoint vertex sets of the graph):

# With Integer Linear Program

Using the mip package to solve the integer program and using a binary cost matrix for the input bipartite graph G (where c[i,j]=1 iff i ∈ A and j ∈ B: A, B being disjoint vertex sets), the maximum bipartite matching problem can be solved with an ILP solver as shown in the following code snippet:

from mip import Model, xsum, maximize, BINARY

def mip_bipartite_matching(c):

    n, m = len(c), len(c[0])
    A, B = list(range(n)), list(range(m))

    model = Model()
    x = [[model.add_var(var_type=BINARY) for j in B] for i in A]

    model.objective = maximize(xsum(c[i][j]*x[i][j] \
                               for i in A for j in B))
    # every vertex in A is matched to at most one vertex in B
    for i in A:
        model += xsum(x[i][j] for j in B) <= 1
    # every vertex in B is matched to at most one vertex in A
    for j in B:
        model += xsum(x[i][j] for i in A) <= 1

    model.optimize()
    print(model.objective_value)
    return [(i,j) for i in A for j in B if x[i][j].x >= 0.99]

The ILP for the maximum bipartite matching problem can also be represented as follows in terms of the incidence matrix:

where the incidence matrix can be represented as follows, for the above bipartite graph:

The above matrix can be shown to be totally unimodular (i.e., the determinant of every square submatrix is either 0, -1 or 1; e.g., out of the 215 square submatrices of the above incidence matrix, 206 have determinant 0, 8 have determinant 1 and 1 has determinant -1), implying that every basic feasible solution (and hence the optimal solution too) of the LP relaxation will have integer values for every variable.
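Total unimodularity can be sanity-checked by brute force on a small instance: enumerate every square submatrix and verify that its determinant is -1, 0 or 1. The sketch below uses a tiny hypothetical bipartite graph with parts {a1, a2} and {b1, b2} and edges (a1,b1), (a1,b2), (a2,b2), not the graph from the figure:

```python
from itertools import combinations

def det(M):
    # integer determinant via Laplace expansion (fine for tiny matrices)
    if len(M) == 1:
        return M[0][0]
    return sum((-1) ** j * M[0][j] *
               det([row[:j] + row[j + 1:] for row in M[1:]])
               for j in range(len(M)))

def is_totally_unimodular(M):
    # every square submatrix must have determinant -1, 0 or 1
    r, c = len(M), len(M[0])
    for k in range(1, min(r, c) + 1):
        for rows in combinations(range(r), k):
            for cols in combinations(range(c), k):
                sub = [[M[i][j] for j in cols] for i in rows]
                if det(sub) not in (-1, 0, 1):
                    return False
    return True

# rows = vertices (a1, a2, b1, b2), columns = edges
M = [[1, 1, 0],
     [0, 0, 1],
     [1, 0, 0],
     [0, 1, 1]]
```

The incidence matrix of any bipartite graph is totally unimodular, so the check passes for M, while a matrix such as [[1, 1], [1, -1]] (determinant -2) fails it.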

In comparison, the solution obtained by the max-flow algorithm for the same input bipartite graph is shown as follows:

# Solving 2-SAT problem and Integrated Circuit Design

VLSI or Very Large-Scale Integration is a process of creating an integrated circuit by combining thousands of transistors on a single chip. You want to design a single layer of an integrated circuit.

You know exactly what modules will be used in this layer, and which of them should be connected by wires. The wires will be all on the same layer, but they cannot intersect with each other. Also, each wire can only be bent once, in one of two directions — to the left or to the right. If you connect two modules with a wire, selecting the direction of bending uniquely defines the position of the wire. Of course, some positions of some pairs of wires lead to intersection of the wires, which is forbidden. You need to determine a position for each wire in such a way that no wires intersect.

This problem can be reduced to 2-SAT problem — a special case of the SAT problem in which each clause contains exactly 2 variables. For each wire 𝑖, denote by 𝑥_𝑖 a binary variable which takes value 1 if the wire is bent to the right and 0 if the wire is bent to the left. For each 𝑖, 𝑥_𝑖 must be either 0 or 1.

Also, some pairs of wires intersect in some positions. For example, it could be so that if wire 1 is bent to the left and wire 2 is bent to the right, then they intersect. We want to write down a formula which is satisfied only if no wires intersect. In this case, we will add the clause (𝑥1 OR NOT 𝑥2) to the formula, which ensures that either 𝑥1 is true (the first wire is bent to the right) or NOT 𝑥2 is true (the second wire is bent to the left), and so the particular crossing when wire 1 is bent to the left AND wire 2 is bent to the right doesn’t happen whenever the formula is satisfied. We will add such a clause for each pair of wires and each pair of their positions in which they intersect. Of course, if some pair of wires intersects in every pair of possible positions, we won’t be able to design a circuit. Your task is to determine whether it is possible, and if yes, determine the direction of bending for each of the wires.

The input represents a 2-CNF formula. We need to find a satisfying assignment to the formula if one exists, or report that the formula is not satisfiable otherwise.
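The standard linear-time approach builds the implication graph over the literals and checks its strongly connected components. A minimal Kosaraju-based sketch (variables numbered 1..n, with a negative literal denoting a negated variable; the function name is illustrative):

```python
def solve_2sat(n, clauses):
    # implication graph over 2n literal nodes:
    # variable i (1-based) -> node 2(i-1), its negation -> node 2(i-1)+1
    def lit(x):
        return 2 * (abs(x) - 1) + (1 if x < 0 else 0)
    adj = [[] for _ in range(2 * n)]
    radj = [[] for _ in range(2 * n)]
    for a, b in clauses:
        # clause (a OR b) means NOT a -> b and NOT b -> a
        adj[lit(-a)].append(lit(b))
        adj[lit(-b)].append(lit(a))
        radj[lit(b)].append(lit(-a))
        radj[lit(a)].append(lit(-b))
    # Kosaraju pass 1: record finish order on the forward graph
    order, seen = [], [False] * (2 * n)
    def dfs1(u):
        seen[u] = True
        for v in adj[u]:
            if not seen[v]:
                dfs1(v)
        order.append(u)
    for u in range(2 * n):
        if not seen[u]:
            dfs1(u)
    # pass 2: label SCCs (in topological order of the condensation)
    comp = [None] * (2 * n)
    def dfs2(u, c):
        comp[u] = c
        for v in radj[u]:
            if comp[v] is None:
                dfs2(v, c)
    c = 0
    for u in reversed(order):
        if comp[u] is None:
            dfs2(u, c)
            c += 1
    # unsatisfiable iff some x and NOT x fall in the same SCC;
    # otherwise set x True iff its SCC comes later than that of NOT x
    if any(comp[2 * i] == comp[2 * i + 1] for i in range(n)):
        return None
    return [comp[2 * i] > comp[2 * i + 1] for i in range(n)]
```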

The following animations show how the algorithm works.

# Computing the Maximum Independent Set in a Tree

The following animations show how the dynamic programming algorithm computes the DP table (initialized to -1 for all nodes, as shown) and the corresponding maximum weighted independent sets for the subproblems and uses them to compute the MWIS of the entire tree. The pink nodes represent the nodes belonging to the maximum weighted independent set for a given subproblem (the root of the corresponding subtree is marked with double circles) and also for the final problem.
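The recurrence behind the animation is: for each node, either skip it and combine the best sets of its children, or take it together with the best sets of its grandchildren. A minimal sketch, assuming the tree is given as adjacency lists adj with per-node weights, with the DP table D initialized to -1 as in the animation:

```python
import sys
sys.setrecursionlimit(10**6)

def max_weight_independent_set(adj, weights, root=0):
    # D[v] = weight of a maximum weighted independent set of the
    # subtree rooted at v, computed on demand (memoized)
    D = [-1] * len(adj)
    def solve(v, parent):
        if D[v] != -1:
            return D[v]
        # either skip v and take the best sets of its children ...
        skip = sum(solve(c, v) for c in adj[v] if c != parent)
        # ... or take v together with the best sets of its grandchildren
        take = weights[v]
        for c in adj[v]:
            if c != parent:
                take += sum(solve(g, c) for g in adj[c] if g != v)
        D[v] = max(skip, take)
        return D[v]
    return solve(root, -1)
```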

# Computing the Convex Hull for a set of points in 2D

In this problem we shall implement Jarvis’ March gift wrapping algorithm to compute the convex hull for a given set of 2D points.

The following code snippet shows a python implementation of the algorithm. The points are assumed to be stored as list of (x,y) tuples.

def cross_prod(p0, p1, p2):
    # z-component of the cross product (p1 - p0) x (p2 - p0)
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - \
           (p1[1] - p0[1]) * (p2[0] - p0[0])

def left_turn(p0, p1, p2):
    return cross_prod(p0, p1, p2) > 0

def convex_hull(points):
    n = len(points)
    # start from the leftmost (and lowest, in case of a tie) point
    _, p0 = min([((x, y), i) for i, (x, y) in enumerate(points)])
    start_index = p0
    chull = []
    while True:
        chull.append(p0)
        p1 = (p0 + 1) % n  # make sure p1 != p0
        for p2 in range(n):
            # if p2 is more counter-clockwise than
            # current p1, then update p1
            if left_turn(points[p0], points[p1], points[p2]):
                p1 = p2
        p0 = p1
        # came back to the first point?
        if p0 == start_index:
            break
    return chull

The following animations show how the algorithm works.

The following animation shows a worst case for the algorithm, where all the given points lie on the convex hull (since the running time is O(nh) for h hull points, this degenerates to quadratic time).

# Solving a few AI problems with Python

In this blog we shall discuss about a few problems in artificial intelligence and their python implementations. The problems discussed here appeared as programming assignments in the edX course CS50’s Introduction to Artificial Intelligence with Python (HarvardX:CS50 AI). The problem statements are taken from the course itself.

## Degrees

Write a program that determines how many “degrees of separation” apart two actors are.

According to the Six Degrees of Kevin Bacon game, anyone in the Hollywood film industry can be connected to Kevin Bacon within six steps, where each step consists of finding a film that two actors both starred in.

• In this problem, we’re interested in finding the shortest path between any two actors by choosing a sequence of movies that connects them. For example, the shortest path between Jennifer Lawrence and Tom Hanks is 2: Jennifer Lawrence is connected to Kevin Bacon by both starring in “X-Men: First Class,” and Kevin Bacon is connected to Tom Hanks by both starring in “Apollo 13.”
• We can frame this as a search problem: our states are people. Our actions are movies, which take us from one actor to another (it’s true that a movie could take us to multiple different actors, but that’s okay for this problem).
• Our initial state and goal state are defined by the two people we’re trying to connect. By using breadth-first search, we can find the shortest path from one actor to another.

The following figure shows what small samples from the dataset look like:

In the csv file for the people, each person has a unique id, corresponding with their id in IMDb’s database. They also have a name, and a birth year.

• In the csv file for the movies, each movie also has a unique id, in addition to a title and the year in which the movie was released.
• The csv file for the stars establishes a relationship between the people and the movies from the corresponding files. Each row is a pair of a person_id value and movie_id value.
• Given two actor nodes in the graph, we need to find the distance (shortest path) between the nodes. We shall use BFS to find the distance, or the degree of separation. The next figure shows the basic concepts required to define a search problem in AI.

The next python code snippet implements BFS and returns the path between two nodes source and target in the given input graph.

def shortest_path(source, target):
    """
    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.
    If no possible path, returns None.
    """
    explored = set([])
    frontier = [source]
    parents = {}
    while len(frontier) > 0:
        person = frontier.pop(0)
        if person == target:
            break
        explored.add(person)
        for (m, p) in neighbors_for_person(person):
            if not p in frontier and not p in explored:
                frontier.append(p)
                parents[p] = (m, person)
    if not target in parents:
        return None
    # walk back through the parents to reconstruct the path
    path = []
    person = target
    while person != source:
        m, p = parents[person]
        path.append((m, person))
        person = p
    path = path[::-1]
    return path

The following animation shows how BFS finds the minimum degree(s) of separation (distance) between the query source and target actor nodes (green colored). The shortest path found between the actor nodes (formed of the edges corresponding to the movies a pair of actors acted together in) is colored in red.

The following shows the distribution of the degrees of separation between the actors.

## Tic-Tac-Toe

Using Minimax, implement an AI to play Tic-Tac-Toe optimally.

• For this problem, the board is represented as a list of three lists (representing the three rows of the board), where each internal list contains three values that are either X, O, or EMPTY.
• Implement a player function that should take a board state as input, and return which player’s turn it is (either X or O).
• In the initial game state, X gets the first move. Subsequently, the player alternates with each additional move.
• Any return value is acceptable if a terminal board is provided as input (i.e., the game is already over).
• Implement an action function that should return a set of all of the possible actions that can be taken on a given board.
• Each action should be represented as a tuple (i, j) where i corresponds to the row of the move (0, 1, or 2) and j corresponds to which cell in the row corresponds to the move (also 0, 1, or 2).
• Possible moves are any cells on the board that do not already have an X or an O in them.
• Any return value is acceptable if a terminal board is provided as input.
• Implement a result function that takes a board and an action as input, and should return a new board state, without modifying the original board.
• If action is not a valid action for the board, your program should raise an exception.
• The returned board state should be the board that would result from taking the original input board, and letting the player whose turn it is make their move at the cell indicated by the input action.
• Importantly, the original board should be left unmodified: since Minimax will ultimately require considering many different board states during its computation. This means that simply updating a cell in board itself is not a correct implementation of the result function. You’ll likely want to make a deep copy of the board first before making any changes.
• Implement a winner function that should accept a board as input, and return the winner of the board if there is one.
• If the X player has won the game, your function should return X. If the O player has won the game, your function should return O.
• One can win the game with three of their moves in a row horizontally, vertically, or diagonally.
• You may assume that there will be at most one winner (that is, no board will ever have both players with three-in-a-row, since that would be an invalid board state).
• If there is no winner of the game (either because the game is in progress, or because it ended in a tie), the function should return None.
• Implement a terminal function that should accept a board as input, and return a boolean value indicating whether the game is over.
• If the game is over, either because someone has won the game or because all cells have been filled without anyone winning, the function should return True.
• Otherwise, the function should return False if the game is still in progress.
• Implement a utility function that should accept a terminal board as input and output the utility of the board.
• If X has won the game, the utility is 1. If O has won the game, the utility is -1. If the game has ended in a tie, the utility is 0.
• You may assume utility will only be called on a board if terminal(board) is True.
• Implement a minimax function that should take a board as input, and return the optimal move for the player to move on that board.
• The move returned should be the optimal action (i, j) that is one of the allowable actions on the board. If multiple moves are equally optimal, any of those moves is acceptable.
• If the board is a terminal board, the minimax function should return None.
• For all functions that accept a board as input, you may assume that it is a valid board (namely, that it is a list that contains three rows, each with three values of either X, O, or EMPTY).
• Since Tic-Tac-Toe is a tie given optimal play by both sides, you should never be able to beat the AI (though if you don’t play optimally as well, it may beat you!)
• The following figure demonstrates the basic concepts of adversarial search and Game and defines the different functions for tic-tac-toe.
• Here we shall use Minimax with alpha-beta pruning to speedup the game when it is computer’s turn.

The following python code fragment shows how to implement the minimax algorithm with alpha-beta pruning:

def max_value_alpha_beta(board, alpha, beta):
    if terminal(board):
        return utility(board), None
    v, a = -np.inf, None
    for action in actions(board):
        m, _ = min_value_alpha_beta(result(board, action),
                                    alpha, beta)
        if m > v:
            v, a = m, action
        alpha = max(alpha, v)
        if alpha >= beta:
            break
    return (v, a)

def min_value_alpha_beta(board, alpha, beta):
    if terminal(board):
        return utility(board), None
    v, a = np.inf, None
    for action in actions(board):
        m, _ = max_value_alpha_beta(result(board, action),
                                    alpha, beta)
        if m < v:
            v, a = m, action
        beta = min(beta, v)
        if alpha >= beta:
            break
    return (v, a)

def minimax(board):
    """
    Returns the optimal action for the current player on the board.
    """
    if terminal(board):
        return None
    cur_player = player(board)
    if cur_player == X:
        _, a = max_value_alpha_beta(board, -np.inf, np.inf)
    elif cur_player == O:
        _, a = min_value_alpha_beta(board, -np.inf, np.inf)
    return a

The following sequence of actions by the human player (playing as X, the Max player) and the computer (playing as O, the Min player) shows what the computer evaluates to come up with the optimal position for O, along with the game tree it produces using the Minimax with alpha-beta pruning algorithm shown in the above figure.

The following animations show how the game tree is created when the computer thinks for the above turn. The row above the game board denotes the value of the utility function, and it is color-coded: green when the value of the corresponding state is +1 (in favor of the Max player), red when the value is -1 (in favor of the Min player), and gray when the value is 0.

Notice the game tree is full-grown to the terminal leaves. The following figure shows a path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O.

Now let’s consider the next sequence of actions by the human player and the computer. As can be seen, the computer again finds the optimal position.

The following animation shows how the game tree is created when the computer thinks for the above turn.

The following figure shows a path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O (note that the optimal path corresponds to a tie in a terminal node).

Finally, let’s consider the next sequence of actions by the human player and the computer. As can be seen, the computer again finds the optimal position.

The following animations show how the game tree is created when the computer thinks for the above turn.

The next figure shows the path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O (note that the optimal path corresponds to a tie in a terminal node).

Now let’s consider another example game. The following sequence of actions by the computer (the AI agent now playing as X, the Max player) and the human player (playing as O, the Min player) shows what the computer evaluates to come up with the optimal position for X, along with the game tree it produces using the Minimax with alpha-beta pruning algorithm.

The following animations show how the game tree is created when the computer thinks for the above turn.

The next figure shows the path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the next position of X (note that the optimal path corresponds to a win in a terminal node).

Note from the above tree that no matter what action the human player chooses next, the AI agent (the computer) can always win the game. An example final outcome of the game can be seen in the next game tree.

## Knights

Write a program to solve logic puzzles.

In 1978, logician Raymond Smullyan published “What is the name of this book?”, a book of logical puzzles. Among the puzzles in the book were a class of puzzles that Smullyan called “Knights and Knaves” puzzles.

• In a Knights and Knaves puzzle, the following information is given: Each character is either a knight or a knave.
• A knight will always tell the truth: if a knight states a sentence, then that sentence is true.
• Conversely, a knave will always lie: if a knave states a sentence, then that sentence is false.
• The objective of the puzzle is, given a set of sentences spoken by each of the characters, determine, for each character, whether that character is a knight or a knave.

The task in this problem is to determine how to represent these puzzles using propositional logic, such that an AI running a model-checking algorithm could solve these puzzles.

Puzzles

1. Contains a single character: A. – A says “I am both a knight and a knave.”
2. Has two characters: A and B. – A says “We are both knaves.” – B says nothing.
3. Has two characters: A and B. – A says “We are the same kind.” – B says “We are of different kinds.”
4. Has three characters: A, B, and C. – A says either “I am a knight.” or “I am a knave.”, but you don’t know which. – B says “A said ‘I am a knave.’” – B then says “C is a knave.” – C says “A is a knight.”
• In this problem, we shall use model checking to find the solutions to the above puzzles. The following figure shows the relevant theory.
• The following python code snippet shows the base class corresponding to a propositional logic sentence (taken from the code provided for the assignment).
import itertools

class Sentence():

    def evaluate(self, model):
        """Evaluates the logical sentence."""
        raise Exception("nothing to evaluate")

    def formula(self):
        """Returns string formula representing logical sentence."""
        return ""

    def symbols(self):
        """Returns a set of all symbols in the logical sentence."""
        return set()

    @classmethod
    def validate(cls, sentence):
        if not isinstance(sentence, Sentence):
            raise TypeError("must be a logical sentence")
• Now, a propositional symbol and an operator (unary / binary) can inherit this class and implement the methods specific to the symbol as shown for the boolean operator AND (the code taken from the resources provided for the assignment).
class And(Sentence):
    def __init__(self, *conjuncts):
        for conjunct in conjuncts:
            Sentence.validate(conjunct)
        self.conjuncts = list(conjuncts)

    def __eq__(self, other):
        return isinstance(other, And) and \
               self.conjuncts == other.conjuncts

    def __hash__(self):
        return hash(
            ("and", tuple(hash(conjunct) for conjunct in
                          self.conjuncts))
        )

    def __repr__(self):
        conjunctions = ", ".join(
            [str(conjunct) for conjunct in self.conjuncts]
        )
        return f"And({conjunctions})"

    def add(self, conjunct):
        Sentence.validate(conjunct)
        self.conjuncts.append(conjunct)

    def evaluate(self, model):
        return all(conjunct.evaluate(model) for conjunct in
                   self.conjuncts)

    def formula(self):
        if len(self.conjuncts) == 1:
            return self.conjuncts[0].formula()
        return " ∧ ".join([Sentence.parenthesize(conjunct.formula())
                           for conjunct in self.conjuncts])

    def symbols(self):
        return set.union(*[conjunct.symbols() \
                           for conjunct in self.conjuncts])
• Finally, we need to represent the KB for every puzzle, as the one shown in the code below:
AKnight = Symbol("A is a Knight")
AKnave = Symbol("A is a Knave")
BKnight = Symbol("B is a Knight")
BKnave = Symbol("B is a Knave")
CKnight = Symbol("C is a Knight")
CKnave = Symbol("C is a Knave")

# Puzzle 1
# A says "I am both a knight and a knave."
knowledge0 = And(
    # A is either a knight or a knave, but not both
    Or(And(AKnight, Not(AKnave)), And(Not(AKnight), AKnave)),
    Implication(AKnight, And(AKnight, AKnave)),
    Implication(AKnave, Not(And(AKnight, AKnave)))
)

The following section demonstrates the outputs obtained with model checking implemented using the above code: we iteratively assign all possible truth values to the symbols, check whether the KB corresponding to a puzzle evaluates to true under each such model, and output all the models for which it does. To start solving a puzzle, we must first define its KB, as shown above.
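The model-checking loop itself is short. A minimal sketch, assuming only the Sentence interface above (symbols() returning a set of symbol names and evaluate(model) taking a dict of truth values); the function name is illustrative:

```python
from itertools import product

def satisfying_models(knowledge):
    # enumerate all 2^k truth assignments over the KB's symbols
    # and keep those under which the KB evaluates to true
    symbols = sorted(knowledge.symbols())
    models = []
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if knowledge.evaluate(model):
            models.append(model)
    return models
```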

• The following figure shows the KB for Puzzle 1, presented as a propositional logic sentence and drawn as an expression tree.
• The following animation shows how model-checking solves Puzzle 1 (the solution is marked in green).
• The following figure shows the KB as an expression tree for Puzzle 2.
• The following animation shows how model-checking solves Puzzle 2 (the solution is marked in green).
• The following figure shows the KB as an expression tree for Puzzle 3.
• The following animation shows how model-checking solves Puzzle 3 (the solution is marked in green).
• The following figure shows the KB as an expression tree for Puzzle 4.
• The following animation shows how model-checking solves Puzzle 4 (the solution is marked in green).

## Minesweeper

Write an AI agent to play Minesweeper.

• Minesweeper is a puzzle game that consists of a grid of cells, where some of the cells contain hidden “mines.”
• Clicking on a cell that contains a mine detonates the mine, and causes the user to lose the game.
• Clicking on a “safe” cell (i.e., a cell that does not contain a mine) reveals a number that indicates how many neighboring cells — where a neighbor is a cell that is one square to the left, right, up, down, or diagonal from the given cell — contain a mine.
• The following figure shows an example Minesweeper game. In this 3 X 3 Minesweeper game, for example, the three 1 values indicate that each of those cells has one neighboring cell that is a mine. The four 0 values indicate that each of those cells has no neighboring mine.
• Given this information, a logical player could conclude that there must be a mine in the lower-right cell and that there is no mine in the upper-left cell, for only in that case would the numerical labels on each of the other cells be accurate.
• The goal of the game is to flag (i.e., identify) each of the mines.
• In many implementations of the game, the player can flag a mine by right-clicking on a cell (or two-finger clicking, depending on the computer).

Propositional Logic

• Your goal in this project will be to build an AI that can play Minesweeper. Recall that knowledge-based agents make decisions by considering their knowledge base, and making inferences based on that knowledge.
• One way we could represent an AI’s knowledge about a Minesweeper game is by making each cell a propositional variable that is true if the cell contains a mine, and false otherwise.
• What information does the AI agent have access to? Well, the AI would know every time a safe cell is clicked on and would get to see the number for that cell.
• Consider the following Minesweeper board, where the middle cell has been revealed, and the other cells have been labeled with an identifying letter for the sake of discussion.
• What information do we have now? It appears we now know that one of the eight neighboring cells is a mine. Therefore, we could write a logical expression like the following one to indicate that one of the neighboring cells is a mine: Or(A, B, C, D, E, F, G, H)
• But we actually know more than what this expression says. The above logical sentence expresses the idea that at least one of those eight variables is true. But we can make a stronger statement than that: we know that exactly one of the eight variables is true. This gives us a propositional logic sentence like the below.
Or(
And(A, Not(B), Not(C), Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), B, Not(C), Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), C, Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), D, Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), E, Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), F, Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), Not(F), G, Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), Not(F), Not(G), H)
)
• That’s quite a complicated expression! And that’s just to express what it means for a cell to have a 1 in it. If a cell has a 2 or 3 or some other value, the expression could be even longer.
• Trying to perform model checking on this type of problem, too, would quickly become intractable: on an 8 X 8 grid, the size Microsoft uses for its Beginner level, we’d have 64 variables, and therefore 2⁶⁴ possible models to check — far too many for a computer to compute in any reasonable amount of time. We need a better representation of knowledge for this problem.

Knowledge Representation

• Instead, we’ll represent each sentence of our AI’s knowledge as below: {A, B, C, D, E, F, G, H} = 1
• Every logical sentence in this representation has two parts: a set of cells on the board that are involved in the sentence, and a number count, representing how many of those cells are mines. The above logical sentence says that out of cells A, B, C, D, E, F, G, and H, exactly 1 of them is a mine.
• Why is this a useful representation? In part, it lends itself well to certain types of inference. Consider the game below:
• Using the knowledge from the lower-left number, we could construct the sentence {D, E, G} = 0 to mean that out of cells D, E, and G, exactly 0 of them are mines.
• Intuitively, we can infer from that sentence that all of the cells must be safe. By extension, any time we have a sentence whose count is 0, we know that all of that sentence’s cells must be safe.
• Similarly, consider the game below.
• Our AI agent would construct the sentence {E, F, H} = 3. Intuitively, we can infer that all of E, F, and H are mines.
• More generally, any time the number of cells is equal to the count, we know that all of that sentence’s cells must be mines.
• In general, we’ll only want our sentences to be about cells that are not yet known to be either safe or mines. This means that, once we know whether a cell is a mine or not, we can update our sentences to simplify them and potentially draw new conclusions.
• For example, if our AI agent knew the sentence {A, B, C} = 2, we don’t yet have enough information to conclude anything. But if we were told that C were safe, we could remove C from the sentence altogether, leaving us with the sentence {A, B} = 2 (which, incidentally, does let us draw some new conclusions).
• Likewise, if our AI agent knew the sentence {A, B, C} = 2, and we were told that C is a mine, we could remove C from the sentence and decrease the value of count (since C was a mine that contributed to that count), giving us the sentence {A, B} = 1. This is logical: if two out of A, B, and C are mines, and we know that C is a mine, then it must be the case that out of A and B, exactly one of them is a mine.
• If we’re being even more clever, there’s one final type of inference we can do.
• Consider just the two sentences our AI agent would know based on the top middle cell and the bottom middle cell. From the top middle cell, we have {A, B, C} = 1. From the bottom middle cell, we have {A, B, C, D, E} = 2. Logically, we could then infer a new piece of knowledge, that {D, E} = 1. After all, if two of A, B, C, D, and E are mines, and only one of A, B, and C are mines, then it stands to reason that exactly one of D and E must be the other mine.
• More generally, any time we have two sentences set1 = count1 and set2 = count2 where set1 is a subset of set2, then we can construct the new sentence set2 − set1 = count2 − count1. Consider the example above to ensure you understand why that’s true.
• So using this method of representing knowledge, we can write an AI agent that can gather knowledge about the Minesweeper board, and hopefully select cells it knows to be safe!
• The following figure shows examples of inference rules and resolution by inference relevant for this problem:
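The representation and inference rules above can be sketched with a small toy class (named MineSentence here to avoid clashing with the assignment's Sentence class; all names are illustrative):

```python
class MineSentence:
    # "exactly `count` of `cells` are mines"
    def __init__(self, cells, count):
        self.cells, self.count = set(cells), count

    def known_mines(self):
        # all cells are mines when the count equals the number of cells
        return set(self.cells) if len(self.cells) == self.count else set()

    def known_safes(self):
        # all cells are safe when the count is zero
        return set(self.cells) if self.count == 0 else set()

    def mark_mine(self, cell):
        # drop a known mine from the sentence and decrease the count
        if cell in self.cells:
            self.cells.remove(cell)
            self.count -= 1

    def mark_safe(self, cell):
        # drop a known safe cell; the count is unchanged
        if cell in self.cells:
            self.cells.remove(cell)

def infer_subset(s1, s2):
    # set1 = count1 and set2 = count2 with set1 a proper subset of set2
    # imply the new sentence: set2 - set1 = count2 - count1
    if s1.cells < s2.cells:
        return MineSentence(s2.cells - s1.cells, s2.count - s1.count)
    return None
```

For instance, combining {A, B, C} = 1 with {A, B, C, D, E} = 2 via infer_subset yields the sentence {D, E} = 1 from the example above.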

A few implementation tips

• Represent a logical sentence by a python class Sentence again.
• Each sentence has a set of cells within it and a count of how many of those cells are mines.
• Let the class also contain functions known_mines and known_safes for determining if any of the cells in the sentence are known to be mines or known to be safe.
• It should also have member functions mark_mine() and mark_safe() to update a sentence in response to new information about a cell.
• Each cell is a pair (i, j) where i is the row number (ranging from 0 to height - 1) and j is the column number (ranging from 0 to width - 1).
• Implement a known_mines() method that should return a set of all of the cells in self.cells that are known to be mines.
• Implement a known_safes() method that should return a set of all the cells in self.cells that are known to be safe.
• Implement a mark_mine() method that should first check if cell is one of the cells included in the sentence. If cell is in the sentence, the function should update the sentence so that cell is no longer in the sentence, but still represents a logically correct sentence given that cell is known to be a mine.
• Likewise mark_safe() method should first check if cell is one of the cells included in the sentence. If yes, then it should update the sentence so that cell is no longer in the sentence, but still represents a logically correct sentence given that cell is known to be safe.
• Implement a method add_knowledge() that should accept a cell and its count , and perform the following:
• The function should mark the cell as one of the moves made in the game.
• The function should mark the cell as a safe cell, updating any sentences that contain the cell as well.
• The function should add a new sentence to the AI’s knowledge base, based on the value of cell and count , to indicate that count of the cell ’s neighbors are mines. Be sure to only include cells whose state is still undetermined in the sentence.
• If, based on any of the sentences in self.knowledge , new cells can be marked as safe or as mines, then the function should do so.
• If, based on any of the sentences in self.knowledge , new sentences can be inferred , then those sentences should be added to the knowledge base as well.
• Implement a method make_safe_move() that should return a move (i, j) that is known to be safe.
• The move returned must be known to be safe, and not a move already made.
• If no safe move can be guaranteed, the function should return None .
• Implement a method make_random_move() that should return a random move (i, j) .
• This function will be called if a safe move is not possible: if the AI agent doesn’t know where to move, it will choose to move randomly instead.
• The move must not be a move that has already been made.
• The move must not be a move that is known to be a mine.
• If no such moves are possible, the function should return None .
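Putting the tips together, the Sentence class might be sketched as follows (a minimal sketch; method details may differ from the course’s distribution code):

```python
class Sentence():
    """A logical statement about a Minesweeper game: a set of board
    cells, and a count of how many of those cells are mines."""

    def __init__(self, cells, count):
        self.cells = set(cells)
        self.count = count

    def __eq__(self, other):
        return self.cells == other.cells and self.count == other.count

    def known_mines(self):
        # If the count equals the number of cells, every cell is a mine.
        return set(self.cells) if len(self.cells) == self.count else set()

    def known_safes(self):
        # If the count is zero, every cell is safe.
        return set(self.cells) if self.count == 0 else set()

    def mark_mine(self, cell):
        # Remove the cell and decrement the count: one mine accounted for.
        if cell in self.cells:
            self.cells.remove(cell)
            self.count -= 1

    def mark_safe(self, cell):
        # Remove the cell; the mine count is unchanged.
        if cell in self.cells:
            self.cells.remove(cell)
```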

The following python code snippet shows implementation of few of the above functions:

class MinesweeperAI():
    """
    Minesweeper game player
    """

    def mark_mine(self, cell):
        """
        Marks a cell as a mine, and updates all knowledge
        to mark that cell as a mine as well.
        """
        for sentence in self.knowledge:
            sentence.mark_mine(cell)

    def knowledge_contains(self, sentence):
        # ensure that no duplicate sentences are added
        for s in self.knowledge:
            if s == sentence:
                return True
        return False

    def add_knowledge(self, cell, count):
        """
        Called when the Minesweeper board tells us, for a given
        safe cell, how many neighboring cells have mines in them.
        This function should:
            1) mark the cell as a move that has been made
            2) mark the cell as safe
            3) add a new sentence to the AI's knowledge base
               based on the value of cell and count
            4) mark any additional cells as safe or as mines
               if it can be concluded based on the AI's KB
            5) add any new sentences to the AI's knowledge base
               if they can be inferred from existing knowledge
        """
        # mark the cell as a move that has been made
        self.moves_made.add(cell)
        # mark the cell as safe
        self.mark_safe(cell)
        # add a new sentence to the AI's knowledge base,
        # based on the value of cell and count
        i, j = cell
        cells = []
        for row in range(max(i - 1, 0), min(i + 2, self.height)):
            for col in range(max(j - 1, 0), min(j + 2, self.width)):
                # if some mines in the neighbors are already known,
                # make sure to decrement the count
                if (row, col) in self.mines:
                    count -= 1
                if (row, col) not in self.safes and \
                   (row, col) not in self.mines:
                    cells.append((row, col))
        sentence = Sentence(cells, count)
        # add few more inference rules here

    def make_safe_move(self):
        """
        Returns a safe cell to choose on the Minesweeper board.
        The move must be known to be safe, and not already a move.
        This function may use the knowledge in self.mines,
        self.safes and self.moves_made, but should not modify any of
        those values.
        """
        safe_moves = self.safes - self.moves_made
        if len(safe_moves) > 0:
            return safe_moves.pop()
        return None
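The make_random_move() method described in the tips might be sketched as follows (a sketch assuming the self.height, self.width, self.moves_made and self.mines attributes named above):

```python
import random

def make_random_move(self):
    """
    Returns a random move (i, j) on the Minesweeper board.
    The move must not already have been made and must not be a
    known mine. Returns None if no such move exists.
    """
    candidates = [(i, j)
                  for i in range(self.height)
                  for j in range(self.width)
                  if (i, j) not in self.moves_made
                  and (i, j) not in self.mines]
    return random.choice(candidates) if candidates else None
```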

The following animations show how the Minesweeper AI agent updates safe / mine cells and updates its knowledge-base, for a few example games (the light green and red cells represent known safe and mine cells, respectively, in the 2nd subplot):

## PageRank

Write an AI agent to rank web pages by importance.

• When search engines like Google display search results, they do so by placing more “important” and higher-quality pages higher in the search results than less important pages. But how does the search engine know which pages are more important than other pages?
• One heuristic might be that an “important” page is one that many other pages link to, since it’s reasonable to imagine that more sites will link to a higher-quality webpage than a lower-quality webpage.
• We could therefore imagine a system where each page is given a rank according to the number of incoming links it has from other pages, and higher ranks would signal higher importance.
• But this definition isn’t perfect: if someone wants to make their page seem more important, then under this system, they could simply create many other pages that link to their desired page to artificially inflate its rank.
• For that reason, the PageRank algorithm was created by Google’s co-founders (including Larry Page, for whom the algorithm was named).
• In PageRank’s algorithm, a website is more important if it is linked to by other important websites, and links from less important websites have their links weighted less.
• This definition seems a bit circular, but it turns out that there are multiple strategies for calculating these rankings.

Random Surfer Model

One way to think about PageRank is with the random surfer model, which considers the behavior of a hypothetical surfer on the internet who clicks on links at random.

• The random surfer model imagines a surfer who starts with a web page at random, and then randomly chooses links to follow.
• A page’s PageRank, then, can be described as the probability that a random surfer is on that page at any given time. After all, if there are more links to a particular page, then it’s more likely that a random surfer will end up on that page.
• Moreover, a link from a more important site is more likely to be clicked on than a link from a less important site that fewer pages link to, so this model handles weighting links by their importance as well.
• One way to interpret this model is as a Markov Chain, where each page represents a state, and each page has a transition model that chooses among its links at random. At each time step, the state switches to one of the pages linked to by the current state.
• By sampling states randomly from the Markov Chain, we can get an estimate for each page’s PageRank. We can start by choosing a page at random, then keep following links at random, keeping track of how many times we’ve visited each page.
• After we’ve gathered all of our samples (based on a number we choose in advance), the proportion of the time we were on each page might be an estimate for that page’s rank.
• To ensure we can always get to somewhere else in the corpus of web pages, we’ll introduce to our model a damping factor d .
• With probability d (where d is usually set around 0.85 ), the random surfer will choose from one of the links on the current page at random. But otherwise (with probability 1 - d ), the random surfer chooses one out of all of the pages in the corpus at random (including the one they are currently on).
• Our random surfer now starts by choosing a page at random, and then, for each additional sample we’d like to generate, chooses a link from the current page at random with probability d , and chooses any page at random with probability 1 - d .
• If we keep track of how many times each page has shown up as a sample, we can treat the proportion of states that were on a given page as its PageRank.
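The per-page transition model that the random surfer follows can be sketched as below (a sketch; here corpus maps each page name to the set of pages it links to):

```python
def transition_model(corpus, page, damping_factor):
    """
    Return a probability distribution over which page a random surfer
    visits next, given the current page. With probability
    damping_factor, choose among the current page's links at random;
    with probability 1 - damping_factor, choose any page at random.
    """
    n = len(corpus)
    links = corpus[page]
    if not links:
        # A page with no outgoing links is treated as linking
        # to every page in the corpus (including itself).
        return {p: 1 / n for p in corpus}
    probs = {p: (1 - damping_factor) / n for p in corpus}
    for p in links:
        probs[p] += damping_factor / len(links)
    return probs
```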

Iterative Algorithm

• We can also define a page’s PageRank using a recursive mathematical expression. Let PR(p) be the PageRank of a given page p : the probability that a random surfer ends up on that page.
• How do we define PR(p)? Well, we know there are two ways that a random surfer could end up on the page: 1. With probability 1 - d , the surfer chose a page at random and ended up on page p . 2. With probability d , the surfer followed a link from a page i to page p .
• The first condition is fairly straightforward to express mathematically: it’s 1 - d divided by N , where N is the total number of pages across the entire corpus. This is because the 1 - d probability of choosing a page at random is split evenly among all N possible pages.
• For the second condition, we need to consider each possible page i that links to page p . For each of those incoming pages, let NumLinks(i) be the number of links on page i .
• Each page i that links to p has its own PageRank, PR(i) , representing the probability that we are on page i at any given time.
• And since from page i we travel to any of that page’s links with equal probability, we divide PR(i) by the number of links NumLinks(i) to get the probability that we were on page i and chose the link to page p.
• This gives us the following definition for the PageRank of a page p : PR(p) = (1 - d) / N + d * Σ PR(i) / NumLinks(i) , where the sum runs over all pages i that link to p .
• In this formula, d is the damping factor, N is the total number of pages in the corpus, i ranges over all pages that link to page p , and NumLinks(i) is the number of links present on page i .
• How would we go about calculating PageRank values for each page, then? We can do so via iteration: start by assuming the PageRank of every page is 1 / N (i.e., equally likely to be on any page).
• Then, use the above formula to calculate new PageRank values for each page, based on the previous PageRank values.
• If we keep repeating this process, calculating a new set of PageRank values for each page based on the previous set of PageRank values, eventually the PageRank values will converge (i.e., not change by more than a small threshold with each iteration).
• Now, let’s implement both such approaches for calculating PageRank — calculating both by sampling pages from a Markov Chain random surfer and by iteratively applying the PageRank formula.

The following python code snippet represents the implementation of PageRank computation with Monte-Carlo sampling.

import random

def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling n pages
    according to transition model, starting with a page at random.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    pages = [page for page in corpus]
    pageranks = {page: 0 for page in pages}
    page = random.choice(pages)
    for i in range(n):
        pageranks[page] += 1
        probs = transition_model(corpus, page, damping_factor)
        weights = [probs[p] for p in pages]  # keep the same page order
        page = random.choices(pages, weights=weights, k=1)[0]
    return {page: pageranks[page] / n for page in pageranks}
The following animation shows how PageRank changes with the sampling-based implementation.

Again, the following python code snippet represents the iterative PageRank implementation.

def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence. Return a dictionary where
    keys are page names, and values are their estimated PageRank value
    (a value between 0 and 1). All PageRank values should sum to 1.
    """
    incoming = {p: set() for p in corpus}
    for p in corpus:
        if corpus[p]:
            for q in corpus[p]:
                incoming[q].add(p)
        else:
            # a page with no links is treated as linking to every page
            for q in corpus:
                incoming[q].add(p)
    pages = [page for page in corpus]
    n = len(pages)
    pageranks = {page: 1 / n for page in pages}
    diff = float('inf')
    while diff > 0.001:
        diff = 0
        for page in pages:
            p = (1 - damping_factor) / n
            for q in incoming[page]:
                p += damping_factor * \
                     (pageranks[q] / len(corpus[q])
                      if len(corpus[q]) > 0 else pageranks[q] / n)
            diff = max(diff, abs(pageranks[page] - p))
            pageranks[page] = p
    # renormalize so that the values sum to 1
    return {p: pageranks[p] / sum(pageranks.values())
            for p in pageranks}

The following animation shows how the iterative PageRank formula is computed for a sample very small web graph:

## Heredity

Write an AI agent to assess the likelihood that a person will have a particular genetic trait.

• Mutated versions of the GJB2 gene are one of the leading causes of hearing impairment in newborns.
• Each person carries two versions of the gene, so each person has the potential to possess either 0, 1, or 2 copies of the hearing impairment version GJB2.
• Unless a person undergoes genetic testing, though, it’s not so easy to know how many copies of mutated GJB2 a person has.
• This is some “hidden state”: information that has an effect that we can observe (hearing impairment), but that we don’t necessarily directly know.
• After all, some people might have 1 or 2 copies of mutated GJB2 but not exhibit hearing impairment, while others might have no copies of mutated GJB2 yet still exhibit hearing impairment.
• Every child inherits one copy of the GJB2 gene from each of their parents. If a parent has two copies of the mutated gene, then they will pass the mutated gene on to the child; if a parent has no copies of the mutated gene, then they will not pass the mutated gene on to the child; and if a parent has one copy of the mutated gene, then the gene is passed on to the child with probability 0.5.
• After a gene is passed on, though, it has some probability of undergoing additional mutation: changing from a version of the gene that causes hearing impairment to a version that doesn’t, or vice versa.
• We can attempt to model all of these relationships by forming a Bayesian Network of all the relevant variables, as in the one below, which considers a family of two parents and a single child.
• Each person in the family has a Gene random variable representing how many copies of a particular gene (e.g., the hearing impairment version of GJB2) a person has: a value that is 0, 1, or 2.
• Each person in the family also has a Trait random variable, which is yes or no depending on whether that person expresses a trait (e.g., hearing impairment) based on that gene.
• There’s an arrow from each person’s Gene variable to their Trait variable to encode the idea that a person’s genes affect the probability that they have a particular trait.
• Meanwhile, there’s also an arrow from both the mother and father’s Gene random variable to their child’s Gene random variable: the child’s genes are dependent on the genes of their parents.
• The task in this problem is to use this model to make inferences about a population. Given information about people, who their parents are, and whether they have a particular observable trait (e.g. hearing loss) caused by a given gene, the AI agent will infer the probability distribution for each person’s genes, as well as the probability distribution for whether any person will exhibit the trait in question.
• The next figure shows the concepts required for running inference on a Bayes Network
• Recall that to compute a joint probability of multiple events, you can do so by multiplying those probabilities together. But remember that for any child, the probability of them having a certain number of genes is conditional on what genes their parents have.
• The following figure shows the given data files
• Implement a function joint_probability() that should take as input a dictionary of people, along with data about who has how many copies of each of the genes, and who exhibits the trait. The function should return the joint probability of all of those events taking place.
• Implement a function update() that adds a new joint distribution probability to the existing probability distributions in probabilities .
• Implement the function normalize() that updates a dictionary of probabilities such that each probability distribution is normalized.
• PROBS is a dictionary containing a number of constants representing probabilities of various different events. All of these events have to do with how many copies of a particular gene a person has , and whether a person exhibits a particular trait based on that gene.
• PROBS[“gene”] represents the unconditional probability distribution over the gene (i.e., the probability if we know nothing about that person’s parents).
• PROBS[“trait”] represents the conditional probability that a person exhibits a trait.
• PROBS[“mutation”] is the probability that a gene mutates from being the gene in question to not being that gene, and vice versa.
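The inheritance probabilities described above can be captured by a small helper (a hypothetical helper, not part of the provided starter code; PROBS_MUTATION stands in for PROBS["mutation"]):

```python
PROBS_MUTATION = 0.01  # stand-in for PROBS["mutation"] (assumed value)

def pass_prob(parent_copies, m=PROBS_MUTATION):
    """
    Probability that a parent with the given number of gene copies
    passes the mutated gene on to a child, accounting for mutation.
    """
    if parent_copies == 2:
        return 1 - m   # always passes it, unless it mutates away
    if parent_copies == 1:
        return 0.5     # 0.5 * (1 - m) + 0.5 * m = 0.5 exactly
    return m           # can only be passed on via mutation
```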

The following python code snippet shows how to compute the joint probabilities:

import numpy as np

def joint_probability(people, one_gene, two_genes, have_trait):
    """
    Compute and return a joint probability.
    The probability returned should be the probability that
        * everyone in set one_gene has one copy of the gene, and
        * everyone in set two_genes has two copies of the gene, and
        * everyone not in one_gene / two_genes doesn't have the gene,
        * everyone in set have_trait has the trait, and
        * everyone not in set have_trait does not have the trait.
    """
    prob_gene = PROBS['gene']
    prob_trait_gene = PROBS['trait']
    prob_mutation = PROBS['mutation']
    prob_gene_pgenes = { # CPT: P(child's copies | mother's, father's)
        2: {
            0: {
                0: (1-prob_mutation)*prob_mutation,
                1: (1-prob_mutation)*(1-prob_mutation) + \
                   prob_mutation*prob_mutation,
                2: prob_mutation*(1-prob_mutation)
            },
            1: {
                0: prob_mutation/2,
                1: 0.5,
                2: (1-prob_mutation)/2
            },
            2: {
                0: prob_mutation*prob_mutation,
                1: 2*(1-prob_mutation)*prob_mutation,
                2: (1-prob_mutation)*(1-prob_mutation)
            }
        },
        # compute the probabilities for the other cases...
    }

    all_probs = {}
    for name in people:
        has_trait = name in have_trait
        num_gene = 1 if name in one_gene else \
                   2 if name in two_genes else 0
        prob = prob_trait_gene[num_gene][has_trait] * \
               prob_gene[num_gene]
        if people[name]['mother'] != None:
            mother, father = people[name]['mother'], \
                             people[name]['father']
            num_gene_mother = 1 if mother in one_gene else \
                              2 if mother in two_genes else 0
            num_gene_father = 1 if father in one_gene else \
                              2 if father in two_genes else 0
            prob = prob_trait_gene[num_gene][has_trait] * \
                   prob_gene_pgenes[num_gene_mother][num_gene_father] \
                                   [num_gene]
        all_probs[name] = prob
    probability = np.prod(list(all_probs.values()))
    return probability
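The normalize() function from the requirements might be sketched as follows (a sketch assuming the nested {person: {"gene": {...}, "trait": {...}}} structure used in the project):

```python
def normalize(probabilities):
    """
    Update `probabilities` in place so that each distribution
    (per person, per field) sums to 1, with relative proportions
    unchanged.
    """
    for person in probabilities:
        for field in probabilities[person]:
            total = sum(probabilities[person][field].values())
            for value in probabilities[person][field]:
                probabilities[person][field][value] /= total
```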

The following animations show how the CPTs (conditional probability tables) are computed for the Bayesian Networks corresponding to the three families, respectively.

The following figure shows the conditional probabilities (computed with simulation) that a child does / doesn’t have the gene given that the parents do / don’t have the gene.

## Crossword

Write an AI agent to generate crossword puzzles.

• Given the structure of a crossword puzzle (i.e., which squares of the grid are meant to be filled in with a letter), and a list of words to use, the problem becomes one of choosing which words should go in each vertical or horizontal sequence of squares.
• We can model this sort of problem as a constraint satisfaction problem.
• Each sequence of squares is one variable, for which we need to decide on its value (which word in the domain of possible words will fill in that sequence). Consider the following crossword puzzle structure.
• In this structure, we have four variables, representing the four words we need to fill into this crossword puzzle (each indicated by a number in the above image). Each variable is defined by four values: the row it begins on (its i value), the column it begins on (its j value), the direction of the word (either down or across), and the length of the word.
• Variable 1, for example, would be a variable represented by a row of 1 (assuming 0 indexed counting from the top), a column of 1 (also assuming 0 indexed counting from the left), a direction of across, and a length of 4.
• As with many constraint satisfaction problems, these variables have both unary and binary constraints. The unary constraint on a variable is given by its length.
• For Variable 1, for instance, the value BYTE would satisfy the unary constraint, but the value BIT would not (it has the wrong number of letters).
• Any values that don’t satisfy a variable’s unary constraints can therefore be removed from the variable’s domain immediately.
• The binary constraints on a variable are given by its overlap with neighboring variables. Variable 1 has a single neighbor: Variable 2. Variable 2 has two neighbors: Variable 1 and Variable 3.
• For each pair of neighboring variables, those variables share an overlap: a single square that is common to them both.
• We can represent that overlap as the character index in each variable’s word that must be the same character.
• For example, the overlap between Variable 1 and Variable 2 might be represented as the pair (1, 0), meaning that Variable 1’s character at index 1 necessarily must be the same as Variable 2’s character at index 0 (assuming 0-indexing, again).
• The overlap between Variable 2 and Variable 3 would therefore be represented as the pair (3, 1): character 3 of Variable 2’s value must be the same as character 1 of Variable 3’s value.
• For this problem, we’ll add the additional constraint that all words must be different: the same word should not be repeated multiple times in the puzzle.
• The challenge ahead, then, is to write a program to find a satisfying assignment: a different word (from a given vocabulary list) for each variable such that all of the unary and binary constraints are met.
• The following figure shows the theory required to solve the problem:
• Implement a function enforce_node_consistency() that should update the domains such that each variable is node consistent.
• Implement a function ac3() that should, using the AC3 algorithm, enforce arc consistency on the problem. Recall that arc consistency is achieved when all the values in each variable’s domain satisfy that variable’s binary constraints.
• Recall that the AC3 algorithm maintains a queue of arcs to process. This function takes an optional argument called arcs, representing an initial list of arcs to process. If arcs is None, the function should start with an initial queue of all of the arcs in the problem. Otherwise, the algorithm should begin with an initial queue of only the arcs that are in the list arcs.
• Implement a function consistent() that should check to see if a given assignment is consistent and a function assignment_complete() that should check if a given assignment is complete.
• Implement a function select_unassigned_variable() that should return a single variable in the crossword puzzle that is not yet assigned by assignment, according to the minimum remaining value heuristic and then the degree heuristic. Also, it should return the variable with the fewest number of remaining values in its domain.
• Implement a function backtrack() that should accept a partial assignment as input and, using backtracking search, return a complete satisfactory assignment of variables to values if it is possible to do so.
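The ac3() function with its revise step might be sketched as follows (a standalone sketch: here domains maps variables to sets of candidate words, and overlaps maps ordered variable pairs to the (i, j) index pairs described above, or None when two variables don’t overlap):

```python
from collections import deque

def revise(domains, overlaps, x, y):
    """Make x arc-consistent with y: drop any word in x's domain that
    has no compatible word in y's domain. Return True if changed."""
    overlap = overlaps.get((x, y))
    if overlap is None:
        return False
    i, j = overlap
    revised = False
    for word_x in set(domains[x]):
        if not any(word_x[i] == word_y[j] for word_y in domains[y]):
            domains[x].remove(word_x)
            revised = True
    return revised

def ac3(domains, overlaps, arcs=None):
    """Enforce arc consistency; return False if some domain empties."""
    queue = deque(arcs if arcs is not None else overlaps.keys())
    while queue:
        x, y = queue.popleft()
        if revise(domains, overlaps, x, y):
            if not domains[x]:
                return False
            # x's domain changed, so re-check arcs pointing into x.
            for (z, w) in overlaps:
                if w == x and z != y:
                    queue.append((z, x))
    return True
```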

The following python code snippet shows couple of the above function implementations:

def select_unassigned_variable(self, assignment):
    """
    Return an unassigned variable not already part of assignment.
    Choose the variable with the minimum number of remaining values
    in its domain. If there is a tie, choose the variable with the
    highest degree. If there is a tie, any of the tied variables
    are acceptable return values.
    """
    unassigned_vars = list(self.crossword.variables - \
                           set(assignment.keys()))
    mrv, svar = float('inf'), None
    for var in unassigned_vars:
        if len(self.domains[var]) < mrv:
            mrv, svar = len(self.domains[var]), var
        elif len(self.domains[var]) == mrv:
            # break ties with the degree heuristic
            if len(self.crossword.neighbors(var)) > \
               len(self.crossword.neighbors(svar)):
                svar = var
    return svar

def backtrack(self, assignment):
    """
    Using Backtracking Search, take as input a partial assignment
    for the crossword and return a complete assignment if possible
    to do so.
    assignment is a mapping from variables (keys) to words (values).
    If no assignment is possible, return None.
    """
    # Check if assignment is complete
    if len(assignment) == len(self.crossword.variables):
        return assignment
    # Try a new variable
    var = self.select_unassigned_variable(assignment)
    for value in self.domains[var]:
        new_assignment = assignment.copy()
        new_assignment[var] = value
        if self.consistent(new_assignment):
            result = self.backtrack(new_assignment)
            if result is not None:
                return result
    return None

The following animation shows how the backtracking algorithm with AC3 generates the crossword with the given structure.

######_
____##_
_##____
_##_##_
_##_##_
#___##_

Now let’s compute the speedup obtained with AC3 (on average), i.e., compare the time to generate a crossword puzzle (with the above structure) with and without AC3, averaged over 1000 runs (each time the puzzle generated may be different, though). The next figure shows the result of the runtime experiment: backtracking with AC3 is far faster on average than the backtracking implementation without AC3.

## Shopping

Write an AI agent to predict whether online shopping customers will complete a purchase.

• When users are shopping online, not all will end up purchasing something. Most visitors to an online shopping website, in fact, likely don’t end up going through with a purchase during that web browsing session.
• It might be useful, though, for a shopping website to be able to predict whether a user intends to make a purchase or not: perhaps displaying different content to the user, like showing the user a discount offer if the website believes the user isn’t planning to complete the purchase.
• How could a website determine a user’s purchasing intent? That’s where machine learning will come in.
• The task in this problem is to build a nearest-neighbor classifier to solve this problem. Given information about a user — how many pages they’ve visited, whether they’re shopping on a weekend, what web browser they’re using, etc. — the classifier will predict whether or not the user will make a purchase.
• The classifier won’t be perfectly accurate — but it should be better than guessing randomly.
• To train the classifier, a dataset of about 12,000 user sessions from a shopping website is provided.
• How do we measure the accuracy of a system like this? If we have a testing data set, we could run our classifier and compute what proportion of the time we correctly classify the user’s intent.
• This would give us a single accuracy percentage. But that number might be a little misleading. Imagine, for example, if about 15% of all users end up going through with a purchase. A classifier that always predicted that the user would not go through with a purchase, then, we would measure as being 85% accurate: the only users it classifies incorrectly are the 15% of users who do go through with a purchase. And while 85% accuracy sounds pretty good, that doesn’t seem like a very useful classifier.
• Instead, we’ll measure two values: sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”).
• Sensitivity refers to the proportion of positive examples that were correctly identified: in other words, the proportion of users who did go through with a purchase who were correctly identified.
• Specificity refers to the proportion of negative examples that were correctly identified: in this case, the proportion of users who did not go through with a purchase who were correctly identified.
• So our “always guess no” classifier from before would have perfect specificity (1.0) but no sensitivity (0.0). Our goal is to build a classifier that performs reasonably on both metrics.
• First few rows from the csv dataset are shown in the following figure:
• There are about 12,000 user sessions represented in this spreadsheet: represented as one row for each user session.
• The first six columns measure the different types of pages users have visited in the session: the Administrative , Informational , and ProductRelated columns measure how many of those types of pages the user visited, and their corresponding _Duration columns measure how much time the user spent on any of those pages.
• The BounceRates , ExitRates , and PageValues columns measure information from Google Analytics about the page the user visited. SpecialDay is a value that measures how close the date of the user’s session is to a special day (like Valentine’s Day or Mother’s Day).
• Month is an abbreviation of the month the user visited. OperatingSystems , Browser , Region, and TrafficType are all integers describing information about the user themself.
• VisitorType will take on the value Returning_Visitor for returning visitors and New_Visitor for new visitors to the site.
• Weekend is TRUE or FALSE depending on whether or not the user is visiting on a weekend.
• Perhaps the most important column, though, is the last one: the Revenue column. This is the column that indicates whether the user ultimately made a purchase or not: TRUE if they did, FALSE if they didn’t. This is the column that we’d like to learn to predict (the “label”), based on the values for all of the other columns (the “evidence”).
• Implement a function load_data() that should accept a CSV filename as its argument, open that file, and return a tuple (evidence, labels) . evidence should be a list of all of the evidence for each of the data points, and labels should be a list of all of the labels for each data point.
• Implement a function train_model() that should accept a list of evidence and a list of labels, and return a scikit-learn nearest-neighbor classifier (a k-nearest-neighbor classifier where k = 1) fitted on that training data.
• Relevant concepts in supervised machine learning (classification) needed for this problem are described in the following figure.

Use the following python code snippet to divide the dataset into two parts: the first one to train the 1-NN model on and the second one being a held-out test dataset to evaluate the model using the performance metrics.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

TEST_SIZE = 0.4

# Load data from spreadsheet and split into train and test sets
evidence, labels = load_data('shopping.csv')
X_train, X_test, y_train, y_test = train_test_split(
    evidence, labels, test_size=TEST_SIZE
)

# Train model and make predictions
model = train_model(X_train, y_train)
predictions = model.predict(X_test)
sensitivity, specificity = evaluate(y_test, predictions)

def train_model(evidence, labels):
    """
    Given a list of evidence lists and a list of labels, return a
    fitted k-nearest neighbor model (k=1) trained on the data.
    """
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(evidence, labels)
    return model
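The evaluate() helper used to compute the two metrics might be sketched as follows (a sketch assuming labels are encoded as 1 for a purchase and 0 otherwise):

```python
def evaluate(labels, predictions):
    """
    Given a list of actual labels and a list of predicted labels,
    return a tuple (sensitivity, specificity): the true positive
    rate and the true negative rate.
    """
    true_pos = sum(1 for y, p in zip(labels, predictions)
                   if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions)
                   if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = len(labels) - positives
    sensitivity = true_pos / positives if positives else 0.0
    specificity = true_neg / negatives if negatives else 0.0
    return sensitivity, specificity
```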

With 60–40 validation, the following result is obtained from a sample run, by training the model on the 60% training split, using the model to predict on the 40% held-out test dataset, and then comparing the test predictions with the ground truth:

Correct: 3881
Incorrect: 1051
True Positive Rate: 31.14%
True Negative Rate: 87.37%

Now, let’s repeat the experiment for 25 times (with different random seeds for train_test_split()) and let’s plot the sensitivity against specificity for every random partition, along with color-coded accuracy, to obtain a plot like the following one.

with the mean performance metrics reported as follows (to reduce the variability in the result):

Sensitivity: 0.3044799546457174
Specificity: 0.8747533289505658
Accuracy: 0.7867234387672343

As can be seen from the above result, the specificity is quite good, but the sensitivity obtained is quite poor.

Let’s now plot the decision boundary obtained with 1-NN classifier for different 60–40 train-test validation datasets created, with the data projected across two principal components, the output is shown in the following animation (here the red points represent the cases where purchase happened and the blue points represent the cases where it did not).

## Nim

Implement an AI agent that teaches itself to play Nim through reinforcement learning.

• Recall that in the game Nim, we begin with some number of piles, each with some number of objects. Players take turns: on a player’s turn, the player removes any positive number of objects from any one non-empty pile. Whoever removes the last object loses.
• There’s some simple strategy you might imagine for this game: if there’s only one pile and three objects left in it, and it’s your turn, your best bet is to remove two of those objects, leaving your opponent with the third and final object to remove.
• But if there are more piles, the strategy gets considerably more complicated. In this problem, we’ll build an AI agent to learn the strategy for this game through reinforcement learning. By playing against itself repeatedly and learning from experience, eventually our AI agent will learn which actions to take and which actions to avoid.
• In particular, we’ll use Q-learning for this project. Recall that in Q-learning, we try to learn a reward value (a number) for every (state,action) pair. An action that loses the game will have a reward of -1, an action that results in the other player losing the game will have a reward of 1, and an action that results in the game continuing has an immediate reward of 0, but will also have some future reward.
• How will we represent the states and actions inside of a Python program? A “state” of the Nim game is just the current size of all of the piles. A state, for example, might be [1, 1, 3, 5] , representing the state with 1 object in pile 0, 1 object in pile 1, 3 objects in pile 2, and 5 objects in pile 3. An “action” in the Nim game will be a pair of integers (i, j) , representing the action of taking j objects from pile i . So the action (3, 5) represents the action “from pile 3, take away 5 objects.” Applying that action to the state [1, 1, 3, 5] would result in the new state [1, 1, 3, 0] (the same state, but with pile 3 now empty).
• Recall that the key formula for Q-learning is below. Every time we are in a state s and take an action a , we can update the Q-value Q(s, a) according to: Q(s, a) <- Q(s, a) + alpha * (new value estimate - old value estimate)
• In the above formula, alpha is the learning rate (how much we value new information compared to information we already have).
• The new value estimate represents the sum of the reward received for the current action and the estimate of all the future rewards that the player will receive.
• The old value estimate is just the existing value for Q(s, a) . By applying this formula every time our AI agent takes a new action, over time our AI agent will start to learn which actions are better in any state.
• The concepts required for reinforcement learning and Q-learning are summarized in the following figure.
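Applying an action (i, j) to a state, as described above, can be sketched in a few lines (apply_action is a hypothetical helper illustrating the state/action encoding, not part of the project skeleton):

```python
def apply_action(state, action):
    """Remove j objects from pile i and return the new pile sizes
    (hypothetical helper illustrating the state/action encoding)."""
    i, j = action
    if not 1 <= j <= state[i]:
        raise ValueError("must remove between 1 and pile-size objects")
    new_state = list(state)
    new_state[i] -= j
    return new_state

# e.g. apply_action([1, 1, 3, 5], (3, 5)) returns [1, 1, 3, 0]
```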

The next python code snippet shows how to implement the Q-value update for Q-learning, with the update() method from the NimAI class.

class NimAI():

    def update(self, old_state, action, new_state, reward):
        """
        Update Q-learning model, given an old state, an action taken
        in that state, a new resulting state, and the reward
        """
        old = self.get_q_value(old_state, action)
        best_future = self.best_future_reward(new_state)
        self.update_q_value(old_state, action, old, reward, best_future)

    def update_q_value(self, state, action, old_q, reward, future_rewards):
        """
        Update the Q-value for the state state and the action action,
        given the previous Q-value old_q, a current reward reward, and
        an estimate of future rewards future_rewards.

        Use the formula:

            Q(s, a) <- old value estimate
                       + alpha * (new value estimate - old value estimate)

        where old value estimate is the previous Q-value, alpha is the
        learning rate, and new value estimate is the sum of the current
        reward and estimated future rewards.
        """
        self.q[(tuple(state), action)] = old_q + self.alpha * (
            (reward + future_rewards) - old_q)
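The update() method relies on a few helper methods. A minimal sketch of how they might look is given below (shown as a standalone class for brevity; the exact bodies here are illustrative, following the epsilon-greedy Q-learning description above rather than the project's distributed code):

```python
import random

class NimAI:
    def __init__(self, alpha=0.5, epsilon=0.1):
        self.q = dict()          # maps (state tuple, action) -> Q-value
        self.alpha = alpha       # learning rate
        self.epsilon = epsilon   # exploration probability

    def get_q_value(self, state, action):
        # Unseen (state, action) pairs default to a Q-value of 0
        return self.q.get((tuple(state), action), 0)

    def best_future_reward(self, state):
        # Best Q-value over all actions available in `state`
        # (0 if no actions are available, i.e. all piles are empty)
        actions = [(i, j) for i, pile in enumerate(state)
                   for j in range(1, pile + 1)]
        if not actions:
            return 0
        return max(self.get_q_value(state, a) for a in actions)

    def choose_action(self, state, epsilon=True):
        # Epsilon-greedy: explore with probability epsilon,
        # otherwise pick the action with the highest Q-value
        actions = [(i, j) for i, pile in enumerate(state)
                   for j in range(1, pile + 1)]
        if epsilon and random.random() < self.epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: self.get_q_value(state, a))
```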

The AI agent plays 10,000 training games to update the Q-values; the following animation shows how the Q-values are updated in the last few games (for a few selected state-action pairs):

The following animation shows an example game played between the AI agent (the computer) and the human (the green, red and white boxes in each row of the first subplot represent the non-empty, to be removed and empty piles for the corresponding row, respectively, for the AI agent and the human player).

As can be seen from the above figure, one of the actions with the highest Q-value for the AI agent’s first turn is (3,6), which is why it removes 6 objects from pile 3. Similarly, in the AI agent’s second turn, one of the best actions is (2,5), which is why it removes 5 objects from pile 2, and so on.

The following animation shows another example game the AI agent plays with a human:

## Traffic

Write an AI agent to identify which traffic sign appears in a photograph.

• As research continues in the development of self-driving cars, one of the key challenges is computer vision, allowing these cars to develop an understanding of their environment from digital images.
• In particular, this involves the ability to recognize and distinguish road signs — stop signs, speed limit signs, yield signs, and more.
• The following figure contains the theory required to perform the multi-class classification task using a deep neural network.
• In this problem, we shall use TensorFlow (https://www.tensorflow.org/) to build a neural network to classify road signs based on an image of those signs.
• To do so, we shall need a labeled dataset: a collection of images that have already been categorized by the road sign represented in them.
• Several such data sets exist, but for this project, we’ll use the German Traffic Sign Recognition Benchmark (http://benchmark.ini.rub.de/?section=gtsrb&subsection=news) (GTSRB) dataset, which contains thousands of images of 43 different kinds of road signs.
• The dataset contains 43 sub-directories, numbered 0 through 42. Each numbered sub-directory represents a different category (a different type of road sign). Within each traffic sign’s directory is a collection of images of that type of traffic sign.
• Implement the load_data() function which should accept as an argument data_dir, representing the path to a directory where the data is stored, and return image arrays and labels for each image in the data set.
• While loading the dataset, the images have to be resized to a fixed size. If the images are resized correctly, each of them will have shape as (30, 30, 3).
• Also, implement the function get_model() which should return a compiled neural network model.
• The following python code snippet shows how to implement the classification model with keras, along with the output produced when the model is run.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

EPOCHS = 10
IMG_WIDTH = 30
IMG_HEIGHT = 30
NUM_CATEGORIES = 43
TEST_SIZE = 0.4

def get_model():
    """
    Returns a compiled convolutional neural network model.
    Assume that the input_shape of the first layer is
    (IMG_WIDTH, IMG_HEIGHT, 3).
    The output layer should have NUM_CATEGORIES units, one for
    each category.
    """
    # Create a convolutional neural network
    model = tf.keras.models.Sequential([
        # Convolutional layer: learn 32 filters using a 3x3 kernel
        tf.keras.layers.Conv2D(
            32, (3, 3), activation="relu",
            input_shape=(IMG_WIDTH, IMG_HEIGHT, 3)
        ),
        # Max-pooling layer, using 2x2 pool size
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        # Convolutional layer: learn 64 filters using a 3x3 kernel
        tf.keras.layers.Conv2D(64, (3, 3), activation="relu"),
        # Max-pooling layer, using 2x2 pool size
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        # Convolutional layer: learn 128 filters using a 3x3 kernel
        tf.keras.layers.Conv2D(128, (3, 3), activation="relu"),
        # Max-pooling layer, using 2x2 pool size
        tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
        # Flatten units
        tf.keras.layers.Flatten(),
        # Add a hidden layer with dropout
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dropout(0.5),
        # Output layer with one unit per road-sign category
        tf.keras.layers.Dense(NUM_CATEGORIES, activation="softmax")
    ])
    model.summary()

    # Compile the neural network
    model.compile(
        loss="categorical_crossentropy",
        metrics=["accuracy"]
    )
    return model

# Get image arrays and labels for all image files
images, labels = load_data('gtsrb')

# Split data into training and testing sets
labels = tf.keras.utils.to_categorical(labels)
x_train, x_test, y_train, y_test = train_test_split(
    np.array(images), np.array(labels), test_size=TEST_SIZE
)

# Get a compiled neural network
model = get_model()

# Fit model on training data
model.fit(x_train, y_train, epochs=EPOCHS)

# Evaluate neural network performance
model.evaluate(x_test, y_test, verbose=2)

# Model: "sequential"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# conv2d (Conv2D)              (None, 28, 28, 32)        896
# _________________________________________________________________
# max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0
# _________________________________________________________________
# conv2d_1 (Conv2D)            (None, 12, 12, 64)        18496
# _________________________________________________________________
# max_pooling2d_1 (MaxPooling2 (None, 6, 6, 64)          0
# _________________________________________________________________
# conv2d_2 (Conv2D)            (None, 4, 4, 128)         73856
# _________________________________________________________________
# max_pooling2d_2 (MaxPooling2 (None, 2, 2, 128)         0
# _________________________________________________________________
# flatten (Flatten)            (None, 512)               0
# _________________________________________________________________
# dense (Dense)                (None, 256)               131328
# _________________________________________________________________
# dropout (Dropout)            (None, 256)               0
# _________________________________________________________________
# dense_1 (Dense)              (None, 43)                11051
# =================================================================
# Total params: 235,627
# Trainable params: 235,627
# Non-trainable params: 0
# _________________________________________________________________
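One way the load_data() function used above might be structured is sketched below. Note the design choice made here for illustration: the image-reading backend is injected as a parameter (read_image, a hypothetical argument that is not part of the project's actual signature — in practice it would wrap something like cv2.imread plus cv2.resize), so the directory-walking logic can be shown and tested without tying it to one image library:

```python
import os

def load_data(data_dir, read_image, img_size=(30, 30)):
    """Return (images, labels) from numbered category sub-directories.

    read_image(path, size) is an injected reader (e.g. a wrapper around
    cv2.imread + cv2.resize) that returns a (size[0], size[1], 3) array,
    or None if the file is not a readable image.
    """
    images, labels = [], []
    for category in sorted(os.listdir(data_dir)):
        category_dir = os.path.join(data_dir, category)
        # Skip stray files; categories are directories named 0..42
        if not os.path.isdir(category_dir):
            continue
        for filename in os.listdir(category_dir):
            img = read_image(os.path.join(category_dir, filename), img_size)
            if img is not None:
                images.append(img)
                labels.append(int(category))
    return images, labels
```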
• The architecture of the deep network is shown in the next figure:
• The following output shows how the cross-entropy loss on the training dataset decreases and the accuracy of classification increases over epochs on the training dataset.
Epoch 1/10
15864/15864 [==============================] - 17s 1ms/sample - loss: 2.3697 - acc: 0.4399
Epoch 2/10
15864/15864 [==============================] - 15s 972us/sample - loss: 0.6557 - acc: 0.8133
Epoch 3/10
15864/15864 [==============================] - 16s 1ms/sample - loss: 0.3390 - acc: 0.9038
Epoch 4/10
15864/15864 [==============================] - 16s 995us/sample - loss: 0.2526 - acc: 0.9285
Epoch 5/10
15864/15864 [==============================] - 16s 1ms/sample - loss: 0.1992 - acc: 0.9440
Epoch 6/10
15864/15864 [==============================] - 16s 985us/sample - loss: 0.1705 - acc: 0.9512
Epoch 7/10
15864/15864 [==============================] - 15s 956us/sample - loss: 0.1748 - acc: 0.9536
Epoch 8/10
15864/15864 [==============================] - 15s 957us/sample - loss: 0.1551 - acc: 0.9578
Epoch 9/10
15864/15864 [==============================] - 16s 988us/sample - loss: 0.1410 - acc: 0.9630
Epoch 10/10
15864/15864 [==============================] - 14s 878us/sample - loss: 0.1303 - acc: 0.9667
10577/10577 - 3s - loss: 0.1214 - acc: 0.9707
• The following animation again shows how the cross-entropy loss on the training dataset decreases and the accuracy of classification increases over epochs on the training dataset.
• The following figure shows a few sample images from the traffic dataset with different labels.
• The next animations show the features learnt at various convolution and pooling layers.
• The next figure shows the evaluation result on test dataset — the corresponding confusion matrix.

## Parser

Write an AI agent to parse sentences and extract noun phrases.

• A common task in natural language processing is parsing, the process of determining the structure of a sentence.
• This is useful for a number of reasons: knowing the structure of a sentence can help a computer to better understand the meaning of the sentence, and it can also help the computer extract information out of a sentence.
• In particular, it’s often useful to extract noun phrases out of a sentence to get an understanding for what the sentence is about.
• In this problem, we’ll use the context-free grammar formalism to parse English sentences to determine their structure.
• Recall that in a context-free grammar, we repeatedly apply rewriting rules to transform symbols into other symbols. The objective is to start with a non-terminal symbol S (representing a sentence) and repeatedly apply context-free grammar rules until we generate a complete sentence of terminal symbols (i.e., words).
• The rule S -> N V , for example, means that the S symbol can be rewritten as N V (a noun followed by a verb). If we also have the rule N -> “Holmes” and the rule V -> “sat” , we can generate the complete sentence “Holmes sat.” .
• Of course, noun phrases might not always be as simple as a single word like “Holmes” . We might have noun phrases like “my companion” or “a country walk” or “the day before Thursday” , which require more complex rules to account for.
• To account for the phrase “my companion” , for example, we might imagine a rule like: NP -> N | Det N. In this rule, we say that an NP (a “noun phrase”) could be either just a noun ( N ) or a determiner ( Det ) followed by a noun, where determiners include words like “a” , “the” , and “my” . The vertical bar ( | ) just indicates that there are multiple possible ways to rewrite an NP , with each possible rewrite separated by a bar.
• To incorporate this rule into how we parse a sentence ( S ), we’ll also need to modify our S -> N V rule to allow for noun phrases ( NP s) as the subject of our sentence. See how? And to account for more complex types of noun phrases, we may need to modify our grammar even further.
• The sentences that need to be parsed by the CFG (to be constructed) are shown below:
Holmes sat.
Holmes lit a pipe.
We arrived the day before Thursday.
Holmes sat in the red armchair and he chuckled.
My companion smiled an enigmatical smile.
Holmes chuckled to himself.
She never said a word until we were at the door here.
Holmes sat down and lit his pipe.
I had a country walk on Thursday and came home in a dreadful mess.
I had a little moist red paint in the palm of my hand.

Implementation tips

• The NONTERMINALS global variable should be replaced with a set of context-free grammar rules that, when combined with the rules in TERMINALS , allow the parsing of all sentences in the sentences/ directory.
• Each rule must be on its own line. Each rule must include the -> characters to denote which symbol is being replaced, and may optionally include | symbols if there are multiple ways to rewrite a symbol.
• You do not need to keep the existing rule S -> N V in your solution, but the first rule must begin with S -> , since S (representing a sentence) is the starting symbol.
• May add as many nonterminal symbols as needed.
• Use the nonterminal symbol NP to represent a “noun phrase”, such as the subject of a sentence.
• A few rules needed to parse the given sentences are shown below (we need to add a few more rules):
NONTERMINALS = """
S -> NP VP | S Conj VP | S Conj S | S P S
NP -> N | Det N | AP NP | Det AP NP | N PP | Det N PP
"""
• Few rules for the TERMINALS from the given grammar are shown below:
TERMINALS = """
Adj -> "country" | "dreadful" | "enigmatical" | "little" | "moist" | "red"
Conj -> "and"
Det -> "a" | "an" | "his" | "my" | "the"
N -> "armchair" | "companion" | "day" | "door" | "hand" | "he" | "himself"
P -> "at" | "before" | "in" | "of" | "on" | "to" | "until"
V -> "smiled" | "tell" | "were"
"""
• The next python code snippet implements the function np_chunk() that should accept an nltk tree representing the syntax of a sentence, and return a list of all of the noun phrase chunks in that sentence.
• For this problem, a “noun phrase chunk” is defined as a noun phrase that doesn’t contain other noun phrases within it. Put more formally, a noun phrase chunk is a subtree of the original tree whose label is NP and that does not itself contain other noun phrases as subtrees.
• Assume that the input will be a nltk.tree object whose label is S (that is to say, the input will be a tree representing a sentence).
• The function should return a list of nltk.tree objects, where each element  has the label NP.
• The documentation for nltk.tree will be helpful for identifying how to manipulate a nltk.tree object.
import nltk

def np_chunk(tree):
    """
    Return a list of all noun phrase chunks in the sentence tree.
    A noun phrase chunk is defined as any subtree of the sentence
    whose label is "NP" that does not itself contain any other
    noun phrases as subtrees.
    """
    chunks = []
    for s in tree.subtrees(lambda t: t.label() == 'NP'):
        if len(list(s.subtrees(lambda t: t.label() == 'NP' and
                               t != s))) == 0:
            chunks.append(s)
    return chunks

The following animation shows how the given sentences are parsed with the grammar rules constructed.
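Before parsing, each sentence has to be tokenized into lowercase words containing at least one alphabetic character. A minimal stand-in for that preprocessing step is sketched below (this regex tokenizer is an assumption for illustration; the project itself uses nltk.word_tokenize):

```python
import re

def preprocess(sentence):
    """Lowercase a sentence and keep only tokens that contain at least
    one alphabetic character (simplified stand-in for nltk tokenization)."""
    words = re.findall(r"[a-z']+", sentence.lower())
    return [w for w in words if any(c.isalpha() for c in w)]

# e.g. preprocess("Holmes sat.") returns ['holmes', 'sat']
```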

## Questions

Write an AI agent to answer questions from a given corpus

• Question Answering (QA) is a field within natural language processing focused on designing systems that can answer questions.
• Among the more famous question answering systems is Watson , the IBM computer that competed (and won) on Jeopardy!.
• A question answering system of Watson’s accuracy requires enormous complexity and vast amounts of data, but in this problem, we’ll design a very simple question answering system based on inverse document frequency.
• Our question answering system will perform two tasks: document retrieval and passage retrieval.
• Our system will have access to a corpus of text documents.
• When presented with a query (a question in English asked by the user), document retrieval will first identify which document(s) are most relevant to the query.
• Once the top documents are found, the top document(s) will be subdivided into passages (in this case, sentences) so that the most relevant passage to the question can be determined.
• How do we find the most relevant documents and passages? To find the most relevant documents, we’ll use tf-idf to rank documents based both on term frequency for words in the query as well as inverse document frequency for words in the query. The required concepts are summarized in the following figure:
• Once we’ve found the most relevant documents, there are many possible metrics for scoring passages, but we’ll use a combination of inverse document frequency and a query term density measure.
• More sophisticated question answering systems might employ other strategies (analyzing the type of question word used, looking for synonyms of query words, lemmatizing to handle different forms of the same word, etc.), but that comes only after the baseline version is implemented.

Implementation tips

• Implement a function compute_idfs() that should accept a dictionary of documents and return a new dictionary mapping words to their IDF (inverse document frequency) values.
• Assume that documents will be a dictionary mapping names of documents to a list of words in that document.
• The returned dictionary should map every word that appears in at least one of the documents to its inverse document frequency value.
• Recall that the inverse document frequency of a word is defined by taking the natural logarithm of the number of documents divided by the number of documents in which the word appears.
• Implement the top_files() function that should, given a query (a set of words), files (a dictionary mapping names of files to a list of their words), and idfs (a dictionary mapping words to their IDF values), return a list of the filenames of the n top files that match the query, ranked according to tf-idf.
• The returned list should be of length n and ordered with the best match first.
• Files should be ranked according to the sum of tf-idf values for any word in the query that also appears in the file. Words in the query that do not appear in the file should not contribute to the file’s score.
• Recall that tf-idf for a term is computed by multiplying the number of times the term appears in the document by the IDF value for that term.
• Implement the function top_sentences() that should, given a query (a set of words), sentences (a dictionary mapping sentences to a list of their words), and idfs (a dictionary mapping words to their IDF values), return a list of the n top sentences that match the query, ranked according to IDF.
• The returned list of sentences should be of length n and should be ordered with the best match first.
• Sentences should be ranked according to “matching word measure”: namely, the sum of IDF values for any word in the query that also appears in the sentence. Note that term frequency should not be taken into account here, only inverse document frequency.
• If two sentences have the same value according to the matching word measure, then sentences with a higher “query term density” should be preferred. Query term density is defined as the proportion of words in the sentence that are also words in the query.
• The next python code snippet shows how the couple of functions mentioned above are implemented:
import math

def compute_idfs(documents):
    """
    Given a dictionary of documents that maps names of documents
    to a list of words, return a dictionary that maps words to their
    IDF values.
    Any word that appears in at least one of the documents should
    be in the resulting dictionary.
    """
    words = set()
    for document in documents:
        words.update(documents[document])
    idfs = dict()
    for word in words:
        f = sum(word in documents[document] for document in documents)
        idfs[word] = math.log(len(documents) / f)
    return idfs

def top_files(query, files, idfs, n):
    """
    Given a query (a set of words), files (a dictionary mapping
    names of files to a list of their words), and idfs (a
    dictionary mapping words to their IDF values), return a list of
    the filenames of the n top files that match the query,
    ranked according to tf-idf.
    """
    scores = []
    for filename in files:
        score = 0
        for word in query:
            tf = len(list(filter(lambda w: w == word, files[filename])))
            score += tf * idfs.get(word, 0)
        scores.append((score, filename))
    # Sort by descending tf-idf score and return the top n filenames
    scores.sort(key=lambda x: -x[0])
    return [filename for score, filename in scores[:n]]
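The top_sentences() function described earlier can be sketched in the same style (a sketch that follows the matching-word-measure and query-term-density rules stated above):

```python
def top_sentences(query, sentences, idfs, n):
    """Return the n best sentences for the query, ranked by the sum of
    IDF values of query words appearing in the sentence ("matching word
    measure"), with ties broken by query term density."""
    scored = []
    for sentence, words in sentences.items():
        # Sum of IDFs for query words present in the sentence
        idf = sum(idfs.get(word, 0) for word in query if word in words)
        # Proportion of sentence words that are query words
        density = sum(word in query for word in words) / len(words)
        scored.append((idf, density, sentence))
    scored.sort(key=lambda x: (-x[0], -x[1]))
    return [sentence for _, _, sentence in scored[:n]]
```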

The following animation shows the answers to the questions found from the given corpus: for each question, the best candidate document likely containing the answer is first found in the corpus, and then the top 5 candidate answer sentences are found inside that document.

Artificial Intelligence is a vast area of research and it contains many different sub-areas those themselves are huge. In this blog, we discussed some problems from Search, Knowledge (Logic) and Uncertainty, Optimization, Learning, Neural Network and Language. We discussed how to implement AI agents for an adversarial search game (tic-tac-toe), a logic-based game (minesweeper) and a reinforcement learning-based game (Nim). We have also seen how to generate crossword puzzles by representing the problem as a CSP problem. Finally, we have seen application of AI in Computer Vision (traffic sign classification needed for self driving cars) and NLP (question-answering system with information retrieval).

# Gaussian Process Regression With Python

In this blog, we shall discuss Gaussian Process Regression: the basic concepts, how it can be implemented with python from scratch, and also using the GPy library. Then we shall demonstrate an application of GPR in Bayesian optimization with the GPyOpt library. The problems appeared in this coursera course on Bayesian Methods for Machine Learning by HSE and also in this Machine learning course provided at UBC.

# Gaussian Process

A GP is a Gaussian distribution over functions, specified by two parameters: the mean function (m) and the kernel function K (which controls smoothness). In this article, we shall implement non-linear regression with GP.

Given training data points (X,y) we want to learn a (non-linear) function f:R^d -> R (here X is d-dimensional), s.t., y = f(x).

Then use the function f to predict the value of y for unseen data points Xtest, along with the confidence of prediction.

The following figure describes the basic concepts of a GP and how it can be used for regression. We need to use the conditional expectation and variance formula (given the data) to compute the posterior distribution for the GP.

Here, we shall first discuss Gaussian Process Regression. Let’s follow the steps below to get some intuition.

• Generate 10 data points (these points will serve as training datapoints) with negligible noise (corresponds to noiseless GP regression). Use the following python function with default noise variance.

import numpy as np

def generate_noisy_points(n=10, noise_variance=1e-6):
    np.random.seed(777)
    X = np.random.uniform(-3., 3., (n, 1))
    y = np.sin(X) + np.random.randn(n, 1) * noise_variance**0.5
    return X, y

• Plot the points with the following code snippet.

import matplotlib.pylab as plt
X, y = generate_noisy_points()
plt.plot(X, y, 'x')
plt.show()

• Now, let’s implement the algorithm for GP regression, the one shown in the above figure. First lets generate 100 test data points.
Xtest, ytest = generate_noisy_points(100)
Xtest.sort(axis=0)

• Draw 10 function samples from the GP prior distribution using the following python code.
n = len(Xtest)
K = kernel(Xtest, Xtest)
L = np.linalg.cholesky(K + noise_var*np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, n_samples)))

• The following animation shows the sample functions drawn from the GP prior distribution.
• Next, let’s compute the GP posterior distribution given the original (training) 10 data points, using the following python code snippet.
N, n = len(X), len(Xtest)
K = kernel(X, X)
L = np.linalg.cholesky(K + noise_var*np.eye(N))
K_ = kernel(Xtest, Xtest)
Lk = np.linalg.solve(L, kernel(X, Xtest))
mu = np.dot(Lk.T, np.linalg.solve(L, y))
L = np.linalg.cholesky(K_ + noise_var*np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,n_samples)))
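The snippet above implements the standard GP posterior equations. With K = k(X,X), K_* = k(X, X_*) and K_{**} = k(X_*, X_*), the posterior mean and covariance at the test points X_* are:

```latex
\mu_* = K_*^\top \left(K + \sigma_n^2 I\right)^{-1} \mathbf{y}, \qquad
\Sigma_* = K_{**} - K_*^\top \left(K + \sigma_n^2 I\right)^{-1} K_*
```

The Cholesky factorization of K + σ²I is used in the code as a numerically stable way of applying the inverse, and the posterior samples are then drawn from N(μ_*, Σ_*).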

• The following animation shows 10 function samples drawn from the GP posterior distribution. As expected, we get nearly zero uncertainty in the prediction for the points that are present in the training dataset, and the variance increases as we move further from these points.
• The kernel function used here is the RBF kernel, which can be implemented with the following python code snippet.
def kernel(x, y, l2):
    sqdist = np.sum(x**2,1).reshape(-1,1) + \
             np.sum(y**2,1) - 2*np.dot(x, y.T)
    return np.exp(-.5 * (1/l2) * sqdist)

• Now, let’s predict with the Gaussian Process Regression model, using the following python function:
def posterior(X, Xtest, l2=0.1, noise_var=1e-6):
    # compute the mean at our test points
    N, n = len(X), len(Xtest)
    K = kernel(X, X, l2)
    L = np.linalg.cholesky(K + noise_var*np.eye(N))
    Lk = np.linalg.solve(L, kernel(X, Xtest, l2))
    mu = np.dot(Lk.T, np.linalg.solve(L, y))
    # compute the variance at our test points
    K_ = kernel(Xtest, Xtest, l2)
    sd = np.sqrt(np.diag(K_) - np.sum(Lk**2, axis=0))
    return (mu, sd)

• Use the above function to predict the mean and standard deviation at x=1.
mu, sd = posterior(X, np.array([[1.]]))
print(mu, sd)
# [[0.85437633]] [0.26444361]

• The following figure shows the predicted values along with the associated 3 s.d. confidence.
• As can be seen, the highest confidence (corresponds to zero confidence interval) is again at the training data points. The blue curve represents the original function, the red one being the predicted function with GP and the red “+” points are the training data points.

Noisy Gaussian Regression

• Now let’s increase the noise variance to implement the noisy version of GP.
• The following animation shows how the predictions and the confidence interval change as the noise variance is increased: the predictions become more and more uncertain, as expected.
• Next, let’s see how varying the RBF kernel parameter l changes the confidence interval, in the following animation.

# Gaussian processes and Bayesian optimization

Now, let’s learn how to use the GPy and GPyOpt libraries to deal with Gaussian processes. These libraries provide quite simple and intuitive interfaces for training and inference, and we will try to get familiar with them in a few tasks. The following figure shows the basic concepts required for GP regression again.

## Gaussian Process Regression with GPy (documentation)

• Again, let’s start with a simple regression problem, for which we will try to fit a Gaussian Process with RBF kernel.
• Create RBF kernel with variance sigma_f and length-scale parameter l for 1D samples and compute value of the kernel between points, using the following code snippet.
import GPy
import GPyOpt
import seaborn as sns
sigma_f, l = 1.5, 2
kernel = GPy.kern.RBF(1, sigma_f, l)
sns.heatmap(kernel.K(X, X))
plt.show()

• The following figure shows what the kernel heatmap looks like (we have 10 points in the training data, so the computed kernel is a 10×10 matrix).
• Let’s fit a GP on the training data points, using the kernel from the previous task. As shown in the code below, use the GPy.models.GPRegression class to predict the mean and variance at position 𝑥=1.
X, y = generate_noisy_points(noise_variance=0.01)
model = GPy.models.GPRegression(X,y,kernel)
mean, variance = model.predict(np.array([[1.]]))
print(mean, variance)
# 0.47161301004863576 1.1778512693257484

• Let’s see the parameters of the model and plot the model
print(model)
model.plot()
plt.show()
# Name : GP regression
# Objective : 11.945995014694255
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Parameters:
#   GP_regression.           |               value  |  constraints  |  priors
#   rbf.variance             |  0.5884024388364221  |      +ve      |
#   rbf.lengthscale          |   1.565659066396689  |      +ve      |
#   Gaussian_noise.variance  |                 1.0  |      +ve      |

• Observe that the model didn’t fit the data very well. Let’s try to fit the kernel and noise parameters automatically.
model.optimize()
print(model)
# Name : GP regression
# Objective : 0.691713117288967
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Parameters:
#   GP_regression.           |                  value  |  constraints  |  priors
#   rbf.variance             |     0.5088590289246014  |      +ve      |
#   rbf.lengthscale          |     1.1019439281553658  |      +ve      |
#   Gaussian_noise.variance  |  0.0030183424485056066  |      +ve      |

• Now plot the model to obtain a figure like the following one.
• As can be seen from the above figure, the optimized model now fits the data quite well.

## Differentiating noise and signal with GP

• Generate two datasets: a sinusoid without noise (with the function generate_noisy_points() and noise variance 0) and samples from Gaussian noise (with the function generate_noise() defined below).
def generate_noise(n=10, noise_variance=0.01):
    np.random.seed(777)
    X = np.random.uniform(-3., 3., (n, 1))
    y = np.random.randn(n, 1) * noise_variance**0.5
    return X, y

X, y = generate_noise(noise_variance=10)

• Optimize the kernel parameters and compute the optimal value of the noise component for the pure-noise data.
model = GPy.models.GPRegression(X,y,kernel)
model.optimize()
print(model)
# Name : GP regression
# Objective : 26.895319516885678
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Parameters:
#   GP_regression.           |              value  |  constraints  |  priors
#   rbf.variance             |  4.326712527380182  |      +ve      |
#   rbf.lengthscale          |  0.613701590417825  |      +ve      |
#   Gaussian_noise.variance  |  9.006031601676087  |      +ve      |

• Now optimize the kernel parameters and compute the optimal value of the noise component for the signal without noise.
X, y = generate_noisy_points(noise_variance=0)
model = GPy.models.GPRegression(X,y,kernel)
model.optimize()
print(model)
# Name : GP regression
# Objective : -22.60291145140328
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Parameters:
# GP_regression.           |                  value  |  constraints  |  priors
# rbf.variance             |      2.498516381000242  |      +ve      |
# rbf.lengthscale          |     2.4529513517426444  |      +ve      |
# Gaussian_noise.variance  |  5.634716625480888e-16  |      +ve      |

• As can be seen from the two fits above, the GP separates noise from signal correctly: Gaussian_noise.variance is high (≈ 9) for the pure-noise dataset and essentially zero (≈ 5.6e-16) for the noiseless sinusoid.

# Sparse GP

• Now let’s consider the speed of GP. Let’s generate a dataset of 1000 points and measure the time consumed for predicting the mean and variance at a point.
• Then let’s try to use inducing inputs and find the optimal number of points according to quality-time tradeoff.
• For the sparse model with inducing points, you should use GPy.models.SparseGPRegression class.
• The number of inducing inputs can be set with parameter num_inducing and optimize their positions and values with .optimize() call.
• Let’s first create a dataset of 1000 points and fit GPRegression. Measure time for predicting mean and variance at position 𝑥=1.
import time
X, y = generate_noisy_points(1000)
start = time.time()
model = GPy.models.GPRegression(X,y,kernel)
mean, variance = model.predict(np.array([[1.]]))
time_gp = time.time()-start
print(mean, variance, time_gp)
# [[0.84157973]] [[1.08298092e-06]] 7.353320360183716

• Then fit SparseGPRegression with 10 inducing inputs and repeat the experiment. Let’s find speedup as a ratio between consumed time without and with inducing inputs.
start = time.time()
model = GPy.models.SparseGPRegression(X,y,kernel,num_inducing=10)
model.optimize()
mean, variance = model.predict(np.array([[1.]]))
time_sgp = time.time()-start
print(mean, variance, time_sgp)
# [[0.84159203]] [[1.11154212e-06]] 0.8615052700042725

• As can be seen, the sparse GP achieves a speedup of more than 8× using only 10 inducing points, while producing almost identical predictions.
time_gp / time_sgp
# 8.535432824627119
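The speedup comes from the reduced cost of inference: exact GP regression requires factorizing an n×n covariance matrix at O(n³) cost, while a sparse GP with m inducing inputs costs O(nm²). A back-of-envelope comparison for the sizes used above (n = 1000, m = 10); constants are ignored, which is why the measured speedup of ~8 is far below this asymptotic ratio:

```python
n, m = 1000, 10
exact_cost = n ** 3       # O(n^3): factorize the full n x n covariance matrix
sparse_cost = n * m ** 2  # O(n m^2): work through the m inducing inputs instead
print(exact_cost / sparse_cost)  # -> 10000.0
```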

• The model is shown in the next figure.

# Bayesian optimization

• Bayesian Optimization is used when the objective function has no explicit (closed-form) expression and is expensive to evaluate.
• As shown in the next figure, a GP is used along with an acquisition (utility) function to choose the next point to sample, where it’s more likely to find the maximum value in an unknown objective function.
• A GP is constructed from the points already sampled and the next point is sampled from the region where the GP posterior has higher mean (to exploit) and larger variance (to explore), which is determined by the maximum value of the acquisition function (which is a function of GP posterior mean and variance).
• To choose the next point to be sampled, the above process is repeated.
• The next couple of figures show the basic concepts of Bayesian optimization using GP, the algorithm, how it works, along with a few popular acquisition functions.
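To make the acquisition-function idea concrete, here is a minimal numpy sketch of Probability of Improvement (the criterion behind the ‘MPI’ acquisition used with GPyOpt below); the posterior means, standard deviations, and the trade-off parameter xi are made-up illustrative values:

```python
import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def probability_of_improvement(mu, sigma, y_best, xi=0.1):
    """PI(x) = Phi((mu(x) - y_best - xi) / sigma(x)), for maximization."""
    sigma = np.maximum(sigma, 1e-12)  # guard against zero predictive variance
    z = (mu - y_best - xi) / sigma
    return np.array([norm_cdf(v) for v in z])

# Hypothetical GP posterior at three candidate points
mu = np.array([1.0, 1.5, 0.5])
sigma = np.array([0.2, 0.1, 0.5])
pi = probability_of_improvement(mu, sigma, y_best=1.2)
print(pi.argmax())  # -> 1: the candidate most likely to improve on y_best
```

The high-mean, low-variance candidate wins here (exploitation), but a candidate with a large enough variance can also score well (exploration).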

## Bayesian Optimization with GPyOpt (documentation)

### Parameter Tuning for XGBoost Regressor

• Let’s now try to find the optimal hyperparameters for the XGBoost model using Bayesian optimization with GP, with the diabetes dataset (from sklearn) as input. Let’s first load the dataset with the following Python code snippet:
from sklearn import datasets
from xgboost import XGBRegressor
from sklearn.model_selection import cross_val_score

dataset = datasets.load_diabetes()
X = dataset['data']
y = dataset['target']

• We will use the cross-validation score to estimate accuracy, and our goal will be to tune the learning_rate, gamma, max_depth, n_estimators and min_child_weight parameters. First, we have to define the objective function and the parameter domains, as shown in the code below.
# Optimizer will try to find the minimum, so let's add a "-" sign.
def f(parameters):
    # GPyOpt passes a 2-D array of shape (1, d); columns follow the bounds order
    parameters = parameters[0]
    score = -cross_val_score(
        XGBRegressor(learning_rate=parameters[0],
                     max_depth=int(parameters[2]),
                     n_estimators=int(parameters[3]),
                     gamma=int(parameters[1]),
                     min_child_weight=parameters[4]),
        X, y, scoring='neg_root_mean_squared_error'
    ).mean()
    score = np.array(score)
    return score

# Bounds (define continuous variables first, then discrete!)
bounds = [
    {'name': 'learning_rate',
     'type': 'continuous',
     'domain': (0, 1)},
    {'name': 'gamma',
     'type': 'continuous',
     'domain': (0, 5)},
    {'name': 'max_depth',
     'type': 'discrete',
     'domain': (1, 50)},
    {'name': 'n_estimators',
     'type': 'discrete',
     'domain': (1, 300)},
    {'name': 'min_child_weight',
     'type': 'discrete',
     'domain': (1, 10)}
]
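Note that GPyOpt passes each candidate point to the objective as a 2-D array of shape (1, d), with columns ordered as in the bounds list, so the objective has to unpack the single row and index the individual columns (casting the discrete ones to int). A minimal stand-in that mimics this calling convention, with a made-up candidate row:

```python
import numpy as np

# Column order follows the bounds list above:
# learning_rate, gamma, max_depth, n_estimators, min_child_weight
candidate = np.array([[0.2, 1.35, 1.0, 300.0, 1.0]])  # shape (1, 5)

params = candidate[0]          # unpack the single row
learning_rate = params[0]
gamma = params[1]
max_depth = int(params[2])     # discrete parameters are cast to int
n_estimators = int(params[3])
min_child_weight = params[4]
print(max_depth, n_estimators)  # -> 1 300
```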

• Let’s first compute the baseline RMSE with the default XGBoost parameters, and then see if we can do better.
baseline = -cross_val_score(
XGBRegressor(), X, y, scoring='neg_root_mean_squared_error'
).mean()
baseline
# 64.90693011829266

• Now, run the Bayesian optimization with GPyOpt and plot convergence, as in the next code snippet:
import GPyOpt

optimizer = GPyOpt.methods.BayesianOptimization(
    f=f, domain=bounds,
    acquisition_type='MPI',
    acquisition_par=0.1,
    exact_eval=True
)
max_iter = 50
max_time = 60
optimizer.run_optimization(max_iter, max_time)
optimizer.plot_convergence()

• Extract the best values of the parameters and compute the RMSE / gain obtained with Bayesian Optimization, using the following code.
optimizer.X[np.argmin(optimizer.Y)]
# array([2.01515532e-01, 1.35401092e+00, 1.00000000e+00,
#        3.00000000e+02, 1.00000000e+00])

print('RMSE:', np.min(optimizer.Y),
      'Gain:', baseline/np.min(optimizer.Y)*100)
# RMSE: 57.6844355488563 Gain: 112.52069904249859
• As can be seen, we were able to get a ~12% improvement without tuning the parameters by hand.

### Parameter Tuning for SVR

• Now, let’s tune a Support Vector Regressor model with Bayesian Optimization and find the optimal values for three parameters: C, epsilon and gamma.
• Let’s use the range (1e-5, 1000) for C, and (1e-5, 10) for both epsilon and gamma.
• Let’s use MPI as an acquisition function with weight 0.1.
from sklearn.svm import SVR

# Bounds (define continuous variables first, then discrete!)
bounds = [
    {'name': 'C',
     'type': 'continuous',
     'domain': (1e-5, 1000)},
    {'name': 'epsilon',
     'type': 'continuous',
     'domain': (1e-5, 10)},
    {'name': 'gamma',
     'type': 'continuous',
     'domain': (1e-5, 10)}
]

# Score. Optimizer will try to find the minimum, so we will add a "-" sign.
def f(parameters):
    # GPyOpt passes a 2-D array of shape (1, d); columns follow the bounds order
    parameters = parameters[0]
    score = -cross_val_score(
        SVR(C=parameters[0],
            epsilon=parameters[1],
            gamma=parameters[2]),
        X, y, scoring='neg_root_mean_squared_error'
    ).mean()
    score = np.array(score)
    return score

optimizer = GPyOpt.methods.BayesianOptimization(
    f=f, domain=bounds,
    acquisition_type='MPI',
    acquisition_par=0.1,
    exact_eval=True
)
max_iter = 50*4
max_time = 60*4
optimizer.run_optimization(max_iter, max_time)

baseline = -cross_val_score(
    SVR(), X, y, scoring='neg_root_mean_squared_error'
).mean()
print(baseline)
# 70.44352670586173

print(optimizer.X[np.argmin(optimizer.Y)])
# [126.64337652   8.49323372   8.59189135]

print('RMSE:', np.min(optimizer.Y),
      'Gain:', baseline/np.min(optimizer.Y)*100)
# RMSE: 54.02576574389976 Gain: 130.38876124364006

best_epsilon = optimizer.X[np.argmin(optimizer.Y)][1]  # epsilon is the 2nd column
optimizer.plot_convergence()
• For the SVR model above, the performance boost obtained after tuning the hyperparameters was about 30%.