Neural Translation – Machine Translation with Neural Nets with Keras / Python

In this blog, we shall discuss how to build a neural network to translate from English to German. This problem appeared as the Capstone project for the Coursera course “Tensorflow 2: Customising your model“, a part of the specialization “Tensorflow 2 for Deep Learning“ by Imperial College London. The problem statement / description / steps are taken from the course itself. We shall use concepts from the course, including building flexible model architectures, freezing layers, data processing pipelines and sequence modelling.

Image taken from the Capstone project

Here we shall use a language dataset from http://www.manythings.org/anki/ to build a neural translation model. This dataset consists of over 200k pairs of sentences in English and German. In order to make the training quicker, we will restrict our dataset to 20k pairs. The figure below shows a few sentence pairs taken from the file.

Our goal is to develop a neural translation model from English to German, making use of a pre-trained English word embedding module.

1. Text preprocessing

We need to start with preprocessing the above input file. Here are the steps that we need to follow:

  • First let’s create separate lists of English and German sentences.
  • Add special “<start>” and “<end>” tokens to the beginning and end of every German sentence.
  • Use the Tokenizer class from the tf.keras.preprocessing.text module to tokenize the German sentences, ensuring that no character filters are applied.

The next figure shows 5 randomly chosen examples of (preprocessed) English and German sentence pairs. For the German sentence, the text (with start and end tokens) as well as the tokenized sequence are shown.
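The tokenization step is not shown explicitly in this excerpt; a minimal sketch (assuming german_sentences is the list of German sentences with the start / end tokens already added) could be:

from tensorflow.keras.preprocessing.text import Tokenizer

# filters='' disables the default character filters
# (the defaults would strip the '<' and '>' of the special tokens)
tokenizer = Tokenizer(filters='')
tokenizer.fit_on_texts(german_sentences)
tokenized_german_sentences = tokenizer.texts_to_sequences(german_sentences)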

  • Pad the end of the tokenized German sequences with zeros, and batch the complete set of sequences into a single numpy array, using the following code snippet.

padded_tokenized_german_sentences = tf.keras.preprocessing.sequence.pad_sequences(tokenized_german_sentences, 
                                                                                  maxlen=14, padding='post', value=0) 
padded_tokenized_german_sentences.shape
#(20000, 14)

As can be seen from the next code block, the maximum length of a German sentence is 14, whereas there are 5743 unique words in the German sentences from the subset of the corpus. The index of the <start> token is 1.

max([len(tokenized_german_sentences[i]) for i in range(20000)])
# 14
len(tokenizer.index_word)
# 5743
tokenizer.word_index['<start>']
# 1

2. Preparing the data with tf.data.Dataset

Loading the embedding layer

As part of the dataset preprocessing for this project we shall use a pre-trained English word-embedding module from TensorFlow Hub. The URL for the module is https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1.

This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space.

Although this module can also be used as a sentence embedding module (in which case it processes each string by removing punctuation, splitting on spaces and averaging the word embeddings over the sentence to give a single embedding vector), we shall use it only as a word embedding module here, and will pass each word in the input sentence as a separate token.
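One way to load this module (a sketch, assuming the tensorflow_hub package is installed; the name embedding_layer matches its use below):

import tensorflow as tf
import tensorflow_hub as hub

# load the pre-trained word embedding module from TF Hub
embedding_layer = hub.KerasLayer("https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1",
                                 output_shape=[128], input_shape=[], dtype=tf.string)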

The following code snippet shows how an English sentence with 7 words is mapped into a 7×128 tensor in the embedding space.


embedding_layer(tf.constant(["these", "aren't", "the", "droids", "you're", "looking", "for"])).shape
# TensorShape([7, 128])

Now, let’s prepare the training and validation Datasets as follows:

  • Create a random training and validation set split of the data, reserving e.g. 20% of the data for validation (each English dataset example is a single sentence string, and each German dataset example is a sequence of padded integer tokens); a sketch of this split is given after the snippet below.
  • Load the training and validation sets into a tf.data.Dataset object, passing in a tuple of English and German data for both training and validation sets, using the following code snippet.

def make_Dataset(input_array, target_array):
    return tf.data.Dataset.from_tensor_slices((input_array, target_array)) 

train_data = make_Dataset(input_train, target_train)
valid_data = make_Dataset(input_valid, target_valid)
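The split itself is not shown above; a minimal sketch with scikit-learn (assuming english_sentences holds the raw English strings and padded_tokenized_german_sentences is the padded array from step 1):

import numpy as np
from sklearn.model_selection import train_test_split

input_train, input_valid, target_train, target_valid = train_test_split(
    np.array(english_sentences), padded_tokenized_german_sentences, test_size=0.2)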

  • Create a function to map over the datasets that splits each English sentence at spaces. Apply this function to both Dataset objects using the map method, using the following code snippet.

def str_split(e, g):
    e = tf.strings.split(e)
    return e, g

train_data = train_data.map(str_split)
valid_data = valid_data.map(str_split)

  • Create a function to map over the datasets that embeds each sequence of English words using the loaded embedding layer/model. Apply this function to both Dataset objects using the map method, using the following code snippet.

def embed_english(x, y):
    return embedding_layer(x), y

train_data = train_data.map(embed_english)
valid_data = valid_data.map(embed_english)

  • Create a function to filter out dataset examples where the English sentence is more than 13 (embedded) tokens in length. Apply this function to both Dataset objects using the filter method, using the following code snippet.

def remove_long_sentence(e, g):
    return tf.shape(e)[0] <= 13

train_data = train_data.filter(remove_long_sentence)
valid_data = valid_data.filter(remove_long_sentence)

  • Create a function to map over the datasets that pads each English sequence of embeddings with some distinct padding value before the sequence, so that each sequence is length 13. Apply this function to both Dataset objects using the map method, as shown in the next code block. 

def pad_english(e, g):
    return tf.pad(e, paddings = [[13-tf.shape(e)[0],0], [0,0]], mode='CONSTANT', constant_values=0), g

train_data = train_data.map(pad_english)
valid_data = valid_data.map(pad_english)

  • Batch both training and validation Datasets with a batch size of 16.

train_data = train_data.batch(16)
valid_data = valid_data.batch(16)

  • Let’s now print the element_spec property for the training and validation Datasets. Also, let’s print the shape of an English data example from the training Dataset and a German data example Tensor from the validation Dataset.

train_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
# TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

valid_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
# TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

for e, g in train_data.take(1):
    print(e.shape)
#(16, 13, 128)

for e, g in valid_data.take(1):
    print(g)
#tf.Tensor(
#[[   1   11  152    6  458    3    2    0    0    0    0    0    0    0]
# [   1   11  333  429    3    2    0    0    0    0    0    0    0    0]
# [   1   11   59   12    3    2    0    0    0    0    0    0    0    0]
# [   1  990   25   42  444    7    2    0    0    0    0    0    0    0]
# [   1    4   85 1365    3    2    0    0    0    0    0    0    0    0]
# [   1  131    8   22    5  583    3    2    0    0    0    0    0    0]
# [   1    4   85 1401    3    2    0    0    0    0    0    0    0    0]
# [   1   17  381   80    3    2    0    0    0    0    0    0    0    0]
# [   1 2998   13   33    7    2    0    0    0    0    0    0    0    0]
# [   1  242    6  479    3    2    0    0    0    0    0    0    0    0]
# [   1   35   17   40    7    2    0    0    0    0    0    0    0    0]
# [   1   11   30  305   46   47 1913  471    3    2    0    0    0    0]
# [   1    5   48 1184    3    2    0    0    0    0    0    0    0    0]
# [   1    5  287   12  834 5268    3    2    0    0    0    0    0    0]
# [   1    5    6  523    3    2    0    0    0    0    0    0    0    0]
# [   1   13  109   28   29   44  491    3    2    0    0    0    0    0]], shape=(16, 14), dtype=int32)

The custom translation model

The following is a schematic of the custom translation model architecture we shall develop now.

Image taken from the Capstone project

The custom model consists of an encoder RNN and a decoder RNN. The encoder takes words of an English sentence as input, and uses a pre-trained word embedding to embed the words into a 128-dimensional space. To indicate the end of the input sentence, a special end token (in the same 128-dimensional space) is passed in as an input. This token is a TensorFlow Variable that is learned in the training phase (unlike the pre-trained word embedding, which is frozen).

The decoder RNN takes the internal state of the encoder network as its initial state. A start token is passed in as the first input, which is embedded using a learned German word embedding. The decoder RNN then makes a prediction for the next German word, which during inference is then passed in as the following input, and this process is repeated until the special <end> token is emitted from the decoder.

Create the custom layer

Let’s create a custom layer to add the learned end token embedding to the encoder model:

Image taken from the Capstone project

Now let’s first build the custom layer, which will be later used to create the encoder.

  • Using layer subclassing, create a custom layer that takes a batch of English data examples from one of the Datasets, and adds a learned embedded ‘end’ token to the end of each sequence.
  • This layer should create a TensorFlow Variable (that will be learned during training) that is 128-dimensional (the size of the embedding space).

from tensorflow.keras.models import  Sequential, Model
from tensorflow.keras.layers import Layer, Concatenate, Input, Masking, LSTM, Embedding, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)
        self.embed = tf.Variable(initial_value=tf.zeros(shape=(1,128)), trainable=True, dtype='float32')
        
    def call(self, inputs):
        x = tf.tile(self.embed, [tf.shape(inputs)[0], 1])
        x = tf.expand_dims(x, axis=1)
        return tf.concat([inputs, x], axis=1)
        #return Concatenate(axis=1)([inputs, x])

  • Let’s extract a batch of English data examples from the training Dataset and print the shape. Test the custom layer by calling the layer on the English data batch Tensor and print the resulting Tensor shape (the layer should increase the sequence length by one).

custom_layer = CustomLayer()
e, g = next(iter(train_data.take(1)))
print(e.shape)
# (16, 13, 128)
o = custom_layer(e)
o.shape
# TensorShape([16, 14, 128])

Build the encoder network

The encoder network follows the schematic diagram above. Now let’s build the RNN encoder model.

  • Using the keras functional API, build the encoder network according to the following spec:
    • The model will take a batch of sequences of embedded English words as input, as given by the Dataset objects.
    • The next layer in the encoder will be the custom layer you created previously, to add a learned end token embedding to the end of the English sequence.
    • This is followed by a Masking layer, with the mask_value set to the distinct padding value you used when you padded the English sequences with the Dataset preprocessing above.
    • The final layer is an LSTM layer with 512 units, which also returns the hidden and cell states.
    • The encoder is a multi-output model. There should be two output Tensors of this model: the hidden and cell states of the LSTM layer. The output sequence of the LSTM layer is unused.

inputs = Input(batch_shape = (None, 13, 128), name='input')
x = CustomLayer(name='custom_layer')(inputs)
x = Masking(mask_value=0, name='masking_layer')(x)
x, h, c = LSTM(units=512, return_state=True, name='lstm')(x)
encoder_model = Model(inputs = inputs, outputs = [h, c], name='encoder')

encoder_model.summary()

# Model: "encoder"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input (InputLayer)           [(None, 13, 128)]         0         
# _________________________________________________________________
# custom_layer (CustomLayer)   (None, 14, 128)           128       
# _________________________________________________________________
# masking_layer (Masking)      (None, 14, 128)           0         
# _________________________________________________________________
# lstm (LSTM)                  [(None, 512), (None, 512) 1312768   
# =================================================================
# Total params: 1,312,896
# Trainable params: 1,312,896
# Non-trainable params: 0
# _________________________________________________________________

Build the decoder network

The decoder network follows the schematic diagram below.

Image taken from the Capstone project

Now let’s build the RNN decoder model.

  • Using Model subclassing, build the decoder network according to the following spec:
    • The initializer should create the following layers:
      • An Embedding layer with vocabulary size set to the number of unique German tokens, embedding dimension 128, and set to mask zero values in the input.
      • An LSTM layer with 512 units, that returns its hidden and cell states, and also returns sequences.
      • A Dense layer with number of units equal to the number of unique German tokens, and no activation function.
    • The call method should include the usual inputs argument, as well as the additional keyword arguments hidden_state and cell_state. The default value for these keyword arguments should be None.
    • The call method should pass the inputs through the Embedding layer, and then through the LSTM layer. If the hidden_state and cell_state arguments are provided, these should be used for the initial state of the LSTM layer. 
    • The call method should pass the LSTM output sequence through the Dense layer, and return the resulting Tensor, along with the hidden and cell states of the LSTM layer.

class Decoder(Model):
    
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embed = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=128, mask_zero=True, name='embedding_layer')
        self.lstm = LSTM(units = 512, return_state = True, return_sequences = True, name='lstm_layer')
        self.dense = Dense(len(tokenizer.index_word)+1, name='dense_layer')
        
    def call(self, inputs, hidden_state = None, cell_state = None):
        x = self.embed(inputs)
        if hidden_state is not None and cell_state is not None:
            x, hidden_state, cell_state = self.lstm(x, initial_state = [hidden_state, cell_state])
        else:
            x, hidden_state, cell_state = self.lstm(x)
        x = self.dense(x)
        return x, hidden_state, cell_state

decoder_model = Decoder(name='decoder')
e, g_in = next(iter(train_data.take(1)))
h, c = encoder_model(e)
g_out, h, c = decoder_model(g_in, h, c)

print(g_out.shape, h.shape, c.shape)
# (16, 14, 5744) (16, 512) (16, 512)

decoder_model.summary()

#Model: "decoder"
#_________________________________________________________________
#Layer (type)                 Output Shape              Param #   
#=================================================================
#embedding_layer (Embedding)  multiple                  735232    
#_________________________________________________________________
#lstm_layer (LSTM)            multiple                  1312768   
#_________________________________________________________________
#dense_layer (Dense)          multiple                  2946672   
#=================================================================
#Total params: 4,994,672
#Trainable params: 4,994,672
#Non-trainable params: 0

Create a custom training loop

Now let’s write a custom training loop to train our custom neural translation model.

  • Define a function that takes a Tensor batch of German data (as extracted from the training Dataset), and returns a tuple containing German inputs and outputs for the decoder model (refer to the schematic diagram above); a sketch of this helper is given just after this list.
  • Define a function that computes the forward and backward pass for your translation model. This function should take an English input, German input and German output as arguments, and should do the following:
    • Pass the English input into the encoder, to get the hidden and cell states of the encoder LSTM.
    • These hidden and cell states are then passed into the decoder, along with the German inputs, which returns a sequence of outputs (the hidden and cell state outputs of the decoder LSTM are unused in this function).
    • The loss should then be computed between the decoder outputs and the German output function argument.
    • The function returns the loss and gradients with respect to the encoder and decoder’s trainable variables.
    • Decorate the function with @tf.function
  • Define and run a custom training loop for a number of epochs (for you to choose) that does the following:
    • Iterates through the training dataset, and creates decoder inputs and outputs from the German sequences.
    • Updates the parameters of the translation model using the gradients of the function above and an optimizer object.
    • Every epoch, compute the validation loss on a number of batches from the validation dataset, and save the epoch training and validation losses.
  • Plot the learning curves for loss vs epoch for both training and validation sets.
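The helper from the first bullet (get_german_decoder_data in the loop below) is not shown in this excerpt; a minimal sketch implementing standard teacher forcing:

def get_german_decoder_data(g):
    # decoder input: the padded sequence without its final token;
    # decoder output: the same sequence shifted left by one (teacher forcing)
    return g[:, :-1], g[:, 1:]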

@tf.function
def forward_backward(encoder_model, decoder_model, e, g_in, g_out, loss):
    with tf.GradientTape() as tape:
        h, c = encoder_model(e)
        d_g_out, _, _ = decoder_model(g_in, h, c)
        cur_loss = loss(g_out, d_g_out)
    grads = tape.gradient(cur_loss, encoder_model.trainable_variables + decoder_model.trainable_variables)
    return cur_loss, grads

def train_encoder_decoder(encoder_model, decoder_model, num_epochs, train_data, valid_data, valid_steps, 
                          optimizer, loss, grad_fn):
    train_losses = []
    val_losses = []
    for epoch in range(num_epochs):
        train_epoch_loss_avg = tf.keras.metrics.Mean()
        val_epoch_loss_avg = tf.keras.metrics.Mean()
        for e, g in train_data:
            g_in, g_out = get_german_decoder_data(g)
            train_loss, grads = grad_fn(encoder_model, decoder_model, e, g_in, g_out, loss)
            optimizer.apply_gradients(zip(grads, encoder_model.trainable_variables + decoder_model.trainable_variables))
            train_epoch_loss_avg.update_state(train_loss)    
        for e_v, g_v in valid_data.take(valid_steps):
            g_v_in, g_v_out = get_german_decoder_data(g_v)
            val_loss, _ = grad_fn(encoder_model, decoder_model, e_v, g_v_in, g_v_out, loss)
            val_epoch_loss_avg.update_state(val_loss)        
        print(f'epoch: {epoch}, train loss: {train_epoch_loss_avg.result()}, validation loss: {val_epoch_loss_avg.result()}')    
        train_losses.append(train_epoch_loss_avg.result())
        val_losses.append(val_epoch_loss_avg.result())
    return train_losses, val_losses

optimizer_obj = Adam(learning_rate = 1e-3)
loss_obj = SparseCategoricalCrossentropy(from_logits=True)
train_loss_results, valid_loss_results = train_encoder_decoder(encoder_model, decoder_model, 20, train_data, valid_data, 20,
                                                          optimizer_obj, loss_obj, forward_backward)

#epoch: 0, train loss: 4.4570465087890625, validation loss: 4.1102800369262695
#epoch: 1, train loss: 3.540217399597168, validation loss: 3.36271333694458
#epoch: 2, train loss: 2.756622076034546, validation loss: 2.7144060134887695
#epoch: 3, train loss: 2.049957275390625, validation loss: 2.1480133533477783
#epoch: 4, train loss: 1.4586931467056274, validation loss: 1.7304519414901733
#epoch: 5, train loss: 1.0423369407653809, validation loss: 1.4607685804367065
#epoch: 6, train loss: 0.7781839370727539, validation loss: 1.314332127571106
#epoch: 7, train loss: 0.6160411238670349, validation loss: 1.2391613721847534
#epoch: 8, train loss: 0.5013922452926636, validation loss: 1.1840368509292603
#epoch: 9, train loss: 0.424654096364975, validation loss: 1.1716119050979614
#epoch: 10, train loss: 0.37027251720428467, validation loss: 1.1612160205841064
#epoch: 11, train loss: 0.3173922598361969, validation loss: 1.1330692768096924
#epoch: 12, train loss: 0.2803193926811218, validation loss: 1.1394184827804565
#epoch: 13, train loss: 0.24854864180088043, validation loss: 1.1354353427886963
#epoch: 14, train loss: 0.22135266661643982, validation loss: 1.1059410572052002
#epoch: 15, train loss: 0.2019050121307373, validation loss: 1.1111358404159546
#epoch: 16, train loss: 0.1840481162071228, validation loss: 1.1081823110580444
#epoch: 17, train loss: 0.17126116156578064, validation loss: 1.125329852104187
#epoch: 18, train loss: 0.15828527510166168, validation loss: 1.0979799032211304
#epoch: 19, train loss: 0.14451280236244202, validation loss: 1.0899451971054077

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Loss", fontsize=14)
plt.title('Loss vs epochs')
plt.plot(train_loss_results, label='train')
plt.plot(valid_loss_results, label='valid')
plt.legend()
plt.show()

The following figure shows how the training and validation loss decrease with epochs (the model is trained for 20 epochs).

Use the model to translate

Now it’s time to put the model into practice! Let’s run the translation for five randomly sampled English sentences from the dataset. For each sentence, the process is as follows:

  • Preprocess and embed the English sentence according to the model requirements.
  • Pass the embedded sentence through the encoder to get the encoder hidden and cell states.
  • Starting with the special "<start>" token, use this token and the final encoder hidden and cell states to get the one-step prediction from the decoder, as well as the decoder’s updated hidden and cell states.
  • Create a loop to get the next step prediction and updated hidden and cell states from the decoder, using the most recent hidden and cell states. Terminate the loop when the "<end>" token is emitted, or when the sentence has reached a maximum length.
  • Decode the output token sequence into German text and print the English text and the model’s German translation.


indices = np.random.choice(len(english_sentences), 5)
test_data = tf.data.Dataset.from_tensor_slices(np.array([english_sentences[i] for i in indices]))
test_data = test_data.map(tf.strings.split)
test_data = test_data.map(embedding_layer)
test_data = test_data.filter(lambda x: tf.shape(x)[0] <= 13)
test_data = test_data.map(lambda x: tf.pad(x, paddings = [[13-tf.shape(x)[0],0], [0,0]], mode='CONSTANT', constant_values=0))
print(test_data.element_spec)
# TensorSpec(shape=(None, 128), dtype=tf.float32, name=None)

start_token = np.array(tokenizer.texts_to_sequences(['<start>']))
end_token = np.array(tokenizer.texts_to_sequences(['<end>']))
for e, i in zip(test_data.take(5), indices):
    h, c = encoder_model(tf.expand_dims(e, axis=0))
    g_t = []
    g_in = start_token
    g_out, h, c = decoder_model(g_in, h, c)
    g_t.append('<start>')
    g_out = tf.argmax(g_out, axis=2)
    while g_out != end_token: 
        g_out, h, c = decoder_model(g_in, h, c)
        g_out = tf.argmax(g_out, axis=2)
        g_in = g_out
        g_t.append(tokenizer.index_word.get(tf.squeeze(g_out).numpy(), 'UNK'))
    print(f'English Text: {english_sentences[i]}')
    print(f'German Translation: {" ".join(g_t)}')
    print()

# English Text: i'll see tom .
# German Translation: <start> ich werde tom folgen . <end>

# English Text: you're not alone .
# German Translation: <start> keine nicht allein . <end>

# English Text: what a hypocrite !
# German Translation: <start> fuer ein idiot ! <end>

# English Text: he kept talking .
# German Translation: <start> sie hat ihn erwuergt . <end>

# English Text: tom's in charge .
# German Translation: <start> tom ist im bett . <end>

The above output shows the sample English sentences and their German translations predicted by the model.

The following animation (click and open in a new tab) shows how the predicted German translation improves (as the loss decreases) for a few sample English sentences as the deep learning model is trained for more and more epochs.

As can be seen from the above animation, the translation gets better as the deep learning model is trained for more and more epochs. The following is the YouTube version of the same animation.

NLP with Bangla: semantic similarity with word2vec, Deep learning (RNN) to generate Bangla song-like texts and to do sentiment analysis on astrological prediction dataset, creating a simple Bangla ChatBot using RASA NLU with Python

In this blog, we shall discuss a few NLP techniques for the Bangla language. We shall start with a demonstration of how to train a word2vec model on the Bangla wiki corpus with tensorflow, and how to visualize the semantic similarity between words using t-SNE. Next, we shall demonstrate how to train a character / word LSTM on selected Tagore songs to generate Tagore-like songs with keras. Next, we shall create a sentiment analysis dataset by crawling the daily astrological prediction pages of a leading Bangla newspaper and manually labeling the sentiment of each of the predictions corresponding to each moon-sign. We shall train an LSTM sentiment analysis model to predict the sentiment of a moon-sign prediction. Finally, we shall use RASA NLU (natural language understanding) to build a very simple chatbot in Bangla.

Word2vec model with Bangla wiki corpus with tensorflow

  • Let’s start by importing the required libraries
import collections
import math, random
import numpy as np
import tensorflow as tf
from matplotlib import pylab
  • Download the Bangla wikipedia corpus from Kaggle. The first few lines from the corpus are shown below:

id,text,title,url

1528,

“রবীন্দ্রনাথ ঠাকুর”

রবীন্দ্রনাথ ঠাকুর (৭ই মে, ১৮৬১ – ৭ই আগস্ট, ১৯৪১) (২৫ বৈশাখ, ১২৬৮ – ২২ শ্রাবণ, ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। তাঁকে বাংলা ভাষার সর্বশ্রেষ্ঠ সাহিত্যিক মনে করা হয়। রবীন্দ্রনাথকে গুরুদেব, কবিগুরু ও বিশ্বকবি অভিধায় ভূষিত করা হয়। রবীন্দ্রনাথের ৫২টি কাব্যগ্রন্থ, ৩৮টি নাটক, ১৩টি উপন্যাস ও ৩৬টি প্রবন্ধ ও অন্যান্য গদ্যসংকলন তাঁর জীবদ্দশায় বা মৃত্যুর অব্যবহিত পরে প্রকাশিত হয়। তাঁর সর্বমোট ৯৫টি ছোটগল্প ও ১৯১৫টি গান যথাক্রমে “”গল্পগুচ্ছ”” ও “”গীতবিতান”” সংকলনের অন্তর্ভুক্ত হয়েছে। রবীন্দ্রনাথের যাবতীয় প্রকাশিত ও গ্রন্থাকারে অপ্রকাশিত রচনা ৩২ খণ্ডে “”রবীন্দ্র রচনাবলী”” নামে প্রকাশিত হয়েছে। রবীন্দ্রনাথের যাবতীয় পত্রসাহিত্য উনিশ খণ্ডে “”চিঠিপত্র”” ও চারটি পৃথক গ্রন্থে প্রকাশিত। এছাড়া তিনি প্রায় দুই হাজার ছবি এঁকেছিলেন। রবীন্দ্রনাথের রচনা বিশ্বের বিভিন্ন ভাষায় অনূদিত হয়েছে। ১৯১৩ সালে “”গীতাঞ্জলি”” কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি সাহিত্যে নোবেল পুরস্কার লাভ করেন।রবীন্দ্রনাথ ঠাকুর কলকাতার এক ধনাঢ্য ও সংস্কৃতিবান ব্রাহ্ম পিরালী ব্রাহ্মণ পরিবারে জন্মগ্রহণ করেন। বাল্যকালে প্রথাগত বিদ্যালয়-শিক্ষা তিনি গ্রহণ করেননি; গৃহশিক্ষক রেখে বাড়িতেই তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। আট বছর বয়সে তিনি কবিতা লেখা শুরু করেন। ১৮৭৪ সালে “”তত্ত্ববোধিনী পত্রিকা””-এ তাঁর “””” কবিতাটি প্রকাশিত হয়। এটিই ছিল তাঁর প্রথম প্রকাশিত রচনা। ১৮৭৮ সালে মাত্র সতেরো বছর বয়সে রবীন্দ্রনাথ প্রথমবার ইংল্যান্ডে যান। ১৮৮৩ সালে মৃণালিনী দেবীর সঙ্গে তাঁর বিবাহ হয়। ১৮৯০ সাল থেকে রবীন্দ্রনাথ পূর্ববঙ্গের শিলাইদহের জমিদারি এস্টেটে বসবাস শুরু করেন। ১৯০১ সালে তিনি পশ্চিমবঙ্গের শান্তিনিকেতনে ব্রহ্মচর্যাশ্রম প্রতিষ্ঠা করেন এবং সেখানেই পাকাপাকিভাবে বসবাস শুরু করেন। ১৯০২ সালে তাঁর পত্নীবিয়োগ হয়। ১৯০৫ সালে তিনি বঙ্গভঙ্গ-বিরোধী আন্দোলনে জড়িয়ে পড়েন। ১৯১৫ সালে ব্রিটিশ সরকার তাঁকে নাইট উপাধিতে ভূষিত করেন। কিন্তু ১৯১৯ সালে জালিয়ানওয়ালাবাগ হত্যাকাণ্ডের প্রতিবাদে তিনি সেই উপাধি ত্যাগ করেন। ১৯২১ সালে গ্রামোন্নয়নের জন্য তিনি শ্রীনিকেতন নামে একটি সংস্থা প্রতিষ্ঠা করেন। ১৯২৩ সালে আনুষ্ঠানিকভাবে বিশ্বভারতী প্রতিষ্ঠিত হয়। দীর্ঘজীবনে তিনি বহুবার বিদেশ ভ্রমণ করেন এবং সমগ্র বিশ্বে বিশ্বভ্রাতৃত্বের বাণী প্রচার করেন। ১৯৪১ সালে দীর্ঘ রোগভোগের পর কলকাতার পৈত্রিক বাসভবনেই তাঁর মৃত্যু হয়।রবীন্দ্রনাথের কাব্যসাহিত্যের বৈশিষ্ট্য ভাবগভীরতা, গীতিধর্মিতা চিত্ররূপময়তা, অধ্যাত্মচেতনা, ঐতিহ্যপ্রীতি, প্রকৃতিপ্রেম, মানবপ্রেম, স্বদেশপ্রেম, বিশ্বপ্রেম, রোম্যান্টিক সৌন্দর্যচেতনা, ভাব, ভাষা, ছন্দ ও আঙ্গিকের বৈচিত্র্য, বাস্তবচেতনা ও প্রগতিচেতনা। রবীন্দ্রনাথের গদ্যভাষাও কাব্যিক। ভারতের ধ্রুপদি ও লৌকিক সংস্কৃতি এবং পাশ্চাত্য বিজ্ঞানচেতনা ও শিল্পদর্শন তাঁর রচনায় গভীর প্রভাব বিস্তার করেছিল। কথাসাহিত্য ও প্রবন্ধের মাধ্যমে তিনি সমাজ, রাজনীতি ও রাষ্ট্রনীতি সম্পর্কে নিজ মতামত প্রকাশ করেছিলেন। সমাজকল্যাণের উপায় হিসেবে তিনি গ্রামোন্নয়ন ও গ্রামের দরিদ্র মানুষ কে শিক্ষিত করে তোলার পক্ষে মতপ্রকাশ করেন। এর পাশাপাশি সামাজিক ভেদাভেদ, অস্পৃশ্যতা, ধর্মীয় গোঁড়ামি ও ধর্মান্ধতার বিরুদ্ধেও তিনি তীব্র প্রতিবাদ জানিয়েছিলেন। রবীন্দ্রনাথের দর্শনচেতনায় ঈশ্বরের মূল হিসেবে মানব সংসারকেই নির্দিষ্ট করা হয়েছে; রবীন্দ্রনাথ দেববিগ্রহের পরিবর্তে কর্মী অর্থাৎ মানুষ ঈশ্বরের পূজার কথা বলেছিলেন। সংগীত ও নৃত্যকে তিনি শিক্ষার অপরিহার্য অঙ্গ মনে করতেন। রবীন্দ্রনাথের গান তাঁর অন্যতম শ্রেষ্ঠ কীর্তি। তাঁর রচিত “”আমার সোনার বাংলা”” ও “”জনগণমন-অধিনায়ক জয় হে”” গানদুটি যথাক্রমে গণপ্রজাতন্ত্রী বাংলাদেশ ও ভারতীয় প্রজাতন্ত্রের জাতীয় সংগীত।

জীবন.

প্রথম জীবন (১৮৬১–১৯০১).

শৈশব ও কৈশোর (১৮৬১ – ১৮৭৮).
রবীন্দ্রনাথ ঠাকুর কলকাতার জোড়াসাঁকো ঠাকুরবাড়িতে জন্মগ্রহণ করেছিলেন। তাঁর পিতা ছিলেন ব্রাহ্ম ধর্মগুরু দেবেন্দ্রনাথ ঠাকুর (১৮১৭–১৯০৫) এবং মাতা ছিলেন সারদাসুন্দরী দেবী (১৮২৬–১৮৭৫)। রবীন্দ্রনাথ ছিলেন পিতামাতার চতুর্দশ সন্তান। জোড়াসাঁকোর ঠাকুর পরিবার ছিল ব্রাহ্ম আদিধর্ম মতবাদের প্রবক্তা। রবীন্দ্রনাথের পূর্ব পুরুষেরা খুলনা জেলার রূপসা উপজেলা পিঠাভোগে বাস করতেন। ১৮৭৫ সালে মাত্র চোদ্দ বছর বয়সে রবীন্দ্রনাথের মাতৃবিয়োগ ঘটে। পিতা দেবেন্দ্রনাথ দেশভ্রমণের নেশায় বছরের অধিকাংশ সময় কলকাতার বাইরে অতিবাহিত করতেন। তাই ধনাঢ্য পরিবারের সন্তান হয়েও রবীন্দ্রনাথের ছেলেবেলা কেটেছিল ভৃত্যদের অনুশাসনে। শৈশবে রবীন্দ্রনাথ কলকাতার ওরিয়েন্টাল সেমিনারি, নর্ম্যাল স্কুল, বেঙ্গল অ্যাকাডেমি এবং সেন্ট জেভিয়ার্স কলেজিয়েট স্কুলে কিছুদিন করে পড়াশোনা করেছিলেন। কিন্তু বিদ্যালয়-শিক্ষায় অনাগ্রহী হওয়ায় বাড়িতেই গৃহশিক্ষক রেখে তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। ছেলেবেলায় জোড়াসাঁকোর বাড়িতে অথবা বোলপুর ও পানিহাটির বাগানবাড়িতে প্রাকৃতিক পরিবেশের মধ্যে ঘুরে বেড়াতে বেশি স্বচ্ছন্দবোধ করতেন রবীন্দ্রনাথ।১৮৭৩ সালে এগারো বছর বয়সে রবীন্দ্রনাথের উপনয়ন অনুষ্ঠিত হয়েছিল। এরপর তিনি কয়েক মাসের জন্য পিতার সঙ্গে দেশভ্রমণে বের হন। প্রথমে তাঁরা আসেন শান্তিনিকেতনে। এরপর পাঞ্জাবের অমৃতসরে কিছুকাল কাটিয়ে শিখদের উপাসনা পদ্ধতি পরিদর্শন করেন। শেষে পুত্রকে নিয়ে দেবেন্দ্রনাথ যান পাঞ্জাবেরই (অধুনা ভারতের হিমাচল প্রদেশ রাজ্যে অবস্থিত) ডালহৌসি শৈলশহরের নিকট বক্রোটায়। এখানকার বক্রোটা বাংলোয় বসে রবীন্দ্রনাথ পিতার কাছ থেকে সংস্কৃত ব্যাকরণ, ইংরেজি, জ্যোতির্বিজ্ঞান, সাধারণ বিজ্ঞান ও ইতিহাসের নিয়মিত পাঠ নিতে শুরু করেন। দেবেন্দ্রনাথ তাঁকে বিশিষ্ট ব্যক্তিবর্গের জীবনী, কালিদাস রচিত ধ্রুপদি সংস্কৃত কাব্য ও নাটক এবং উপনিষদ্‌ পাঠেও উৎসাহিত করতেন। ১৮৭৭ সালে “”ভারতী”” পত্রিকায় তরুণ রবীন্দ্রনাথের কয়েকটি গুরুত্বপূর্ণ রচনা প্রকাশিত হয়। এগুলি হল মাইকেল মধুসূদনের “”””, “”ভানুসিংহ ঠাকুরের পদাবলী”” এবং “””” ও “””” নামে দুটি গল্প। এর মধ্যে “”ভানুসিংহ ঠাকুরের পদাবলী”” বিশেষভাবে উল্লেখযোগ্য। এই কবিতাগুলি রাধা-কৃষ্ণ বিষয়ক পদাবলির অনুকরণে “”ভানুসিংহ”” ভণিতায় রচিত। রবীন্দ্রনাথের “”ভিখারিণী”” গল্পটি (১৮৭৭) বাংলা সাহিত্যের প্রথম ছোটগল্প। ১৮৭৮ সালে প্রকাশিত হয় রবীন্দ্রনাথের প্রথম কাব্যগ্রন্থ তথা প্রথম মুদ্রিত গ্রন্থ “”কবিকাহিনী””। এছাড়া এই পর্বে তিনি রচনা করেছিলেন “””” (১৮৮২) কাব্যগ্রন্থটি। রবীন্দ্রনাথের বিখ্যাত কবিতা “””” এই কাব্যগ্রন্থের অন্তর্গত।

যৌবন (১৮৭৮-১৯০১).
১৮৭৮ সালে ব্যারিস্টারি পড়ার উদ্দেশ্যে ইংল্যান্ডে যান রবীন্দ্রনাথ। প্রথমে তিনি ব্রাইটনের একটি পাবলিক স্কুলে ভর্তি হয়েছিলেন। ১৮৭৯ সালে ইউনিভার্সিটি কলেজ লন্ডনে আইনবিদ্যা নিয়ে পড়াশোনা শুরু করেন। কিন্তু সাহিত্যচর্চার আকর্ষণে সেই পড়াশোনা তিনি সমাপ্ত করতে পারেননি। ইংল্যান্ডে থাকাকালীন শেকসপিয়র ও অন্যান্য ইংরেজ সাহিত্যিকদের রচনার সঙ্গে রবীন্দ্রনাথের পরিচয় ঘটে। এই সময় তিনি বিশেষ মনোযোগ সহকারে পাঠ করেন “”রিলিজিও মেদিচি””, “”কোরিওলেনাস”” এবং “”অ্যান্টনি অ্যান্ড ক্লিওপেট্রা””। এই সময় তাঁর ইংল্যান্ডবাসের অভিজ্ঞতার কথা “”ভারতী”” পত্রিকায় পত্রাকারে পাঠাতেন রবীন্দ্রনাথ। উক্ত পত্রিকায় এই লেখাগুলি জ্যেষ্ঠভ্রাতা দ্বিজেন্দ্রনাথ ঠাকুরের সমালোচনাসহ প্রকাশিত হত “””” নামে। ১৮৮১ সালে সেই পত্রাবলি “””” নামে গ্রন্থাকারে ছাপা হয়। এটিই ছিল রবীন্দ্রনাথের প্রথম গদ্যগ্রন্থ তথা প্রথম চলিত ভাষায় লেখা গ্রন্থ। অবশেষে ১৮৮০ সালে প্রায় দেড় বছর ইংল্যান্ডে কাটিয়ে কোনো ডিগ্রি না নিয়ে এবং ব্যারিস্টারি পড়া শুরু না করেই তিনি দেশে ফিরে আসেন।১৮৮৩ সালের ৯ ডিসেম্বর (২৪ অগ্রহায়ণ, ১২৯০ বঙ্গাব্দ) ঠাকুরবাড়ির অধস্তন কর্মচারী বেণীমাধব রায়চৌধুরীর কন্যা ভবতারিণীর সঙ্গে রবীন্দ্রনাথের বিবাহ সম্পন্ন হয়। বিবাহিত জীবনে ভবতারিণীর নামকরণ হয়েছিল মৃণালিনী দেবী (১৮৭৩–১৯০২ )। রবীন্দ্রনাথ ও মৃণালিনীর সন্তান ছিলেন পাঁচ জন: মাধুরীলতা (১৮৮৬–১৯১৮), রথীন্দ্রনাথ (১৮৮৮–১৯৬১), রেণুকা (১৮৯১–১৯০৩), মীরা (১৮৯৪–১৯৬৯) এবং শমীন্দ্রনাথ (১৮৯৬–১৯০৭)। এঁদের মধ্যে অতি অল্প বয়সেই রেণুকা ও শমীন্দ্রনাথের মৃত্যু ঘটে।১৮৯১ সাল থেকে পিতার আদেশে নদিয়া (নদিয়ার উক্ত অংশটি অধুনা বাংলাদেশের কুষ্টিয়া জেলা), পাবনা ও রাজশাহী জেলা এবং উড়িষ্যার জমিদারিগুলির তদারকি শুরু করেন রবীন্দ্রনাথ। কুষ্টিয়ার শিলাইদহের কুঠিবাড়িতে রবীন্দ্রনাথ দীর্ঘ সময় অতিবাহিত করেছিলেন। জমিদার রবীন্দ্রনাথ শিলাইদহে “”পদ্মা”” নামে একটি বিলাসবহুল পারিবারিক বজরায় চড়ে প্রজাবর্গের কাছে খাজনা আদায় ও আশীর্বাদ প্রার্থনা করতে যেতেন। গ্রামবাসীরাও তাঁর সম্মানে ভোজসভার আয়োজন করত।১৮৯০ সালে রবীন্দ্রনাথের অপর বিখ্যাত কাব্যগ্রন্থ “””” প্রকাশিত হয়। কুড়ি থেকে ত্রিশ বছর বয়সের মধ্যে তাঁর আরও কয়েকটি উল্লেখযোগ্য কাব্যগ্রন্থ ও গীতিসংকলন প্রকাশিত হয়েছিল। এগুলি হলো “”””, “”””, “”রবিচ্ছায়া””, “””” ইত্যাদি। ১৮৯১ থেকে ১৮৯৫ সাল পর্যন্ত নিজের সম্পাদিত “”সাধনা”” পত্রিকায় রবীন্দ্রনাথের বেশ কিছু উৎকৃষ্ট রচনা প্রকাশিত হয়। তাঁর সাহিত্যজীবনের এই পর্যায়টি তাই “”সাধনা পর্যায়”” নামে পরিচিত। রবীন্দ্রনাথের “”গল্পগুচ্ছ”” গ্রন্থের প্রথম চুরাশিটি গল্পের অর্ধেকই এই পর্যায়ের রচনা। এই ছোটগল্পগুলিতে তিনি বাংলার গ্রামীণ জনজীবনের এক আবেগময় ও শ্লেষাত্মক চিত্র এঁকেছিলেন।

  • Preprocess the csv files using regular expressions (to get rid of punctuation) with the following code. Remember we need to decode to utf-8 first, since we have unicode input files.
 
from glob import glob
import re
words = []
for f in glob('bangla/wiki/*.csv'):
    words += re.sub('[\r\n—?,;।!‘"’\.:\(\)\[\]…0-9]', ' ', open(f, 'rb').read().decode('utf8').strip()).split(' ')
words = list(filter(lambda x: not x in ['', '-'], words))
print(len(words))
# 13964346
words[:25]
#['রবীন্দ্রনাথ',
# 'ঠাকুর',
# 'রবীন্দ্রনাথ',
# 'ঠাকুর',
# '৭ই',
# 'মে',
# '১৮৬১',
# '৭ই',
# 'আগস্ট',
# '১৯৪১',
# '২৫',
# 'বৈশাখ',
# '১২৬৮',
# '২২',
# 'শ্রাবণ',
# '১৩৪৮',
# 'বঙ্গাব্দ',
# 'ছিলেন',
# 'অগ্রণী',
# 'বাঙালি',
# 'কবি',
# 'ঔপন্যাসিক',
# 'সংগীতস্রষ্টা',
# 'নাট্যকার',
# 'চিত্রকর']

  • Create indices for unique words in the dataset.
vocabulary_size = 25000
def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
# Most common words (+UNK) [['UNK', 1961151], ('এবং', 196916), ('ও', 180042), ('হয়', 160533), ('করে', 131206)]
print('Sample data', data[:10])
#Sample data [1733, 1868, 1733, 1868, 5769, 287, 6855, 5769, 400, 2570]
del words  # Hint to reduce memory.
  • Generate batches to be trained with the word2vec skip-gram model.
  • The target label should be at the center of the buffer each time. That is, given a focus word, our goal will be to learn the most probable context words.
  • The input and the target vector will depend on num_skips and skip_window.
 
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])
# data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে', '১৮৬১', '৭ই']
for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % 
          (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    # data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে',  '১৮৬১', '৭ই']
    # with num_skips = 2 and skip_window = 1:
    # batch: ['ঠাকুর', 'ঠাকুর', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর',  '৭ই', '৭ই']
    # labels: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', '৭ই', 'রবীন্দ্রনাথ', 'ঠাকুর', 'মে']
    # with num_skips = 4 and skip_window = 2:
    # batch: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর']
    # labels: ['রবীন্দ্রনাথ', '৭ই', 'ঠাকুর', 'ঠাকুর', 'মে', 'ঠাকুর', 'রবীন্দ্রনাথ', '৭ই']
  • Pick a random validation set to sample nearest neighbors.
  • Limit the validation samples to the words that have a low numeric ID, which by construction are also the most frequent.
  • Look up embeddings for inputs and compute the softmax loss, using a sample of the negative labels each time (this is known as negative sampling, which is used to make the computation efficient, since the number of labels is often too large).
  • The optimizer will optimize the softmax_weights and the embeddings.
    This is because the embeddings are defined as a variable quantity and the optimizer’s `minimize` method will by default modify all variable quantities that contribute to the tensor it is passed.
  • Compute the similarity between minibatch examples and all embeddings.
 
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1      # How many words to consider left and right.
num_skips = 2        # How many times to reuse an input to generate a label.
valid_size = 16      # Random set of words to evaluate similarity on.
valid_window = 100   # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64     # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Variables.
  embeddings = tf.Variable(
      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
      tf.truncated_normal([vocabulary_size, embedding_size],
                          stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Model: look up embeddings for inputs and compute the sampled softmax loss.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  loss = tf.reduce_mean(
      tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                 inputs=embed, labels=train_labels,
                                 num_sampled=num_sampled,
                                 num_classes=vocabulary_size))
  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  # Use the cosine distance to measure similarity between minibatch examples
  # and all embeddings.
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
 
  • Train the word2vec model with the batches constructed, for 100k steps.
 
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, \
                 train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 
      # 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
      # note that this is expensive (~20% slowdown if computed every 
      # 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
  • The following shows how the loss function decreases with the increase in training steps.
  • During the training process, the words that become semantically near come closer in the embedding space.
  • Use a t-SNE plot to map the following words from the 128-dimensional embedding space to a 2-dimensional manifold and visualize them.
 
words = ['রাজা', 'রাণী', 'ভারত','বাংলাদেশ','দিল্লী','কলকাতা','ঢাকা',
         'পুরুষ','নারী','দুঃখ','লেখক','কবি','কবিতা','দেশ',
         'বিদেশ','লাভ','মানুষ', 'এবং', 'ও', 'গান', 'সঙ্গীত', 'বাংলা', 
         'ইংরেজি', 'ভাষা', 'কাজ', 'অনেক', 'জেলার', 'বাংলাদেশের', 
         'এক', 'দুই', 'তিন', 'চার', 'পাঁচ', 'দশ', '১', '৫', '২০', 
         'নবম', 'ভাষার', '১২', 'হিসাবে', 'যদি', 'পান', 'শহরের', 'দল', 
         'যদিও', 'বলেন', 'রান', 'করেছে', 'করে', 'এই', 'করেন', 'তিনি', 
         'একটি', 'থেকে', 'করা', 'সালে', 'এর', 'যেমন', 'সব',  'তার', 
         'খেলা',  'অংশ', 'উপর', 'পরে', 'ফলে',  'ভূমিকা', 'গঠন',  
         'তা', 'দেন', 'জীবন', 'যেখানে', 'খান', 'এতে',  'ঘটে', 'আগে', 
         'ধরনের', 'নেন', 'করতেন', 'তাকে', 'আর', 'যার', 'দেখা', 
         'বছরের', 'উপজেলা', 'থাকেন', 'রাজনৈতিক', 'মূলত', 'এমন', 
         'কিলোমিটার', 'পরিচালনা', '২০১১', 'তারা', 'তিনি', 'যিনি', 'আমি',  
         'তুমি', 'আপনি', 'লেখিকা', 'সুখ', 'বেদনা', 'মাস', 'নীল', 'লাল', 
         'সবুজ', 'সাদা', 'আছে', 'নেই', 'ছুটি', 'ঠাকুর',
         'দান', 'মণি', 'করুণা', 'মাইল', 'হিন্দু', 'মুসলমান','কথা', 'বলা',     
         'সেখানে', 'তখন', 'বাইরে', 'ভিতরে', 'ভগবান' ]
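# The tsne object and the plot() helper are not defined in this excerpt; a
# minimal sketch (assuming scikit-learn; pylab was imported above):
from sklearn.manifold import TSNE
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)

def plot(embeddings, labels):
    # scatter the 2D embeddings and annotate each point with its word
    pylab.figure(figsize=(15, 15))
    for i, label in enumerate(labels):
        x, y = embeddings[i, :]
        pylab.scatter(x, y)
        pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points')
    pylab.show()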
indices = []
for word in words:
    #print(word, dictionary[word])
    indices.append(dictionary[word])
two_d_embeddings = tsne.fit_transform(final_embeddings[indices, :])
plot(two_d_embeddings, words)
  • The following figure shows how words similar in meaning are mapped to embedding vectors that are close to each other.
  • Also, note the arithmetic property of the word embeddings: e.g., the words ‘রাজা’ (king) and ‘রাণী’ (queen) are separated by approximately the same distance and direction as the words ‘লেখক’ (writer) and ‘লেখিকা’ (woman writer), reflecting the fact that the nature of the semantic relatedness in terms of gender is the same.
  • The following animation shows how the embedding is learnt to preserve the semantic similarity in the 2D-manifold more and more as training proceeds.

Generating song-like texts with LSTM from Tagore’s Bangla songs

Text generation with Character LSTM

  • Let’s import the required libraries first.
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop, Adam
import numpy as np
import io, re
  • Read the input file, containing a few selected songs of Tagore in Bangla.
raw_text = open('rabindrasangeet.txt','rb').read().decode('utf8')
print(raw_text[0:1000])

পূজা

অগ্নিবীণা বাজাও তুমি

অগ্নিবীণা বাজাও তুমি কেমন ক’রে !
আকাশ কাঁপে তারার আলোর গানের ঘোরে ।।
তেমনি ক’রে আপন হাতে ছুঁলে আমার বেদনাতে,
নূতন সৃষ্টি জাগল বুঝি জীবন-‘পরে ।।
বাজে ব’লেই বাজাও তুমি সেই গরবে,
ওগো প্রভু, আমার প্রাণে সকল সবে ।
বিষম তোমার বহ্নিঘাতে বারে বারে আমার রাতে
জ্বালিয়ে দিলে নূতন তারা ব্যথায় ভ’রে ।।

অচেনাকে ভয় কী
অচেনাকে ভয় কী আমার ওরে?
অচেনাকেই চিনে চিনে উঠবে জীবন ভরে ।।
জানি জানি আমার চেনা কোনো কালেই ফুরাবে না,
চিহ্নহারা পথে আমায় টানবে অচিন ডোরে ।।
ছিল আমার মা অচেনা, নিল আমায় কোলে ।
সকল প্রেমই অচেনা গো, তাই তো হৃদয় দোলে ।
অচেনা এই ভুবন-মাঝে কত সুরেই হৃদয় বাজে-
অচেনা এই জীবন আমার, বেড়াই তারি ঘোরে ।।অন্তর মম
অন্তর মম বিকশিত করো অন্তরতর হে-
নির্মল করো, উজ্জ্বল করো, সুন্দর করো হে ।।
জাগ্রত করো, উদ্যত করো, নির্ভয় করো হে ।।
মঙ্গল করো, নিরলস নিঃসংশয় করো হে ।।
যুক্ত করো হে সবার সঙ্গে, মুক্ত করো হে বন্ধ ।
সঞ্চার করো সকল কর্মে শান্ত তোমার ছন্দ ।
চরণপদ্মে মম চিত নিস্পন্দিত করো হে ।
নন্দিত করো, নন্দিত করো, নন্দিত করো হে ।।

অন্তরে জাগিছ অন্তর্যামী
অন্তরে জাগিছ অন্তর্যামী ।
  • Here we shall be using a many-to-many RNN as shown in the next figure.
  • Pre-process the text and create character indices to be used as the input in the model.
 
processed_text = raw_text.lower()
print('corpus length:', len(processed_text))
# corpus length: 207117
chars = sorted(list(set(processed_text)))
print('total chars:', len(chars))
# total chars: 89
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
  • Cut the text in semi-redundant sequences of maxlen characters.
 
def is_conjunction(c):
  # Bengali combining signs / diacritics: we should not start a sequence at
  # such a character (approximate code-point ranges)
  h = ord(c) # print(hex(ord(c)))
  return (h >= 0x980 and h <= 0x983) or (h >= 0x9bc and h <= 0x9cd) or (h >= 0x9f2)

maxlen = 40
step = 2
sentences = []
next_chars = []
i = 0
while i < len(processed_text) - maxlen:
  if is_conjunction(processed_text[i]):
    i += 1
    continue
  sentences.append(processed_text[i: i + maxlen])
  next_chars.append(processed_text[i + maxlen])
  i += step

print('nb sequences:', len(sentences))
# nb sequences: 89334
  • Create one-hot-encodings.
 
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i, char_indices[next_chars[i]]] = 1
  • Build a model, a single LSTM.
 
model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars))))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
optimizer = Adam(lr=0.01) #RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
  • The following figure shows what the model architecture looks like:
  • Print the model summary.
 
model.summary()

Model: "sequential" 
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
lstm (LSTM)                  (None, 256)               354304     
_________________________________________________________________ 
dense (Dense)                (None, 128)               32896      
_________________________________________________________________ 
dense_1 (Dense)              (None, 89)                11481      
================================================================= 
Total params: 398,681 
Trainable params: 398,681 
Non-trainable params: 0 
_________________________________________________________________
  • Use the following helper function to sample an index from a probability array.
 
def sample(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)
  • Fit the model and register a callback to print the text generated by the model at the end of each epoch.
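The on_epoch_end helper registered below is not defined in this excerpt; a minimal sketch in the spirit of the standard keras char-RNN example (the seed selection and the temperature values here are assumptions):

import random

def on_epoch_end(epoch, _):
    # generate sample text from a random seed at the end of each epoch
    start_index = random.randint(0, len(processed_text) - maxlen - 1)
    for temperature in [0.2, 0.5, 1.0]:
        sentence = processed_text[start_index: start_index + maxlen]
        generated = sentence
        for _ in range(100):
            x_pred = np.zeros((1, maxlen, len(chars)))
            for t, char in enumerate(sentence):
                x_pred[0, t, char_indices[char]] = 1.
            preds = model.predict(x_pred, verbose=0)[0]
            next_char = indices_char[sample(preds, temperature)]
            generated += next_char
            sentence = sentence[1:] + next_char
        print('--- temperature:', temperature)
        print(generated)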
 
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])
  • The following animation shows how the model generates song-like texts with given seed texts, for different values of the temperature parameter.

Text Generation with Word LSTM

  • Pre-process the input text, split by punctuation characters and create word indices to be used as the input in the model.
processed_text = raw_text.lower()
from string import punctuation
r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
words = r.split(processed_text)
print(len(words))
# 39481
words[:16]
# ['পূজা',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'কেমন',
# 'ক’রে',
# 'আকাশ',
# 'কাঁপে',
# 'তারার',
# 'আলোর',
# 'গানের',
# 'ঘোরে',
# '।।']

unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
index_unique_word = dict((i, c) for i, c in enumerate(unique_words))
  • Create a word-window of length 5 to predict the next word.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[1])
# ['অগ্নিবীণা', 'বাজাও', 'তুমি', 'অগ্নিবীণা', 'বাজাও']
print(next_words[1])
# তুমি
print(len(unique_words))
# 7847
  • Create one-hot encodings for the input and output words, as done for the character-RNN (a sketch is given below), and fit the model on the preprocessed data.
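A sketch of the one-hot encoding for the word model, mirroring the character-level version above (X and Y feed the fit call below):

X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, prev in enumerate(prev_words):
    for j, w in enumerate(prev):
        X[i, j, unique_word_index[w]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1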
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(X, Y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])
  • The following animation shows the song-like text generated by the word-LSTM at the end of an epoch.

Bangla Sentiment Analysis using LSTM with Daily Astrological Prediction Dataset

  • Let’s first create a sentiment analysis dataset by crawling the daily astrological predictions (রাশিফল) page of the online edition of আনন্দবাজার পত্রিকা (Anandabazar Patrika, a leading Bangla newspaper), e.g., for the year 2013, and then manually labeling the sentiment of each of the predictions corresponding to each moon-sign.
  • Read the csv dataset; the first few lines look like the following.
 
import pandas as pd

df = pd.read_csv('horo_2013_labeled.csv')
pd.set_option('display.max_colwidth', 135) 
df.head(20)
  • Transform each text into a sequence of integers.
 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=2000, split=' ')
tokenizer.fit_on_texts(df['আপনার আজকের দিনটি'].values)
X = tokenizer.texts_to_sequences(df['আপনার আজকের দিনটি'].values)
X = pad_sequences(X)
X
#array([[   0,    0,    0, ...,   26,  375,    3],        
#       [   0,    0,    0, ...,   54,    8,    1],        
#       [   0,    0,    0, ...,  108,   42,   43],        
#       ...,        
#       [   0,    0,    0, ..., 1336,  302,   82],        
#       [   0,    0,    0, ..., 1337,  489,  218],        
#       [   0,    0,    0, ...,    2,  316,   87]])
  • Here we shall use a many-to-one RNN for sentiment analysis as shown below.
  • Build an LSTM model that takes a sentence as input and outputs the sentiment label.
from tensorflow.keras.layers import Embedding, SpatialDropout1D

model = Sequential()
model.add(Embedding(2000, 128, input_length = X.shape[1]))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_10 (Embedding)     (None, 12, 128)           256000    
_________________________________________________________________
spatial_dropout1d_10 (Spatia (None, 12, 128)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 258       
=================================================================
Total params: 387,842
Trainable params: 387,842
Non-trainable params: 0
_________________________________________________________________
None
  • Divide the dataset into train and validation (test) dataset and train the LSTM model on the dataset.
 
from sklearn.model_selection import train_test_split

Y = pd.get_dummies(df['sentiment']).values
X_train, X_test, Y_train, Y_test, _, indices = train_test_split(X, Y, np.arange(len(X)), test_size = 0.33, random_state = 5)
model.fit(X_train, Y_train, epochs = 5, batch_size=32, verbose = 2)

#Epoch 1/5  - 3s - loss: 0.6748 - acc: 0.5522 
#Epoch 2/5  - 1s - loss: 0.5358 - acc: 0.7925 
#Epoch 3/5  - 1s - loss: 0.2368 - acc: 0.9418 
#Epoch 4/5  - 1s - loss: 0.1011 - acc: 0.9761 
#Epoch 5/5  - 1s - loss: 0.0578 - acc: 0.9836 
  • Predict the sentiment labels of the (held out) test dataset.
 
result = model.predict(X[indices],batch_size=1,verbose = 2)
df1 = df.iloc[indices]
df1['neg_prob'] = result[:,0]
df1['pos_prob'] = result[:,1]
df1['pred'] = np.array(['negative', 'positive'])[np.argmax(result, axis=1)]
df1.head()
  • Finally, compute the accuracy of the model for the positive and negative ground-truth sentiment corresponding to daily astrological predictions.
 
df2 = df1[df1.sentiment == 'positive']
print('positive accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#positive accuracy:0.9177215189873418
df2 = df1[df1.sentiment == 'negative']
print('negative accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#negative accuracy:0.9352941176470588

Building a very simple Bangla Chatbot with RASA NLU

  • The following figure shows how to design a very simple Bangla chatbot to order food from restaurants using RASA NLU.
  • We need to design the intents, entities and slots to extract the entities properly and then design stories to define how the chatbot will respond to user inputs (core / dialog).
  • The following figure shows how the nlu, domain and stories files are written for the simple chatbot (an illustrative fragment is given after the training snippet below).
  • A sequence-to-sequence deep learning model is trained under the hood for intent classification. The next code block shows how the model can be trained.
import rasa
model_path = rasa.train('domain.yml', 'config.yml', ['data/'], 'models/')
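The figures are not reproduced here; purely as an illustration (not the actual files from the post), a Rasa 1.x-style nlu / stories fragment for a food-ordering intent might look like:

## intent:order_food            (in nlu.md)
- আমি একটা পিৎজা অর্ডার করতে চাই
- বিরিয়ানি পাওয়া যাবে কি?

## order food happy path        (in stories.md)
* order_food
  - utter_confirm_order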
  • The following gif demonstrates how the chatbot responds to user inputs.

Travelling Salesman Problem (TSP) with Python

In this blog we shall discuss the Travelling Salesman Problem (TSP), a very famous NP-hard problem, and will take a few attempts to solve it (either by considering special cases such as Bitonic TSP and solving them efficiently, by using algorithms to improve the runtime, e.g., dynamic programming, or by using approximation algorithms, e.g., for Metric TSP, and heuristics to obtain not necessarily optimal but good enough solutions, e.g., with Simulated Annealing and Genetic Algorithms), and work on the corresponding python implementations. A few of the problems discussed here appeared as programming assignments in the Coursera course Advanced Algorithms and Complexity, and some of the problem statements are taken from the course.

Improving the runtime of the Travelling Salesman Problem with Dynamic Programming

In this problem we shall deal with a classical NP-complete problem called the Traveling Salesman Problem. Given a graph with weighted edges, we need to find the shortest cycle visiting each vertex exactly once. Vertices correspond to cities. Edge weights correspond to the cost (e.g., time) to get from one vertex to another. Some vertices may not be connected by an edge in the general case.

  • Here we shall use dynamic programming to solve TSP: instead of solving one problem we will solve a collection of (overlapping) subproblems.
  • A subproblem refers to a partial solution
  • A reasonable partial solution in case of TSP is the initial part of a cycle
  • To continue building a cycle, we need to know the last vertex as well as the set of already visited vertices
  • It will be convenient to assume that vertices are integers from 1 to n and that the salesman starts his trip in (and also returns back to) vertex 1.
  • The following figure shows the Dynamic programming subproblems, the recurrence relation and the algorithm for TSP with DP.
Image for post
Image for post

Implementation tips

  • In order to iterate through all subsets of {0, . . . , n − 1}, it is helpful to notice that there is a natural one-to-one correspondence between integers in the range from 0 to 2^n − 1 and subsets of {0, . . . , n − 1}: k ↔ {i : the i-th bit of k is 1}. (These bit tricks are illustrated in the short snippet after this list.)
  • For example, k = 1 (binary 001) corresponds to the set {0}, whereas k = 5 (binary 101) corresponds to the set {0, 2}.
  • In order to find the integer corresponding to S − {j} (for j ∈ S), we need to flip the j-th bit of k (from 1 to 0). For this, in turn, we can compute a bitwise XOR of k and 2^j (which has 1 only in the j-th position).
  • In order to compute the optimal path along with the cost, we need to maintain back-pointers to store the path.
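
The correspondence and the bit-flip can be checked in a couple of lines (a standalone demo; the helper subset_of exists only for this snippet):

# bit i of k is 1 iff i belongs to the encoded subset
def subset_of(k):
    return {i for i in range(k.bit_length()) if (k >> i) & 1}

print(subset_of(1))  # {0} (binary 001)
print(subset_of(5))  # {0, 2} (binary 101)
k, j = 5, 2          # S = {0, 2}; remove j = 2 by flipping the j-th bit with XOR
print(subset_of(k ^ (1 << j)))  # {0}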

The following python code shows an implementation of the above algorithm.

import numpy as np
from itertools import combinations

def TSP(G):
   n = len(G)
   C = [[np.inf for _ in range(n)] for __ in range(1 << n)]
   C[1][0] = 0 # {0} <-> 1
   for size in range(1, n):
      for S in combinations(range(1, n), size):
         S = (0,) + S
         k = sum([1 << i for i in S])
         for i in S:
            if i == 0: continue
            for j in S:
               if j == i: continue
               cur_index = k ^ (1 << i) # the integer encoding S - {i}
               C[k][i] = min(C[k][i], C[cur_index][j] + G[j][i]) # C[S - {i}][j]
   all_index = (1 << n) - 1
   return min([(C[all_index][i] + G[0][i], i) \
               for i in range(n)])
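
As a quick sanity check (a hypothetical symmetric 4-city cost matrix; the optimum is easy to verify by hand), the function above can be invoked as follows:

G = [[0, 1, 9, 4],
     [1, 0, 2, 7],
     [9, 2, 0, 3],
     [4, 7, 3, 0]]
print(TSP(G))  # (10, 1): optimal cycle cost 10 (e.g., the tour 0-1-2-3-0)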
  • The following animation shows how the least-cost solution cycle is computed with the DP for a graph with 4 vertices. In order to represent C(S, i) from the algorithm, the vertices belonging to the set S are colored with red circles, and the vertex i at which the path traversing all the nodes in S ends is marked with a red double circle.
  • The next animation also shows how the DP table gets updated. The DP table for a graph with 4 nodes will be of size 2⁴ × 4, since there are 2⁴ = 16 subsets of the vertex set V = {0, 1, 2, 3} and a path going through a subset of the vertices in V may end at any of the 4 vertices.
  • The transposed DP table is shown in the next animation; here the columns correspond to the subsets of the vertices and the rows correspond to the vertex at which the TSP path ends.
Image for post
Image by Author
  • The following animation shows how the least cost solution cycle is computed with the DP for a graph with 5 nodes.
Image for post
Image by Author

The following animation / figure shows the TSP optimal path is computed for increasing number of nodes (where the weights for the input graphs are randomly generated) and the exponential increase in the time taken.

Image for post
Image by Author
Image for post
Image by Author

Solving TSP with Integer Linear Program

Image for post

Solving the above ILP with the mip package, using the following python code, produces the output shown in the animation below for a graph with randomly generated edge weights.


from itertools import product
from time import time
from mip import Model, xsum, minimize, BINARY

def TSP_ILP(G):

   start = time()
   V1 = range(len(G))
   n, V = len(G), set(V1)
   model = Model()
   # binary variables indicating if arc (i,j) is used
   # on the route or not
   x = [[model.add_var(var_type=BINARY) for j in V] for i in V]
   # continuous variables to prevent subtours: each city will have a
   # different sequential id in the planned route except the 1st one
   y = [model.add_var() for i in V]
   # objective function: minimize the distance
   model.objective = minimize(xsum(G[i][j]*x[i][j] \
                              for i in V for j in V))

   # constraint: leave each city only once
   for i in V:
      model += xsum(x[i][j] for j in V - {i}) == 1
   # constraint: enter each city only once
   for i in V:
      model += xsum(x[j][i] for j in V - {i}) == 1
   # subtour elimination
   for (i, j) in product(V - {0}, V - {0}):
      if i != j:
         model += y[i] - (n+1)*x[i][j] >= y[j] - n
   # optimizing
   model.optimize()
   # checking if a solution was found
   cycle = []
   if model.num_solutions:
      print('Total distance {}'.format(model.objective_value))
      nc = 0 # cycle starts from vertex 0
      cycle = [nc]
      while True:
         nc = [i for i in V if x[nc][i].x >= 0.99][0]
         cycle.append(nc)
         if nc == 0:
            break

   return (model.objective_value, cycle)
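
A minimal invocation of the above function (assuming the mip package is installed; the random cost matrix here is just for illustration) might look like:

import numpy as np
np.random.seed(1)
n = 6
G = np.random.randint(1, 10, (n, n)).tolist()
for i in range(n): G[i][i] = 0  # no cost for staying at a city
best_cost, best_cycle = TSP_ILP(G)
print(best_cost, best_cycle)  # the optimal cost and the cycle starting and ending at vertex 0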

The constraint preventing subtours from appearing in the solution is necessary: if we run the model without it, we get a solution with subtours instead of a single cycle going through all the nodes, as shown below:

Image for post
Image by Author
Image for post
Image by Author

Comparing with the Dynamic-programming-based solution, we can see that the ILP is much more efficient for higher values of n.

Image for post
Image by Author
Image for post
Image by Author

Bitonic TSP

Image for post
Image by Author
Image for post
Image by Author

The following python code snippet implements the above DP algorithm.


import numpy as np

def dist(P, i, j):
   return np.sqrt((P[i][0]-P[j][0])**2 + (P[i][1]-P[j][1])**2)

def BTSP(P):
   n = len(P)
   D = np.ones((n,n))*np.inf
   path = np.ones((n,n), dtype=int)*(-1)
   D[n-2,n-1] = dist(P, n-2, n-1)
   path[n-2,n-1] = n-1
   for i in range(n-3,-1,-1):
      m = np.inf
      for k in range(i+2,n):
         if m > D[i+1,k] + dist(P,i,k):
            m, mk = D[i+1,k] + dist(P,i,k), k
      D[i,i+1] = m
      path[i,i+1] = mk
      for j in range(i+2,n):
         D[i,j] = D[i+1,j] + dist(P,i,i+1)
         path[i,j] = i+1
   D[0,0] = D[0,1] + dist(P,0,1)
   path[0,0] = 1
   return D, path

def get_tsp_path(path, i, j, n):
    if n < 0:
       return []
    if i <= j:
       k = path[i,j]
       return [k] + get_tsp_path(path, k, j, n-1)
    else:
       k = path[j,i]
       return get_tsp_path(path, i, k, n-1) + [k]
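
For instance (a handful of sample points, sorted by x-coordinate as the DP requires; the optimal bitonic tour cost ends up in D[0, 0]):

P = sorted([(0, 0), (1, 6), (2, 3), (5, 2), (6, 5), (8, 1)])
D, path = BTSP(P)
print(D[0, 0])  # cost of the optimal bitonic tour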

The following animation shows how the DP table is computed and the optimal path for Bitonic TSP is constructed. It also shows the final optimal path.

Image for post
Image by Author
Image for post
Image by Author

2-OPT Approximation Algorithm for Metric TSP

Image for post

The next code snippet implements the above 2-OPT approximation algorithm.


import numpy as np
import queue

def dfs(adj, x):
   visited = [False]*len(adj)
   stack = [x]
   visited[x] = True
   path = []
   while len(stack) > 0:
      u = stack.pop(-1)
      path.append(u)
      for v in adj[u]:
         if not visited[v]:
            stack.append(v)
            visited[v] = True
   return path

def mst(adj):
   inf = np.inf
   n = len(adj)
   c = [inf]*n
   s = 0
   c[s] = 0
   visited = [False]*n
   parent = [None]*n
   h = queue.PriorityQueue()
   for v in range(n):
      h.put((c[v], v))
   edges = []
   while not h.empty():
      w, u = h.get()
      if visited[u]: continue
      visited[u] = True
      if parent[u] != None:
         edges.append((parent[u], u))
      for v in range(n):
         if v == u: continue
         if (not visited[v]) and (c[v] > adj[u][v]):
            c[v] = adj[u][v]
            parent[v] = u
            h.put((c[v], v))
   adj = [[] for _ in range(n)]
   for i in range(n):
      if parent[i] != None:
         adj[parent[i]].append(i)
   path = dfs(adj, 0)
   path += [path[0]]
   return path
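
A small usage sketch (with a hypothetical set of random points in the unit square, turned into a complete graph with Euclidean edge weights):

import numpy as np
np.random.seed(0)
pts = np.random.rand(8, 2)  # 8 random points in the unit square
n = len(pts)
G = [[float(np.hypot(*(pts[i] - pts[j]))) for j in range(n)] for i in range(n)]
print(mst(G))  # preorder walk of the MST plus the return edge: a 2-approximate metric-TSP tour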

The following animation shows the TSP path computed with the above approximation algorithm and compares it with the OPT path computed using ILP for 20 points on the 2D plane. The MST is computed with Prim’s algorithm.

Image for post
Image by Author

TSP with Simulated Annealing

Image for post
Image for post

The following python code snippet shows how to implement Simulated Annealing to solve TSP; here G represents the adjacency matrix of the input graph.


import numpy as np

def TSP_SA(G):
   s = list(range(len(G)))
   c = cost(G, s)
   ntrial = 1
   T = 30
   alpha = 0.99
   while ntrial <= 1000:
      n = np.random.randint(0, len(G))
      while True:
         m = np.random.randint(0, len(G))
         if n != m:
            break
      s1 = swap(s, m, n)
      c1 = cost(G, s1)
      if c1 < c:
         s, c = s1, c1
      else:
         if np.random.rand() < np.exp(-(c1 - c)/T):
            s, c = s1, c1
      T = alpha*T
      ntrial += 1
   return s, c # return the final tour along with its cost

def swap(s, m, n):
   i, j = min(m, n), max(m, n)
   s1 = s.copy()
   while i < j:
      s1[i], s1[j] = s1[j], s1[i]
      i += 1
      j -= 1
   return s1

def cost(G, s):
   l = 0
   for i in range(len(s)-1):
      l += G[s[i]][s[i+1]]
   l += G[s[len(s)-1]][s[0]]
   return l
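
With the small modification above that returns the final tour and its cost, the function can be exercised as follows (with a hypothetical symmetric random cost matrix):

import numpy as np
np.random.seed(42)
n = 10
A = np.random.randint(1, 20, (n, n))
G = ((A + A.T) // 2).tolist()  # make the cost matrix symmetric
tour, tour_cost = TSP_SA(G)
print(tour_cost, tour)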

The following animations show how the algorithm works:

Image for post
Image by Author
Image for post
Image by Author

The following animation shows the TSP path computed with SA for 100 points in 2D.

Image for post
Image by Author

TSP with Genetic Algorithm

Image for post

The following implementation of the above algorithm makes the following assumptions:

  • We shall assume the crossover rate is 1.0, i.e., all individuals in a population participate in crossover. The mutation probability to be used is 0.1.
  • With each crossover operation between two parent chromosomes, a couple of children are generated. We can't simply swap portions of the parents' chromosomes; we need to be careful to ensure that each offspring represents a valid TSP path.
  • Mutation is similar to swap operation implemented earlier.
  • For each generation we shall keep a constant k=20 (or 30) chromosomes (representing candidate solutions for TSP).
  • The fitness function will be the cost of the TSP path represented by each chromosome. Hence, we want to minimize the value of the fitness function, i.e., the lower the value of a chromosome, the fitter it is to survive.
  • We shall use rank selection, i.e., after crossover and mutation, only the top k fittest offspring (i.e., with least fitness function value) will survive for the next generation.
  • The following python code shows the implementation of the above algorithm with the above assumptions.

import numpy as np

def do_crossover(s1, s2, m):
   s1, s2 = s1.copy(), s2.copy()
   c1 = s2.copy()
   for i in range(m, len(s1)): c1.remove(s1[i])
   for i in range(m, len(s1)): c1.append(s1[i])
   c2 = s1.copy()
   for i in range(m, len(s2)): c2.remove(s2[i])
   for i in range(m, len(s2)): c2.append(s2[i])
   return (c1, c2)

def do_mutation(s, m, n):
   i, j = min(m, n), max(m, n)
   s1 = s.copy()
   while i < j:
      s1[i], s1[j] = s1[j], s1[i]
      i += 1
      j -= 1
   return s1

def compute_fitness(G, s):
   l = 0
   for i in range(len(s)-1):
      l += G[s[i]][s[i+1]]
   l += G[s[len(s)-1]][s[0]]
   return l

def get_elite(G, gen, k):
   gen = sorted(gen, key=lambda s: compute_fitness(G, s))
   return gen[:k]

def TSP_GA(G, k=20, ntrial=200):
   n_p = k
   mutation_prob = 0.1
   gen = []
   path = list(range(len(G)))
   while len(gen) < n_p:
      path1 = path.copy()
      np.random.shuffle(path1)
      if not path1 in gen:
         gen.append(path1)

   for trial in range(ntrial):
      gen = get_elite(G, gen, k)
      gen_costs = [(round(compute_fitness(G, s), 3), s) \
                    for s in gen]
      next_gen = []
      for i in range(len(gen)):
         for j in range(i+1, len(gen)):
            c1, c2 = do_crossover(gen[i], gen[j], \
                     np.random.randint(0, len(gen[i])))
            next_gen.append(c1)
            next_gen.append(c2)
         if np.random.rand() < mutation_prob:
            m = np.random.randint(0, len(gen[i]))
            while True:
               n = np.random.randint(0, len(gen[i]))
               if m != n:
                  break
            c = do_mutation(gen[i], m, n)
            next_gen.append(c)
      gen = next_gen

   gen = get_elite(G, gen, k) # keep only the k fittest chromosomes
   return compute_fitness(G, gen[0]), gen[0] # the best cost and the best tour
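
With the final two lines added above (returning the fittest chromosome), the GA can be run as follows (again with a hypothetical symmetric random cost matrix):

import numpy as np
np.random.seed(7)
n = 10
A = np.random.randint(1, 20, (n, n))
G = ((A + A.T) // 2).tolist()
best_cost, best_tour = TSP_GA(G, k=20, ntrial=100)
print(best_cost, best_tour)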
Image for post
Image by Author

The following animation shows the TSP path computed with GA for 100 points in 2D.

Image for post

Coping with a few NP-Hard Problems with Python

Solving classic NP-hard problems such as 3-Coloring and Hamiltonian path with SAT solvers

In this blog we shall continue our discussion on a few NP-complete / NP-hard problems and will attempt to solve them (e.g., by encoding a problem as an instance of satisfiability and solving it with a SAT-solver), along with the corresponding python implementations. The problems discussed here appeared as programming assignments in the coursera course Advanced Algorithms and Complexity. The problem statements are taken from the course itself.

Coloring a Graph with 3 colors using a SAT solver

Given a graph, we need to color its vertices with 3 different colors, so that any two vertices connected by an edge are of different colors. Graph coloring is an NP-complete problem, so we don't currently know an efficient solution to it; instead, we reduce it to an instance of the SAT problem which, although also NP-complete, can often be solved efficiently in practice using special programs called SAT-solvers.

We can reduce the real-world problem of assigning frequencies to the transmitting towers of the cells in a GSM network to the problem of properly coloring a graph with 3 colors. Colors correspond to frequencies, vertices correspond to cells, and edges connect neighboring cells, as shown below.

Image for post
Image taken from here
Image for post
  • We need to output a boolean formula in the conjunctive normal form (CNF) in a specific format. If it is possible to color the vertices of the input graph in 3 colors such that any two vertices connected by an edge are of different colors, the formula must be satisfiable. Otherwise, the formula must be unsatisfiable.
  • We shall use pySAT SAT-solver to solve the clauses generated from the graph.
  • In particular, if there are n nodes and m edges in the graph that is to be colored with 3 colors, we need to generate 3*n variables and 3*m + n clauses.
  • The i-th vertex will correspond to 3 variables, with id i, n+i and 2*n+i. They will represent whether the node is to be colored by red, green or blue.
  • Any two vertices forming an edge must not have the same color.
  • Each vertex must be colored by one from the 3 colors.

The following python code snippet shows how the encoding needs to be implemented to output the desired clauses for the CNF, which is subsequently solved by the SAT-solver; the solution assignments are then used to color the vertices of the graph. Again, there are the following 3 basic steps:

  • Encode the input graph into a SAT formula (reduce the 3-coloring problem to a satisfiability problem).
  • Solve with a SAT-solver and obtain the solution in terms of variable assignments.
  • Decode the solution: use the solution assignments from the SAT-solver to color the vertices of the graph.

import numpy as np
from pysat.solvers import Glucose3

def get_colors(assignments):
    all_colors = np.array(['red', 'green', 'blue'])
    colors = {}
    for v in range(n):
        colors[v+1] = all_colors[[assignments[v] > 0, \
              assignments[n+v] > 0, assignments[2*n+v] > 0]][0]
    return colors

def print_clauses(clauses):
    for c in clauses:
        vars = []
        for v in c:
            vars.append('{}x{}'.format('¬' if v < 0 else '', abs(v)))
        print('(' + ' OR '.join(vars) + ')')

def print_SAT_solution(assignments):
    sol = ''
    for x in assignments:
        sol += 'x{}={} '.format(abs(x), x > 0)
    print(sol)

def solve_graph_coloring():
    # encode the input graph into SAT formula
    n_clauses = 3*m + n
    n_vars = 3*n
    clauses = []
    for u, v in edges:
        clauses.append((-u, -v))             # corresponding to red color
        clauses.append((-(n+u), -(n+v)))     # corresponding to green
        clauses.append((-(2*n+u), -(2*n+v))) # corresponding to blue
    for v in range(1, n+1):
        clauses.append((v, n+v, 2*n+v))      # at least one color
    print_clauses(clauses)
    # solve SAT and obtain solution in terms of variable assignments
    g = Glucose3()
    for c in clauses:
        g.add_clause(c)
    status = g.solve()
    assignments = g.get_model()
    print(status)
    print_SAT_solution(assignments)
    # use the solution assignment by SAT to color the graph
    colors = get_colors(assignments)
    print(colors)
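
For example (a hypothetical 4-vertex graph, a triangle with one pendant vertex; the snippet assumes the globals n, m and edges used by solve_graph_coloring above):

n, m = 4, 4
edges = [(1, 2), (2, 3), (1, 3), (1, 4)]
solve_graph_coloring()  # prints the clauses, the SAT assignment and a valid 3-coloring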

The following animations show the input graphs, the corresponding variables and clauses generated for the CNF to be solved with the SAT-solver, the solution obtained in terms of truth assignments of the variables, and then how these are used to solve the graph-coloring problem, i.e., to color the graph with 3 colors. Each row in the right subfigure represents a clause in the CNF.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

The last graph is the Petersen graph, which has chromatic number 3 and hence can be colored with 3 colors; a 3-coloring solution is obtained from the SAT-solver's output assignments.

Now, the following example shows an input graph which is NOT 3-colorable: the CNF formed by the generated clauses is unsatisfiable / inconsistent, as reported by the SAT-solver.

Image for post
Image by Author

Solving the Hamiltonian Path problem with SAT-solver

In this problem, we shall learn how to solve the classic Hamiltonian Path problem, by designing and implementing an efficient algorithm to reduce it to SAT.

Image for post

The following python snippet shows:

  1. how the Hamiltonian Path problem is reduced to SAT,
  2. how it is then solved by the pysat SAT-solver, and
  3. how the solution is interpreted to construct the Hamiltonian path, for the following input graphs.


def get_hamiltonian_path(assignments):
    path = [None]*n
    for i in range(n):
        for j in range(n):
            if assignments[i*n+j] > 0: # True
                path[i] = j+1
    return path

def reduce_Hamiltonian_Path_to_SAT_and_solve(edges):

    def index(i, j): # variable: position i on the path holds vertex j
        return n*i + j + 1

    m = len(edges)
    n_clauses = 2*n + (2*n*n-n-m)*(n-1)
    n_vars = n*n
    clauses = []

    # each vertex appears somewhere on the path
    for j in range(n):
        clause = []
        for i in range(n):
            clause.append(index(i,j))
        clauses.append(clause)
    # each position on the path is occupied by some vertex
    for i in range(n):
        clause = []
        for j in range(n):
            clause.append(index(i,j))
        clauses.append(clause)
    # no vertex appears at two different positions
    for j in range(n):
        for i in range(n):
            for k in range(i+1, n):
                clauses.append((-index(i,j), -index(k,j)))
    # no position holds two different vertices
    for i in range(n):
        for j in range(n):
            for k in range(j+1, n):
                clauses.append((-index(i,j), -index(i,k)))
    # consecutive positions must be connected by an edge
    for k in range(n-1):
        for i in range(n):
            for j in range(n):
                if i == j: continue
                if not [i+1, j+1] in edges:
                    clauses.append((-index(k,i), -index(k+1,j)))
    print_clauses(clauses)
    g = Glucose3()
    for c in clauses:
        g.add_clause(c)
    status = g.solve()
    assignments = g.get_model()
    print(status)
    print_SAT_solution(assignments)
    path = get_hamiltonian_path(assignments)
    print(path)
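
For example (a hypothetical path graph on 4 vertices, listing both orientations of each undirected edge since the membership test above is directional; the global n and the helpers from the previous section are assumed to be in scope):

n = 4
edges = [[1, 2], [2, 1], [2, 3], [3, 2], [3, 4], [4, 3]]
reduce_Hamiltonian_Path_to_SAT_and_solve(edges)  # prints the clauses, the SAT solution and a path such as [1, 2, 3, 4]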

The following figure shows the output Hamiltonian Path obtained for the line input graph using the solution obtained by SAT-solver.

Image for post
Image by Author

The following figure shows the Hamiltonian Path obtained with the SAT-solver for the input Petersen graph, which indeed has a Hamiltonian Path.

Image for post
Image by Author

To be continued…

In medium: https://medium.com/swlh/coping-with-a-few-np-complete-problems-with-python-93602379a231

Graph Algorithms with Python

In this blog we shall discuss a few popular graph algorithms and their python implementations. The problems discussed here appeared as programming assignments in the coursera course Algorithms on Graphs and on Rosalind. The problem statements are taken from the course itself.

The basic building blocks of graph algorithms, such as computing the number of connected components, checking whether there is a path between two given vertices, and checking whether there is a cycle, are used in many practical applications that work with graphs: for example, finding shortest paths on maps, analyzing social networks, and analyzing biological data.

For all the problems it can be assumed that the given input graph is simple, i.e., it does not contain self-loops (edges going from a vertex to itself) and parallel edges.

Checking if two vertices are Reachable

Given an undirected graph G=(V,E) and two distinct vertices 𝑢 and 𝑣, check if there is a path between 𝑢 and 𝑣.

Steps

  1. Starting from the node u, we can simply use breadth first search (bfs) or depth-first search (dfs) to explore the nodes reachable from u.
  2. As soon as we find v, we can return that the nodes are reachable from one another.
  3. If v is not there in the nodes explored, we can conclude v is not reachable from u.
  4. The following implementation (demonstrated using the following animations) uses iterative dfs (with stack) to check if two nodes (initially colored pink) are reachable.
  5. We can optionally implement coloring of the nodes w.r.t. the following convention: initially the nodes are all white, when they are visited (pushed onto stack) they are marked as gray and finally when all the adjacent (children) nodes for a given node are visited, the node can be marked black.
  6. We can store the parent of each node in an array and we can extract the path between two given nodes using the parent array, if they are reachable.
Image for post
Image taken from this lecture notes

A very basic python implementation of the iterative dfs is shown below (here adj represents the adjacency list representation of the input graph):

def reach(adj, x, y): 
 visited = [False]*len(adj)
 stack = [x]
 visited[x] = True
 while len(stack) > 0:
  u = stack.pop(-1)
  for v in adj[u]:
   if not visited[v]:
    stack.append(v)
    visited[v] = True
    if v == y:
     return 1
 return 0
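
For example (a small 5-vertex graph with two components):

adj = [[1], [0, 2], [1], [4], [3]]  # components {0, 1, 2} and {3, 4}
print(reach(adj, 0, 2))  # 1 (reachable)
print(reach(adj, 0, 4))  # 0 (different components)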

The following animations demonstrate how the algorithm works; the stack is also shown at different points in time during the execution. Finally, the path between the nodes is shown if they are reachable.

Image for post
Image by Author
Image for post
Image by Author

The same algorithm can be used for finding an Exit from a Maze (check whether there is a path from a given cell to a given exit).

Find Connected Components in an Undirected Graph

Given an undirected graph with 𝑛 vertices and 𝑚 edges, compute the number of connected components in it.

Steps

  1. The following simple modification in dfs can be used to find the number of connected components in an undirected graph, as shown in the following figure.
  2. From each node we need to find all the nodes yet to be explored.
  3. We can find the nodes in a given component by finding all the nodes reachable from a given node.
  4. The same iterative dfs implementation was used (demonstrated with the animations below).
  5. The nodes in a component found were colored using the same color.
Image for post
Image taken from this lecture note

Python Code

def number_of_components(adj):
 result = 0
 n = len(adj)
 visited = [False]*len(adj)
 for x in range(n):
  if not visited[x]:
   result += 1
   stack = [x]
   visited[x] = True
   while len(stack) > 0:
    u = stack.pop(-1)
    for v in adj[u]:
     if not visited[v]:
      stack.append(v) 
      visited[v] = True
 return result
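
For example:

adj = [[1], [0, 2], [1], [4], [3], []]  # components {0, 1, 2}, {3, 4} and the isolated vertex {5}
print(number_of_components(adj))  # 3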
Image for post
Image by Author
Image for post
Image by Author

The same algorithm can be used to decide that there are no dead zones in a maze, that is, that at least one exit is reachable from each cell.

Find Euler Tour / Circuit in a Directed Graph

Given a directed graph that contains an Eulerian tour, where the graph is given in the form of an adjacency list, find the tour.

Steps

  1. While solving the famous Königsberg Bridge Problem, Euler proved that an Euler circuit exists in a connected undirected graph iff all its nodes have even degree.
  2. For an Euler tour to exist in an undirected graph, if there are odd-degree nodes in the graph, there must be exactly 2 of them; the tour will start from one such node and end at the other.
  3. The tour must visit all the edges in the graph exactly once.
  4. The degree test can be extended to directed graphs (with in-degrees and out-degrees) and can be used to determine the existence of an Euler tour / circuit in a Digraph.
  5. Fleury's algorithm can be used to iteratively remove edges from the graph (choosing, wherever possible, an edge that is not a bridge) and add them to the tour.
  6. DFS can be modified to obtain the Euler tour (circuit) in a DiGraph.
  7. The following figure shows these couple of algorithms for finding the Euler tour / circuit in a graph if one exists (note in the figure path is meant to indicate tour). We shall use DFS to find Euler tour.
Image for post
Image created from Youtube Videos cited inside the image

The following code snippets represent the functions used to find Euler tour in a graph. Here, variables n and m represent the number of vertices and edges of the input DiGraph, respectively, whereas adj represents the corresponding adjacency list.

Python Code

def count_in_out_degrees(adj):
 n = len(adj)
 in_deg, out_deg = [0]*n, [0]*n
 for u in range(n):
  for v in adj[u]:
   out_deg[u] += 1
   in_deg[v] += 1
 return in_deg, out_deg

def get_start_if_Euler_tour_present(in_deg, out_deg):
 start, end, tour = None, None, True
 for i in range(len(in_deg)): 
  d = out_deg[i] - in_deg[i]
  if abs(d) > 1:
   tour = False
   break
  elif d == 1:
   start = i
  elif d == -1:
   end = i
 tour = (start != None and end != None) or \
        (start == None and end == None)
 if tour and start == None: # a circuit
  start = 0
 return (tour, start)

def dfs(adj, v, out_deg, tour):
 while out_deg[v] > 0:
  out_deg[v] -= 1
  dfs(adj, adj[v][out_deg[v]], out_deg, tour)
 tour[:] = [v] + tour

def compute_Euler_tour(adj):
 n, m = len(adj), sum([len(adj[i]) for i in range(len(adj))])
 in_deg, out_deg = count_in_out_degrees(adj)
 tour_present, start = get_start_if_Euler_tour_present(in_deg, \
                                                       out_deg)
 if not tour_present:
  return None
 tour = []
 dfs(adj, start, out_deg, tour)
 if len(tour) == m+1:
  return tour
 return None
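
For example (a directed triangle, which has an Euler circuit):

adj = [[1], [2], [0]]  # directed cycle 0 -> 1 -> 2 -> 0
print(compute_Euler_tour(adj))  # [0, 1, 2, 0]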

The following animations show how the algorithm works.

Image for post
Image by Author
Image for post
Image by Author

Cycle Detection in a Directed Graph

Check whether a given directed graph with 𝑛 vertices and 𝑚 edges contains a cycle.

The following figure shows the classification of the edges encountered in DFS:

Image for post
Image taken from this lecture notes

It can be shown that whether a Directed Graph is acyclic (a DAG) or not (i.e., whether it contains a cycle) can be checked by the presence of a back edge during DFS traversal.

Steps

  1. Use the recursive DFS implementation (pseudo-code shown in the below figure)
  2. Track whether a node to be visited is already on the stack; if it is, the edge to it forms a back edge.
  3. Use the parents array to reconstruct the directed cycle, if one is found.
Image for post
Image taken from this lecture notes

Python code

def acyclic(adj):
 n = len(adj)
 visited = [False]*n
 parents = [None]*n
 on_stack = [False]*n
 cycle = []
 
 def dfs_visit(adj, u, cycle):
  visited[u] = True
  on_stack[u] = True
  for v in adj[u]:
   if not visited[v]:
    parents[v] = u
    dfs_visit(adj, v, cycle)
   elif on_stack[v]:
    x = u
    while x != v:
     cycle.append(x)
     x = parents[x]
    cycle.insert(0, v) # prepend v in place to complete the cycle
    #print(cycle)    
  on_stack[u] = False
  
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v, cycle)
 
 return int(len(cycle) > 0)
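
For example:

print(acyclic([[1], [2], [0]]))  # 1 (contains the cycle 0 -> 1 -> 2 -> 0)
print(acyclic([[1], [2], []]))   # 0 (a DAG)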
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

The above algorithm can be used to check consistency in a curriculum (e.g., there is no cyclic dependency in the prerequisites for each course listed).

Topologically Order a Directed Graph

Compute a topological ordering of a given directed acyclic graph (DAG) with 𝑛 vertices and 𝑚 edges.

The following idea can be used to obtain such an ordering in a DiGraph G:

  • Find sink.
  • Put at end of order.
  • Remove from graph.
  • Repeat.

It can be implemented efficiently with dfs by keeping track of the time of pre and post visiting the nodes.

Steps

  1. Use the recursive implementation of dfs.
  2. When visiting a node with a recursive call, record the pre- and post-visit times at the beginning and at the end (once all the children of the given node have been visited) of the recursive call, respectively.
  3. Reorder the nodes by their post-visit times in descending order, as shown in the following figure.
Image for post
Image taken from this lecture notes

Python code


clock = 1

def toposort(adj):
 order = []
 n = len(adj)
 visited = [False]*n
 previsit = [0]*n
 postvisit = [0]*n
 
 def dfs_visit(adj, u):
  global clock
  visited[u] = True
  previsit[u] = clock
  clock += 1
  for v in adj[u]:
   if not visited[v]:
    dfs_visit(adj, v)
  postvisit[u] = clock
  clock += 1
 
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v)
 
 order = [x for _, x in sorted(zip(postvisit, range(n)), \
                   key=lambda pair: pair[0], reverse=True)]
 
 return order
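
For example (a small DAG; the global clock initialized above is assumed):

adj = [[1, 2], [3], [3], []]  # DAG with edges 0->1, 0->2, 1->3, 2->3
print(toposort(adj))  # [0, 2, 1, 3]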
Image for post
Image for post

The above algorithm can be used to determine the order of the courses in a curriculum, taking care of the pre-requisites dependencies.

Kosaraju’s Algorithm to find the Strongly Connected Components (SCCs) in a Digraph

Compute the number of strongly connected components of a given directed graph with 𝑛 vertices and 𝑚 edges.

Note the following:

  • dfs can be used to find SCCs in a Digraph.
  • We need to make sure that the dfs traversal does not leave a component through an outgoing edge, i.e., it should start from a sink component.
  • A sink component is one that does not have an outgoing edge, so we need to find a sink component first.
  • The vertex with the largest post-visit time belongs to a source component of the dfs.
  • The reverse (or transpose) graph of G has the same SCCs as the original graph.
  • Source components of the transpose of G are sink components of G.

With all the above information, the following algorithm can be implemented to obtain the SCCs in a digraph G.

Image for post
Image taken from this lecture notes

The runtime of the algorithm is again O(|V| + |E|). Alternatively, the algorithm can be represented as follows:

Image for post
Taken from these lecture notes

Python Code


clock = 1

def number_of_strongly_connected_components(adj):
 result = 0
 n = len(adj)
 visited = [False]*n
 previsit = [0]*n
 postvisit = [0]*n
 
 def reverse_graph(adj):
  n = len(adj)
  new_adj = [ [] for _ in range(n)]
  for i in range(n):
   for j in adj[i]:
    new_adj[j].append(i)
  return new_adj
 
 def dfs_visit(adj, u):
  global clock
  visited[u] = True
  previsit[u] = clock
  clock += 1
  for v in adj[u]:
   if not visited[v]:
    dfs_visit(adj, v)
  postvisit[u] = clock
  clock += 1
 
 for v in range(n):
  if not visited[v]:
   dfs_visit(adj, v)
 post_v = [x for _, x in sorted(zip(postvisit, range(n)), \
                   key=lambda pair: pair[0], reverse=True)]
 rev_adj = reverse_graph(adj)
 visited = [False]*n
 for v in post_v:
  if not visited[v]:
   dfs_visit(rev_adj, v)
   result += 1
 return result
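
For example:

adj = [[1], [2], [0], [2], []]  # one 3-cycle {0, 1, 2}, plus the singleton SCCs {3} and {4}
print(number_of_strongly_connected_components(adj))  # 3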

The following animations show how the algorithm works:

Image for post
Image by Author
Image for post
Image by Author
Image for post

Shortest Path in an Undirected Graph with BFS

Given an undirected graph with 𝑛 vertices and 𝑚 edges and two vertices 𝑢 and 𝑣, compute the length of a shortest path between 𝑢 and 𝑣 (that is, the minimum number of edges in a path from 𝑢 to 𝑣).

  • The following figure shows the algorithm for bfs.
  • It uses a queue (FIFO) instead of a stack (LIFO) to store the nodes to be explored.
  • The traversal is also called level-order traversal, since the nodes reachable with smaller number of hops (shorter distances) from the source / start node are visited earlier than the higher-distance nodes.
  • The running time of bfs is again O(|E| + |V|).
Image for post
Image taken from this lecture notes

Python code


def distance(adj, s, t):
 inf = 10**9 #float('Inf')
 d = [inf]*len(adj)
 queue = [s]
 d[s] = 0
 while len(queue) > 0:
  u = queue.pop(0)
  for v in adj[u]:
   if d[v] ==  inf:
    queue.append(v)
    d[v] = d[u] + 1
    if v == t:
     return d[t]
 return -1
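
For example:

adj = [[1, 2], [0, 3], [0, 3], [1, 2, 4], [3]]
print(distance(adj, 0, 4))  # 3 (e.g., the path 0 -> 1 -> 3 -> 4)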

The following animations demonstrate how the algorithm works. The queue used to store the vertices to be traversed is also shown.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

The above algorithm can be used to compute the minimum number of flight segments to get from one city to another one.

Checking if a Graph is Bipartite (2-Colorable)

Given an undirected graph with 𝑛 vertices and 𝑚 edges, check whether it is bipartite.

Steps

  1. Note that a graph is bipartite iff its vertices can be colored using 2 colors (so that no two adjacent vertices have the same color).
  2. Use bfs to traverse the graph from a starting node and color nodes in the alternate levels (measured by distances from the source) with red (even levels) and blue (odd levels).
  3. If at any point in time two adjacent vertices are found that are colored using the same color, then the graph is not 2-colorable (hence not bipartite).
  4. If no such case occurs, then the graph is bipartite.
  5. For a graph with multiple components, use bfs on each of them.
  6. Also, note that a graph is bipartite iff it contains no odd-length cycle.

Python code


def bipartite(adj):
 color = [None]*len(adj)
 for vertex in range(len(adj)):
  if not color[vertex]:
   queue = [vertex]
   color[vertex] = 'red'
   while len(queue) > 0:
    u = queue.pop(0)
    for v in adj[u]:
     if color[v] ==  color[u]:
      return 0
     if not color[v]:
      queue.append(v)
      color[v] = 'red' if color[u] == 'blue' else 'blue'
 return 1
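
For example:

print(bipartite([[1, 3], [0, 2], [1, 3], [0, 2]]))  # 1 (a 4-cycle is bipartite)
print(bipartite([[1, 2], [0, 2], [0, 1]]))          # 0 (a triangle is not)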

The following animations demonstrate how the algorithm works.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image for post
Image by Author

Notice that the last graph contained a cycle of length 4 (an even-length cycle) and it was bipartite. The graph prior to this one contained a triangle (a cycle of odd length 3), so it was not bipartite.

Shortest Path in a Weighted Graph with Dijkstra

Given a directed graph with positive edge weights and with 𝑛 vertices and 𝑚 edges, as well as two vertices 𝑢 and 𝑣, compute the weight of a shortest path between 𝑢 and 𝑣 (that is, the minimum total weight of a path from 𝑢 to 𝑣).

  • Optimal substructure property : any subpath of an optimal path is also optimal.
  • Initially, we only know the distance to source node S and relax all the edges from S.
  • Maintain a set R of vertices for which dist is already set correctly (known region).
  • On each iteration take a vertex outside of R with the minimum dist-value, add it to R, and relax all its outgoing edges.
  • The next figure shows the pseudocode of the algorithm.
  • The algorithm works for any graph with non-negative edge weights.
  • The algorithm does not work if the input graph has negative edge weights: the known region relies on the assumption that the dist-value of a settled vertex cannot be reduced any further, and this assumption does not hold if the graph has negative edges.
Image for post
Image created from this lecture notes

Python code


import queue

def distance(adj, cost, s, t):
 inf = 10**19
 n = len(adj)
 d = [inf]*n
 d[s] = 0
 visited = [0]*n
 h = queue.PriorityQueue()
 for v in range(n):
  h.put((d[v], v))
 while not h.empty():
  u = h.get()[1]
  if visited[u]: continue
  visited[u] = True
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    h.put((d[v], v))
 return d[t] if d[t] != inf else -1
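
For example (cost[u][i] holds the weight of the edge adj[u][i]):

adj = [[1, 2], [2], [3], []]      # directed edges 0->1, 0->2, 1->2, 2->3
cost = [[4, 1], [2], [5], []]
print(distance(adj, cost, 0, 3))  # 6 (the path 0 -> 2 -> 3 with weight 1 + 5)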

The following animations show how the algorithm works.

Image for post
Image by Author
Image for post
Image by Author

The algorithm works for undirected graphs as well. The following animations show how the shortest path can be found on undirected graphs.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image for post
Image by Author

Detecting Negative Cycle in a Directed Graph with Bellman-Ford

Given a directed graph with possibly negative edge weights and with 𝑛 vertices and 𝑚 edges, check whether it contains a cycle of negative weight. Also, given a vertex 𝑠, compute the length of shortest paths from 𝑠 to all other vertices of the graph.

  • For a Digraph with n nodes (without a negative cycle), the shortest path between two nodes (e.g., the source node and any other node) can contain at most n-1 edges.
  • Relax the edges while dist changes (at most n-1 times; most of the time the distances stop changing much earlier).
  • This algorithm works even for negative edge weights.
  • The following figure shows the pseudocode of the algorithm.
Image for post
Image created from this lecture notes

The above algorithm can also be used to detect a negative cycle in the input graph.

  • Run n = |V| iterations of the Bellman–Ford algorithm (i.e., run the edge-relaxation once more, for the n-th time).
  • If there exists a vertex whose distance still decreases, then there exists a negative-weight cycle in the graph, and that vertex is reachable from the cycle.
  • Save a node v relaxed on the last iteration; v is reachable from a negative cycle.
  • Start from x ← v and follow the links x ← prev[x] |V| times; x will then definitely lie on the cycle.
  • Save y ← x and go x ← prev[x] until x = y again.

The above algorithm can be used to detect infinite arbitrage, with the following steps:

  • Do |V| iterations of Bellman–Ford and save all nodes relaxed on the V-th iteration; call this set A.
  • Put all nodes from A in a queue Q.
  • Do breadth-first search with the queue Q and find all nodes reachable from A.
  • All those nodes, and only those, can have infinite arbitrage.

Python code


def negative_cycle(adj, cost):
 inf = 10**19 # float('Inf')
 n = len(adj)
 d = [inf]*n
 d[0] = 0
 for k in range(n-1):
  for u in range(n):
   for i in range(len(adj[u])):
    v = adj[u][i]
    if d[v] > d[u] + cost[u][i]:
     d[v] = d[u] + cost[u][i]
 for u in range(n):
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    return 1
 return 0
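
For example (a directed triangle whose total weight is negative):

adj = [[1], [2], [0]]
cost = [[1], [-3], [1]]           # cycle weight 1 - 3 + 1 = -1 < 0
print(negative_cycle(adj, cost))  # 1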

The following animations demonstrate how the algorithm works.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

The last animation shows how the shortest paths can be computed with the algorithm in the presence of negative edge weights.

Find Minimum Spanning Tree in a Graph with Prim’s Greedy Algorithm

Given 𝑛 points on a plane, connect them with segments of minimum total length such that there is a path between any two points.

Steps

  • Let’s construct the following undirected graph G=(V, E): each node (in V) in the graph corresponds to a point in the plane and the weight of an edge (in E) in between any two nodes corresponds to the distance between the nodes.
  • The length of a segment with endpoints (𝑥1, 𝑦1) and (𝑥2, 𝑦2) is equal to the Euclidean distance √((𝑥1 − 𝑥2)² + (𝑦1 − 𝑦2)²).
  • Then the problem boils down to finding a Minimum Spanning Tree in the graph G.
  • The following figure defines the MST problem and two popular greedy algorithms Prim and Kruskal to solve the problem efficiently.
Image for post
Image created from this lecture notes

We shall use Prim’s algorithm to find the MST in the graph. Here are the algorithm steps:

  • It starts from an initial vertex and grows a tree X into a spanning tree.
  • X is always a subtree; it grows by one edge at each iteration.
  • Add the lightest edge between a vertex of the tree and a vertex not in the tree.
  • The algorithm is very similar to Dijkstra’s algorithm.
  • The pseudocode is shown in the following figure.
Image for post
Image taken from this lecture notes

Python code


import math
import queue

def minimum_distance(x, y):
 result = 0.
 inf = 10**19
 n = len(x)
 adj = [[0 for _ in range(n)] for _ in range(n)]
 for i in range(n):
  for j in range(i+1, n):
   adj[i][j] = adj[j][i] = math.sqrt((x[i]-x[j])**2 + \
                                     (y[i]-y[j])**2)
 c = [inf]*n
 s = 0
 c[s] = 0
 visited = [False]*n
 parent = [None]*n
 h = queue.PriorityQueue()
 for v in range(n):
  h.put((c[v], v))
 while not h.empty():
  w, u = h.get()
  if visited[u]: continue
  visited[u] = True
  for v in range(n):
   if v == u: continue
   if (not visited[v]) and (c[v] > adj[u][v]):
    c[v] = adj[u][v]
    parent[v] = u
    h.put((c[v], v))
 spanning_tree = []
 for i in range(n):
  spanning_tree.append((i, parent[i]))
  if parent[i] != None: 
   result += adj[i][parent[i]]
 #print(spanning_tree)
 return result
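
For example (the four corners of a unit square; the MST consists of three unit-length edges):

x, y = [0, 0, 1, 1], [0, 1, 0, 1]
print(minimum_distance(x, y))  # 3.0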

The following animations demonstrate how the algorithm works for a few different sets of input points in the 2-D plane.

Image for post
Image by Author

The following two animations show the algorithm steps on the input points and the corresponding graph formed, respectively.

Image for post
Image by Author
Image for post
Image by Author

The following animation shows how Prim’s algorithm outputs an MST for 36 input points in a 6×6 grid.

Image for post
Image by Author

As expected, the cost of the MST formed is 35 for the 36 points, since a spanning tree with n nodes has n-1 edges, each with unit length here.

The above algorithm can be used to build roads between some pairs of the given cities such that there is a path between any two cities and the total length of the roads is minimized.

Finding Minimum Spanning Tree and Hierarchical Clustering with Kruskal

Given 𝑛 points on a plane and an integer 𝑘, compute the largest possible value of 𝑑 such that the given points can be partitioned into 𝑘 non-empty subsets in such a way that the distance between any two points from different subsets is at least 𝑑.

Steps

  • We shall use Kruskal’s algorithm to solve the above problem. Each point can be thought of a node in the graph, as earlier, with the edge weights between the nodes being equal to the Euclidean distance between them.
  • Start with n components, each node representing one.
  • Run iterations of the Kruskal’s algorithm and merge the components till there are exactly k (< n) components left.
  • These k components will be the k desired clusters of the points and we can compute d to be the largest distance in between two points belonging to different clusters.

Here are the steps for the Kruskal’s algorithm to compute MST:

  • Start with each node in the graph as a single-node tree in a forest X.
  • Repeatedly add to X the next lightest edge e that doesn’t produce a cycle.
  • At any point of time, the set X is a forest (i.e., a collection of trees)
  • The next edge e connects two different trees — say, T1 and T2
  • The edge e is the lightest between T1 and V \ T1, hence adding e is safe
Image for post
Image taken from this lecture notes
  • Use disjoint sets data structure for implementation
  • Initially, each vertex lies in a separate set
  • Each set is the set of vertices of a connected component
  • To check whether the current edge {u, v} produces a cycle, we check whether u and v belong to the same set (find).
  • Merge two sets using union operation.
  • The following figure shows a few algorithms to implement union-find abstractions for disjoint sets.
Image for post
Image created from this lecture notes

Python code


import math

def clustering(x, y, k):
 result = 0.
 inf = float('Inf') #10**19
 n = len(x)
 adj = []
 for i in range(n):
  for j in range(i+1, n):
   adj.append((math.sqrt((x[i]-x[j])**2 + (y[i]-y[j])**2), i, j))
 adj = sorted(adj)
 indices = {i:i for i in range(n)}
 while len(set(indices.values())) > k:
  d, u, v = adj.pop(0)
  iu, iv = indices[u], indices[v]
  # implemented quick-find here
  # To-do: implement weighted union with path-compression heuristic      
  if iu != iv:
   indices[u] = indices[v] = min(iu, iv) 
   for j in range(n):
    if indices[j] == max(iu, iv):
      indices[j] = min(iu, iv) 
 clusters = {}
 for i in range(n):
  ci = indices[i]
  clusters[ci] = clusters.get(ci, []) + [i]
 #print(clusters)
 d = inf
 for i in list(clusters.keys()):
  for j in list(clusters.keys()):
   if i == j: continue
   for vi in clusters[i]:
    for vj in clusters[j]:
     d = min(d, math.sqrt((x[vi]-x[vj])**2 + (y[vi]-y[vj])**2))
 return d
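
For example (a unit square of points far away from a second pair of points; with k = 2 the minimum inter-cluster distance is 9):

x, y = [0, 0, 1, 1, 10, 10], [0, 1, 0, 1, 0, 1]
print(clustering(x, y, 2))  # 9.0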

The following animations demonstrate how the algorithm works. The next couple of animations show how 8 points in a plane are (hierarchically) clustered using Kruskal’s algorithm and how the MST is finally computed. The first animation shows how the points are clustered, and the next one shows how Kruskal works on the corresponding graph created from the points.

Image for post
Image by Author
Image for post
Image by Author

The next animation again shows how a set of points in 2-D are clustered using Kruskal and MST is computed.

Image for post
Image by Author

Clustering is a fundamental problem in data mining. The goal is to partition a given set of objects into subsets (or clusters) in such a way that any two objects from the same subset are close (or similar) to each other, while any two objects from different subsets are far apart.

Now we shall use Kruskal’s algorithm to cluster a small real-world dataset named the Iris dataset (can be downloaded from the UCI Machine learning repository), often used as a test dataset in Machine learning.

The next 3 animations show how Kruskal can be used to cluster the Iris dataset. The first few rows of the dataset are shown below:

Image for post

Let’s use the first two features (SepalLength and SepalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post
Image by Author

Now, let’s use the second and third feature variables (SepalWidth and PetalLength), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post
Image by Author

Finally, let’s use the third and fourth feature variables (PetalLength and PetalWidth), project the dataset in 2-D and use Kruskal to cluster the dataset, as shown in the following animation.

Image for post
Image by Author

Friend Suggestion in Social Networks — Bidirectional Dijkstra

Compute the distance between several pairs of nodes in the network.

Steps

  • Build reverse graph GR
  • Start Dijkstra from s in G and from t in GR
  • Alternate between Dijkstra steps in G and in GR
  • Stop when some vertex v is processed both in G and in GR
  • Compute the shortest path between s and t

Meet-in-the-middle

  • More general idea, not just for graphs
  • Instead of searching for all possible objects, search for first halves and for second halves separately
  • Then find compatible halves
  • Typically roughly O(√N) instead of O(N)

Friends suggestions in social networks

  • Find the shortest path from Michael to Bob via friends connections
  • For the two “farthest” people, Dijkstra has to look through 2 billion people
  • If we only consider friends of friends of friends for both Michael and Bob, we will find a connection
  • Roughly 1M friends of friends of friends
  • 1M + 1M = 2M people – 1000 times less
  • Dijkstra goes in circles
  • Bidirectional search idea can reduce the search space
  • Bidirectional Dijkstra can be 1000s times faster than Dijkstra for social networks
  • The following figure shows the algorithm
Image for post
Created from this lecture note

The following python function implements bidirectional Dijkstra; the arguments adj and adjR represent the adjacency lists for the original and the reverse graph, respectively.

Python code


import queue

def distance(adj, cost, adjR, costR, s, t):
  
 def process(u, adj, cost, h, d, prev, visited):
  for i in range(len(adj[u])):
   v = adj[u][i]
   if d[v] > d[u] + cost[u][i]:
    d[v] = d[u] + cost[u][i]
    h.put((d[v], v))
    prev[v] = u
  visited[u] = True
 
 def shortest_path(s, dist, prev, t, distR, prevR, visited_any):
  distance = inf
  ubest = None
  for u in visited_any:
   if dist[u] + distR[u] < distance:
    ubest = u
    distance = dist[u] + distR[u]
  return distance if ubest != None else -1
 
 inf = 10**19
 n = len(adj)
 d, dR = [inf]*n, [inf]*n
 d[s] = 0
 dR[t] = 0
 visited, visitedR = [False]*n, [False]*n
 visited_any = set([])
 prev, prevR = [None]*n, [None]*n
 h = queue.PriorityQueue()
 h.put((d[s], s))
 hR = queue.PriorityQueue()
 hR.put((dR[t], t))
 while True:
  u = h.get()[1]
  if visited[u]: continue
  process(u, adj, cost, h, d, prev, visited)
  visited_any.add(u)
  if visitedR[u]:
   return shortest_path(s, d, prev, t, dR, prevR, visited_any)
  uR = hR.get()[1]
  if visitedR[uR]: continue
  process(uR, adjR, costR, hR, dR, prevR, visitedR)
  visited_any.add(uR)
  if visited[uR]:
   return shortest_path(s, d, prev, t, dR, prevR, visited_any)
  if h.empty() or hR.empty():
   return -1
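
A tiny sanity check (a 3-vertex directed path together with its explicitly built reverse graph):

adj = [[1], [2], []]   # 0 -> 1 -> 2
cost = [[3], [4], []]
adjR = [[], [0], [1]]  # the reverse graph
costR = [[], [3], [4]]
print(distance(adj, cost, adjR, costR, 0, 2))  # 7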

The following animation shows how the algorithm works:

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

Computing Distance Faster Using Coordinates with A* search

Compute the distance between several pairs of nodes in the network. The length l(u, v) between any two nodes u and v is guaranteed to satisfy l(u, v) ≥ √((𝑥(𝑢) − 𝑥(𝑣))² + (𝑦(𝑢) − 𝑦(𝑣))²).

  • Let’s say we are submitting a query to find a shortest path between nodes (s, t) in a graph.
  • A potential function 𝜋(v) for a node v is an estimate of d(v, t), i.e., of how far it is from v to t.
  • If we have such an estimate, we can often avoid exploring in the wrong direction (a directed search).
  • Take some potential function 𝜋 and run Dijkstra’s algorithm with the modified edge weights ℓ𝜋(u, v) = ℓ(u, v) − 𝜋(u) + 𝜋(v).
  • For any edge (u, v), the new length ℓ𝜋(u, v) must be non-negative; such a 𝜋 is called feasible.
  • Typically 𝜋(v) is a lower bound on d(v, t).
  • A* is a directed search algorithm based on Dijkstra and potential functions.
  • Run Dijkstra with the potential 𝜋 to find the shortest path; the resulting algorithm is A* search, described in the following figure.
  • On a real map a path from v to t cannot be shorter than the straight line segment from v to t.
  • For each v, we can thus compute 𝜋(v) = dE(v, t), the Euclidean distance from v to t.
  • This 𝜋(v) gives a lower bound on d(v, t).
  • Worst case: 𝜋(v) = 0 for all v, which is the same as plain Dijkstra.
  • Best case: 𝜋(v) = d(v, t) for all v; then ℓ𝜋(u, v) = 0 iff (u, v) is on a shortest path to t, so the search visits only the edges of shortest s − t paths.
  • It can be shown that the tighter the lower bounds, the fewer vertices will be scanned.
Image for post

The following animations show how the A* search algorithm works on road networks (and the corresponding graph representations). Each node in the graph has the corresponding (distance, potential) pairs.

Image for post
Image by Author
Image for post
Image by Author

The following animations show how the shortest path is computed between two nodes for the USA road network for the San Francisco Bay area (the corresponding graph containing 321270 nodes and 800172 edges), using Dijkstra, bidirectional Dijkstra and A* algorithms respectively, the data can be downloaded from 9th DIMACS Implementation Challenge — Shortest Paths. Here, with bidirectional Dijkstra, the nodes from forward and reverse priority queues are popped in the same iteration.

Image for post
Image for post
Image for post

In medium: https://medium.com/python-in-plain-english/graph-algorithms-with-python-1fe2f29d663c

Implementing a few Advanced Algorithms with Python

In this blog we shall discuss a few advanced algorithms and their python implementations (from scratch). The problems discussed here appeared as programming assignments in the coursera course Advanced Algorithms and Complexity and on Rosalind. The problem statements are taken from the course itself.

Ford-Fulkerson / Edmonds-Karp Algorithm for Max-flow

In this problem, we shall apply an algorithm for finding maximum flow in a network to determine how fast people can be evacuated from the given city.

Image for post
Image for post
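
The course implementation is not reproduced here, but the idea can be sketched with a minimal BFS-based Edmonds-Karp routine (the function name max_flow and the dense capacity-matrix representation C are choices made just for this illustration):

from collections import deque

def max_flow(C, s, t):
    # Edmonds-Karp: repeatedly augment along a shortest (fewest-edge) s-t path
    n = len(C)
    F = [[0]*n for _ in range(n)]  # current flow on each arc
    flow = 0
    while True:
        # BFS in the residual graph
        parent = [-1]*n
        parent[s] = s
        q = deque([s])
        while q and parent[t] == -1:
            u = q.popleft()
            for v in range(n):
                if parent[v] == -1 and C[u][v] - F[u][v] > 0:
                    parent[v] = u
                    q.append(v)
        if parent[t] == -1:  # no augmenting path left: current flow is maximum
            return flow
        # bottleneck residual capacity along the path found
        aug, v = float('inf'), t
        while v != s:
            u = parent[v]
            aug = min(aug, C[u][v] - F[u][v])
            v = u
        # augment the flow along the path
        v = t
        while v != s:
            u = parent[v]
            F[u][v] += aug
            F[v][u] -= aug
            v = u
        flow += aug

# a small example: the maximum flow from 0 to 3 is 3, saturating all arcs out of the source
C = [[0, 2, 1, 0],
     [0, 0, 1, 1],
     [0, 0, 0, 2],
     [0, 0, 0, 0]]
print(max_flow(C, 0, 3))  # 3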

The following animations show how the algorithm works. The red nodes represent the source and the sink in the network.

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

In this problem, we shall apply an algorithm for finding maximum matching in a bipartite graph to assign airline crews to flights in the most efficient way. Later we shall see how an Integer Linear Program (ILP) can be formulated to solve the bipartite matching problem and then solved with a MIP solver.

With Max-flow

The following animations show how the algorithm finds the maximum matching for a few bipartite graphs (the blue and green nodes represent the two disjoint vertex sets of the bipartite graph):

Image for post
Image by Author
Image for post
Image by Author
Image for post
Image by Author

With Integer Linear Program

Image for post

Using the mip package to solve the integer program, with a binary cost matrix for the input bipartite graph G (where c[i][j] = 1 iff there is an edge between i ∈ A and j ∈ B, A and B being the disjoint vertex sets), the maximum bipartite matching problem can be solved with an ILP solver as shown in the following code snippet:

from mip import Model, xsum, maximize, BINARY

def mip_bipartite_matching(c):
   
   n, m = len(c), len(c[0])
   A, B = list(range(n)), list(range(m))
   
   model = Model()
   x = [[model.add_var(var_type=BINARY) for j in B] for i in A]
   
   model.objective = maximize(xsum(c[i][j]*x[i][j] \
                              for i in A for j in B))
   for i in A:
      model += xsum(x[i][j] for j in B) <= 1
   for j in B:
      model += xsum(x[i][j] for i in A) <= 1
      
   model.optimize()
   print(model.objective_value)
   return [(i,j) for i in A for j in B if x[i][j].x >= 0.99]
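
For example (a hypothetical 3×3 binary adjacency matrix with a unique perfect matching):

c = [[1, 1, 0],
     [1, 0, 0],
     [0, 1, 1]]
print(mip_bipartite_matching(c))  # [(0, 1), (1, 0), (2, 2)]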
Image for post
Image by Author

The ILP for the maximum bipartite matching problem can also be represented as follows in terms of the incidence matrix:

Image for post
Image by Author

where the incidence matrix can be represented as follows, for the above bipartite graph:

Image for post
Image by Author

The above matrix can be shown to be totally unimodular (i.e., the determinant of any square submatrix is either 0, −1 or 1; e.g., out of 215 submatrices of the above incidence matrix, 206 submatrices have determinant 0, 8 submatrices have determinant 1 and 1 submatrix has determinant −1), implying that every basic feasible solution (and hence the optimal solution too) of the LP will have integer values for every variable.

On the contrary, the solution obtained by the max-flow algorithm for the same input bipartite graph is shown below:

Image for post
Image by Author

Solving 2-SAT problem and Integrated Circuit Design

VLSI or Very Large-Scale Integration is a process of creating an integrated circuit by combining thousands of transistors on a single chip. You want to design a single layer of an integrated circuit.

You know exactly what modules will be used in this layer, and which of them should be connected by wires. The wires will be all on the same layer, but they cannot intersect with each other. Also, each wire can only be bent once, in one of two directions — to the left or to the right. If you connect two modules with a wire, selecting the direction of bending uniquely defines the position of the wire. Of course, some positions of some pairs of wires lead to intersection of the wires, which is forbidden. You need to determine a position for each wire in such a way that no wires intersect.

This problem can be reduced to 2-SAT problem — a special case of the SAT problem in which each clause contains exactly 2 variables. For each wire 𝑖, denote by 𝑥_𝑖 a binary variable which takes value 1 if the wire is bent to the right and 0 if the wire is bent to the left. For each 𝑖, 𝑥_𝑖 must be either 0 or 1.

Also, some pairs of wires intersect in some positions. For example, it could be that if wire 1 is bent to the left and wire 2 is bent to the right, then they intersect. We want to write down a formula which is satisfied only if no wires intersect. In this case, we will add the clause (𝑥1 OR ¬𝑥2) to the formula, which ensures that either 𝑥1 is true (the first wire is bent to the right) or ¬𝑥2 is true (the second wire is bent to the left), so the particular crossing that happens when wire 1 is bent to the left AND wire 2 is bent to the right cannot occur whenever the formula is satisfied. We will add such a clause for each pair of wires and each pair of their positions if they intersect when put in those positions. Of course, if some pair of wires intersects in all of the possible positions, we won't be able to design a circuit. Your task is to determine whether it is possible, and if yes, determine the direction of bending for each of the wires.

The input represents a 2-CNF formula. We need to find a satisfying assignment to the formula if one exists, or report that the formula is not satisfiable otherwise.
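A classic linear-time approach (sketched below; not necessarily the exact implementation used for the animations) builds the implication graph of the formula and inspects its strongly connected components, here computed with networkx:

import networkx as nx

def solve_2sat(num_vars, clauses):
    # clauses: list of (a, b) pairs; literal k > 0 means x_k, k < 0 means NOT x_k
    def node(lit):  # encode a literal as a graph node
        v = abs(lit)
        return 2 * v + (0 if lit > 0 else 1)

    g = nx.DiGraph()
    for v in range(1, num_vars + 1):
        g.add_nodes_from([2 * v, 2 * v + 1])
    for a, b in clauses:
        # (a OR b)  is equivalent to  (NOT a -> b) AND (NOT b -> a)
        g.add_edge(node(-a), node(b))
        g.add_edge(node(-b), node(a))

    cond = nx.condensation(g)     # DAG of strongly connected components
    comp = cond.graph['mapping']  # node -> SCC index
    order = {c: i for i, c in enumerate(nx.topological_sort(cond))}

    assignment = []
    for v in range(1, num_vars + 1):
        if comp[2 * v] == comp[2 * v + 1]:
            return None  # x_v and NOT x_v are equivalent: unsatisfiable
        # a literal is true iff its SCC comes later in topological order
        assignment.append(order[comp[2 * v]] > order[comp[2 * v + 1]])
    return assignment

# e.g., solve_2sat(2, [(1, -2), (-1, 2)]) returns a satisfying
# assignment in which x_1 and x_2 get equal truth values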


The following animations show how the algorithm works.


Computing the Maximum Independent Set in a Tree


The following animations show how the dynamic programming algorithm computes the DP table (initialized to -1 for all nodes, as shown) and the corresponding maximum weighted independent sets for the subproblems and uses them to compute the MWIS of the entire tree. The pink nodes represent the nodes belonging to the maximum weighted independent set for a given subproblem (the root of the corresponding subtree is marked with double circles) and also for the final problem.
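A minimal sketch of this bottom-up DP (organized as a post-order traversal rather than the -1-initialized table shown in the animation; adj and weight are assumed to be an adjacency list and a node-to-weight mapping):

import sys

def mwis_tree(adj, weight, root=0):
    sys.setrecursionlimit(10**6)
    dp = {}  # dp[v] = (best weight excluding v, best weight including v)
    def solve(v, parent):
        incl, excl = weight[v], 0
        for u in adj[v]:
            if u == parent:
                continue
            solve(u, v)
            incl += dp[u][0]    # if v is in the set, its children must be out
            excl += max(dp[u])  # if v is out, each child may be in or out
        dp[v] = (excl, incl)
    solve(root, None)
    return max(dp[root])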


Computing the Convex Hull for a set of points in 2D

In this problem we shall implement Jarvis’ March gift wrapping algorithm to compute the convex hull for a given set of 2D points.

The following code snippet shows a Python implementation of the algorithm. The points are assumed to be stored as a list of (x, y) tuples.

def cross_prod(p0, p1, p2):
    # z-component of the cross product (p1 - p0) x (p2 - p0)
    return (p1[0] - p0[0]) * (p2[1] - p0[1]) - \
           (p1[1] - p0[1]) * (p2[0] - p0[0])

def left_turn(p0, p1, p2):
    # True if p2 lies strictly to the left of the directed line p0 -> p1
    return cross_prod(p0, p1, p2) > 0

def convex_hull(points):
    n = len(points)
    # start from the lexicographically smallest point, guaranteed on the hull
    _, p0 = min([((x, y), i) for i, (x, y) in enumerate(points)])
    start_index = p0
    chull = []
    while True:
        chull.append(p0)
        p1 = (p0 + 1) % n  # make sure p1 != p0
        for p2 in range(n):
            # if p2 is more counter-clockwise than current p1, update p1
            if left_turn(points[p0], points[p1], points[p2]):
                p1 = p2
        p0 = p1
        # came back to the first point?
        if p0 == start_index:
            break
    return chull

The following animations show how the algorithm works.


The following animation shows a bad case for the algorithm, where all the given points lie on the convex hull: since Jarvis’ march runs in O(nh) time for h hull points, the runtime becomes quadratic when h = n.


On Medium: https://medium.com/python-in-plain-english/implementing-a-few-advanced-algorithms-with-python-dbf2fe35687f

Solving a few AI problems with Python

In this blog we shall discuss a few problems in artificial intelligence and their Python implementations. The problems discussed here appeared as programming assignments in the edX course CS50’s Introduction to Artificial Intelligence with Python (HarvardX:CS50 AI). The problem statements are taken from the course itself.

Degrees

Write a program that determines how many “degrees of separation” apart two actors are.

According to the Six Degrees of Kevin Bacon game, anyone in the Hollywood film industry can be connected to Kevin Bacon within six steps, where each step consists of finding a film that two actors both starred in.

  • In this problem, we’re interested in finding the shortest path between any two actors by choosing a sequence of movies that connects them. For example, the shortest path between Jennifer Lawrence and Tom Hanks is 2: Jennifer Lawrence is connected to Kevin Bacon by both starring in “X-Men: First Class,” and Kevin Bacon is connected to Tom Hanks by both starring in “Apollo 13.”
  • We can frame this as a search problem: our states are people. Our actions are movies, which take us from one actor to another (it’s true that a movie could take us to multiple different actors, but that’s okay for this problem).
  • Our initial state and goal state are defined by the two people we’re trying to connect. By using breadth-first search, we can find the shortest path from one actor to another.

The following figure shows what small samples from the dataset look like:

In the csv file for the people, each person has a unique id, corresponding with their id in IMDb’s database. They also have a name, and a birth year.

  • In the csv file for the movies, each movie also has a unique id, in addition to a title and the year in which the movie was released.
  • The csv file for the stars establishes a relationship between the people and the movies from the corresponding files. Each row is a pair of a person_id value and movie_id value.
  • Given two actors nodes in the graph we need to find the distance (shortest path) between the nodes. We shall use BFS to find the distance or the degree of separation. The next figure shows the basic concepts required to define a search problem in AI.

The next python code snippet implements BFS and returns the path between two nodes source and target in the given input graph.

def shortest_path(source, target):
    """
    Returns the shortest list of (movie_id, person_id) pairs
    that connect the source to the target.
    If no possible path, returns None.
    """
 
    explored = set([])
    frontier = [source]  # FIFO queue for BFS, seeded with the source node
    parents = {}
    while len(frontier) > 0:
        person = frontier.pop(0)
        if person == target:
            break
        explored.add(person)
        for (m, p) in neighbors_for_person(person):
            if p not in frontier and p not in explored:
                frontier.append(p)
                parents[p] = (m, person)
    if target not in parents:
        return None
    # walk back from target to source to reconstruct the path
    path = []
    person = target
    while person != source:
        m, p = parents[person]
        path.append((m, person))
        person = p
    path = path[::-1]
    return path

The following animation shows how BFS finds the minimum degree(s) of separation (distance) between the query source and target actor nodes (green colored). The shortest path found between the actor nodes (formed of the edges corresponding to the movies a pair of actors acted together in) is colored in red.

The following shows the distribution of the degrees of separation between the actors.

Tic-Tac-Toe

Using Minimax, implement an AI to play Tic-Tac-Toe optimally.

  • For this problem, the board is represented as a list of three lists (representing the three rows of the board), where each internal list contains three values that are either X, O, or EMPTY.
  • Implement a player function that should take a board state as input, and return which player’s turn it is (either X or O).
  • In the initial game state, X gets the first move. Subsequently, the player alternates with each additional move.
  • Any return value is acceptable if a terminal board is provided as input (i.e., the game is already over).
  • Implement an actions function that should return a set of all of the possible actions that can be taken on a given board.
  • Each action should be represented as a tuple (i, j) where i corresponds to the row of the move (0, 1, or 2) and j corresponds to which cell in the row corresponds to the move (also 0, 1, or 2).
  • Possible moves are any cells on the board that do not already have an X or an O in them.
  • Any return value is acceptable if a terminal board is provided as input.
  • Implement a result function that takes a board and an action as input, and should return a new board state, without modifying the original board.
  • If action is not a valid action for the board, your program should raise an exception.
  • The returned board state should be the board that would result from taking the original input board, and letting the player whose turn it is make their move at the cell indicated by the input action.
  • Importantly, the original board should be left unmodified: since Minimax will ultimately require considering many different board states during its computation. This means that simply updating a cell in board itself is not a correct implementation of the result function. You’ll likely want to make a deep copy of the board first before making any changes.
  • Implement a winner function that should accept a board as input, and return the winner of the board if there is one.
  • If the X player has won the game, your function should return X. If the O player has won the game, your function should return O.
  • One can win the game with three of their moves in a row horizontally, vertically, or diagonally.
  • You may assume that there will be at most one winner (that is, no board will ever have both players with three-in-a-row, since that would be an invalid board state).
  • If there is no winner of the game (either because the game is in progress, or because it ended in a tie), the function should return None.
  • Implement a terminal function that should accept a board as input, and return a boolean value indicating whether the game is over.
  • If the game is over, either because someone has won the game or because all cells have been filled without anyone winning, the function should return True.
  • Otherwise, the function should return False if the game is still in progress.
  • Implement a utility function that should accept a terminal board as input and output the utility of the board.
  • If X has won the game, the utility is 1. If O has won the game, the utility is -1. If the game has ended in a tie, the utility is 0.
  • You may assume utility will only be called on a board if terminal(board) is True.
  • Implement a minimax function that should take a board as input, and return the optimal move for the player to move on that board.
  • The move returned should be the optimal action (i, j) that is one of the allowable actions on the board. If multiple moves are equally optimal, any of those moves is acceptable.
  • If the board is a terminal board, the minimax function should return None.
  • For all functions that accept a board as input, you may assume that it is a valid board (namely, that it is a list that contains three rows, each with three values of either X, O, or EMPTY).
  • Since Tic-Tac-Toe is a tie given optimal play by both sides, you should never be able to beat the AI (though if you don’t play optimally as well, it may beat you!)
  • The following figure demonstrates the basic concepts of adversarial search and games, and defines the different functions for tic-tac-toe (a sketch of a few of these helper functions is given right after this list).
  • Here we shall use Minimax with alpha-beta pruning to speed up the search when it is the computer’s turn.
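The following is a minimal sketch of the player, actions and result helpers described above (assuming the module-level constants X, O and EMPTY from the assignment’s distribution code; an illustration, not necessarily the exact course implementation):

from copy import deepcopy

X, O, EMPTY = "X", "O", None

def player(board):
    # X moves first, so it is X's turn whenever both have moved equally often
    xs = sum(row.count(X) for row in board)
    os = sum(row.count(O) for row in board)
    return X if xs == os else O

def actions(board):
    # every empty cell is a possible move (i, j)
    return {(i, j) for i in range(3) for j in range(3)
            if board[i][j] == EMPTY}

def result(board, action):
    i, j = action
    if board[i][j] != EMPTY:
        raise ValueError("invalid action")
    new_board = deepcopy(board)  # never mutate the original board
    new_board[i][j] = player(board)
    return new_board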

The following python code fragment shows how to implement the minimax algorithm with alpha-beta pruning:

import numpy as np

def max_value_alpha_beta(board, alpha, beta):
    if terminal(board):
       return utility(board), None
    v, a = -np.inf, None
    for action in actions(board):
        m, _ = min_value_alpha_beta(result(board, action), 
                                    alpha, beta)
        if m > v:
           v, a = m, action   
        alpha = max(alpha, v)
        if alpha >= beta:
           break     
    return (v, a)       
 
def min_value_alpha_beta(board, alpha, beta):
    if terminal(board):
       return utility(board), None
    v, a = np.inf, None
    for action in actions(board):
        m, _ = max_value_alpha_beta(result(board, action), 
                                    alpha, beta)
        if m < v:
           v, a = m, action 
        beta = min(beta, v)
        if alpha >= beta:
           break
    return (v, a)       
 
def minimax(board):
    """
    Returns the optimal action for the current player on the board.
    """
    if terminal(board):
       return None
    cur_player = player(board)
    if cur_player == X:
       _, a = max_value_alpha_beta(board, -np.inf, np.inf)  
    elif cur_player == O:
       _, a = min_value_alpha_beta(board, -np.inf, np.inf)
    return a

The following sequence of actions by the human player (who plays as X, the Max player) and the computer (which plays as O, the Min player) shows how the computer reasons to come up with the optimal position for O, along with the game tree it produces using the Minimax with alpha-beta pruning algorithm shown in the above figure.

The following animations show how the game tree is created when the computer thinks for the above turn. The row above the game board denotes the value of the utility function and is color-coded: when the value of the corresponding state is +1 (in favor of the Max player) it is colored green, when the value is -1 (in favor of the Min player) it is colored red, and it is colored gray when the value is 0.

Notice that the game tree is grown fully out to the terminal leaves. The following figure shows a path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O.

Now let’s consider the next sequence of actions by the human player and the computer. As can be seen, the computer again finds the optimal position.

The following animation shows how the game tree is created when the computer thinks for the above turn.

The following figure shows a path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O (note that the optimal path corresponds to a tie in a terminal node).

Finally, let’s consider the next sequence of actions by the human player and the computer. As can be seen, the computer again finds the optimal position.

The following animations show how the game tree is created when the computer thinks for the above turn.

The next figure shows the path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the position of O (note that the optimal path corresponds to a tie in a terminal node).

Now let’s consider another example game. The following sequence of actions by the computer (the AI agent now plays as X, the Max player) and the human player (who plays as O, the Min player) shows how the computer reasons to come up with the optimal position for X, along with the game tree it produces using Minimax with alpha-beta pruning.

The following animations show how the game tree is created when the computer thinks for the above turn.

The next figure shows the path (colored in blue) in the game tree that leads to a decision taken by the computer to choose the next position of X (note that the optimal path corresponds to a win in a terminal node).

Note from the above tree that no matter what action the human player chooses next, the AI agent (the computer) can always win the game. An example final outcome of the game can be seen in the next game tree.

Knights

Write a program to solve logic puzzles.

In 1978, logician Raymond Smullyan published “What is the name of this book?”, a book of logical puzzles. Among the puzzles in the book were a class of puzzles that Smullyan called “Knights and Knaves” puzzles.

  • In a Knights and Knaves puzzle, the following information is given: Each character is either a knight or a knave.
  • A knight will always tell the truth: if a knight states a sentence, then that sentence is true.
  • Conversely, a knave will always lie: if a knave states a sentence, then that sentence is false.
  • The objective of the puzzle is, given a set of sentences spoken by each of the characters, determine, for each character, whether that character is a knight or a knave.

The task in this problem is to determine how to represent these puzzles using propositional logic, such that an AI running a model-checking algorithm could solve these puzzles.

Puzzles

  1. Contains a single character: A. – A says “I am both a knight and a knave.”
  2. Has two characters: A and B. – A says “We are both knaves.” – B says nothing.
  3. Has two characters: A and B. – A says “We are the same kind.” – B says “We are of different kinds.”
  4. Has three characters: A, B, and C. – A says either “I am a knight.” or “I am a knave.”, but you don’t know which. – B says “A said ‘I am a knave.’” – B then says “C is a knave.” – C says “A is a knight.”
  • In this problem, we shall use model checking to find the solutions to the above puzzles. The following figure shows the relevant theory.
  • The following python code snippet shows the base class corresponding to a propositional logic sentence (taken from the code provided for the assignment).
import itertools

class Sentence():

    def evaluate(self, model):
        """Evaluates the logical sentence."""
        raise Exception("nothing to evaluate")

    def formula(self):
        """Returns string formula representing logical sentence."""
        return ""

    def symbols(self):
        """Returns a set of all symbols in the logical sentence."""
        return set()

    @classmethod
    def validate(cls, sentence):
        if not isinstance(sentence, Sentence):
            raise TypeError("must be a logical sentence")
  • Now, a propositional symbol and an operator (unary / binary) can inherit this class and implement the methods specific to the symbol as shown for the boolean operator AND (the code taken from the resources provided for the assignment).
class And(Sentence):
    def __init__(self, *conjuncts):
        for conjunct in conjuncts:
            Sentence.validate(conjunct)
        self.conjuncts = list(conjuncts)

    def __eq__(self, other):
        return isinstance(other, And) and \
               self.conjuncts == other.conjuncts

    def __hash__(self):
        return hash(
            ("and", tuple(hash(conjunct) for conjunct in
                          self.conjuncts))
        )

    def __repr__(self):
        conjunctions = ", ".join(
            [str(conjunct) for conjunct in self.conjuncts]
        )
        return f"And({conjunctions})"

    def add(self, conjunct):
        Sentence.validate(conjunct)
        self.conjuncts.append(conjunct)

    def evaluate(self, model):
        return all(conjunct.evaluate(model) for conjunct in
                   self.conjuncts)

    def formula(self):
        if len(self.conjuncts) == 1:
            return self.conjuncts[0].formula()
        return " ∧ ".join([Sentence.parenthesize(conjunct.formula())
                           for conjunct in self.conjuncts])

    def symbols(self):
        return set.union(*[conjunct.symbols()
                           for conjunct in self.conjuncts])
  • Finally, we need to represent the KB for every puzzle, as the one shown in the code below:
AKnight = Symbol("A is a Knight")
AKnave = Symbol("A is a Knave")

BKnight = Symbol("B is a Knight")
BKnave = Symbol("B is a Knave")

CKnight = Symbol("C is a Knight")
CKnave = Symbol("C is a Knave")

# Puzzle 1
# A says "I am both a knight and a knave."
knowledge0 = And(
    Or(And(AKnight, Not(AKnave)), And(Not(AKnight), AKnave)),
    Implication(AKnight, And(AKnight, AKnave)),
    Implication(AKnave, Not(And(AKnight, AKnave)))
)

The following section demonstrates the outputs obtained with model checking implemented using the above code (by iteratively assigning all possible models / truth values and checking if the KB corresponding to a puzzle evaluates to true for a model, and then outputting all the models for which the KB evaluates to true). To start solving a puzzle we must define the KB for the puzzle as shown above.
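A minimal sketch of such an exhaustive model-checking routine (assuming the Sentence interface shown above, where symbols() returns symbol names and evaluate() accepts a name-to-truth-value mapping) could look like this:

from itertools import product

def model_check_all(knowledge):
    # enumerate every model (truth assignment over the KB's symbols)
    # and collect those in which the knowledge base evaluates to true
    symbols = sorted(knowledge.symbols())
    solutions = []
    for values in product([False, True], repeat=len(symbols)):
        model = dict(zip(symbols, values))
        if knowledge.evaluate(model):
            solutions.append(model)
    return solutions

# e.g., model_check_all(knowledge0) lists the models solving Puzzle 1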

  • The following figure shows the KB (a propositional logic sentence) shown as an expression tree for Puzzle 1.
  • The following animation shows how model checking solves Puzzle 1 (the solution is marked in green).
  • The following figure shows the KB as an expression tree for Puzzle 2.
  • The following animation shows how model checking solves Puzzle 2 (the solution is marked in green).
  • The following figure shows the KB as an expression tree for Puzzle 3.
  • The following animation shows how model checking solves Puzzle 3 (the solution is marked in green).
  • The following figure shows the KB as an expression tree for Puzzle 4.
  • The following animation shows how model checking solves Puzzle 4 (the solution is marked in green).

Minesweeper

Write an AI agent to play Minesweeper.

  • Minesweeper is a puzzle game that consists of a grid of cells, where some of the cells contain hidden “mines.”
  • Clicking on a cell that contains a mine detonates the mine, and causes the user to lose the game.
  • Clicking on a “safe” cell (i.e., a cell that does not contain a mine) reveals a number that indicates how many neighboring cells — where a neighbor is a cell that is one square to the left, right, up, down, or diagonal from the given cell — contain a mine.
  • The following figure shows an example Minesweeper game. In this 3 X 3 Minesweeper game, for example, the three 1 values indicate that each of those cells has one neighboring cell that is a mine. The four 0 values indicate that each of those cells has no neighboring mine.
  • Given this information, a logical player could conclude that there must be a mine in the lower-right cell and that there is no mine in the upper-left cell, for only in that case would the numerical labels on each of the other cells be accurate.
  • The goal of the game is to flag (i.e., identify) each of the mines.
  • In many implementations of the game, the player can flag a mine by right-clicking on a cell (or two-finger clicking, depending on the computer).

Propositional Logic

  • Your goal in this project will be to build an AI that can play Minesweeper. Recall that knowledge-based agents make decisions by considering their knowledge base, and making inferences based on that knowledge.
  • One way we could represent an AI’s knowledge about a Minesweeper game is by making each cell a propositional variable that is true if the cell contains a mine, and false otherwise.
  • What information does the AI agent have access to? Well, the AI would know every time a safe cell is clicked on and would get to see the number for that cell.
  • Consider the following Minesweeper board, where the middle cell has been revealed, and the other cells have been labeled with an identifying letter for the sake of discussion.
  • What information do we have now? It appears we now know that one of the eight neighboring cells is a mine. Therefore, we could write a logical expression like the following one to indicate that one of the neighboring cells is a mine: Or(A, B, C, D, E, F, G, H)
  • But we actually know more than what this expression says. The above logical sentence expresses the idea that at least one of those eight variables is true. But we can make a stronger statement than that: we know that exactly one of the eight variables is true. This gives us a propositional logic sentence like the below.
Or(
And(A, Not(B), Not(C), Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), B, Not(C), Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), C, Not(D), Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), D, Not(E), Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), E, Not(F), Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), F, Not(G), Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), Not(F), G, Not(H)),
And(Not(A), Not(B), Not(C), Not(D), Not(E), Not(F), Not(G), H)
)
  • That’s quite a complicated expression! And that’s just to express what it means for a cell to have a 1 in it. If a cell has a 2 or 3 or some other value, the expression could be even longer.
  • Trying to perform model checking on this type of problem, too, would quickly become intractable: on an 8 X 8 grid, the size Microsoft uses for its Beginner level, we’d have 64 variables, and therefore 2⁶⁴ possible models to check — far too many for a computer to compute in any reasonable amount of time. We need a better representation of knowledge for this problem.

Knowledge Representation

  • Instead, we’ll represent each sentence of our AI’s knowledge like the below. {A, B, C, D, E, F, G, H} = 1
  • Every logical sentence in this representation has two parts: a set of cells on the board that are involved in the sentence, and a number count , representing the count of how many of those cells are mines. The above logical sentence says that out of cells A, B, C, D, E, F, G, and H, exactly 1 of them is a mine.
  • Why is this a useful representation? In part, it lends itself well to certain types of inference. Consider the game below:
  • Using the knowledge from the lower-left number, we could construct the sentence {D, E, G} = 0 to mean that out of cells D, E, and G, exactly 0 of them are mines.
  • Intuitively, we can infer from that sentence that all of the cells must be safe. By extension, any time we have a sentence whose count is 0 , we know that all of that sentence’s cells must be safe.
  • Similarly, consider the game below.
  • Our AI agent would construct the sentence {E, F, H} = 3 . Intuitively, we can infer that all of E, F, and H are mines.
  • More generally, any time the number of cells is equal to the count , we know that all of that sentence’s cells must be mines.
  • In general, we’ll only want our sentences to be about cells that are not yet known to be either safe or mines. This means that, once we know whether a cell is a mine or not, we can update our sentences to simplify them and potentially draw new conclusions.
  • For example, if our AI agent knew the sentence {A, B, C} = 2 , we don’t yet have enough information to conclude anything. But if we were told that C were safe, we could remove C from the sentence altogether, leaving us with the sentence {A, B} = 2 (which, incidentally, does let us draw some new conclusions.)
  • Likewise, if our AI agent knew the sentence {A, B, C} = 2 , and we were told that C is a mine, we could remove C from the sentence and decrease the value of count (since C was a mine that contributed to that count), giving us the sentence {A, B} = 1 . This is logical: if two out of A, B, and C are mines, and we know that C is a mine, then it must be the case that out of A and B, exactly one of them is a mine.
  • If we’re being even more clever, there’s one final type of inference we can do (see the sketch after this list).
  • Consider just the two sentences our AI agent would know based on the top middle cell and the bottom middle cell. From the top middle cell, we have {A, B, C} = 1 . From the bottom middle cell, we have {A, B, C, D, E} = 2 . Logically, we could then infer a new piece of knowledge, that {D, E} = 1 . After all, if two of A, B, C, D, and E are mines, and only one of A, B, and C are mines, then it stands to reason that exactly one of D and E must be the other mine.
  • More generally, any time we have two sentences set1 = count1 and set2 = count2 where set1 is a subset of set2 , then we can construct the new sentence set2 - set1 = count2 - count1 . Consider the example above to ensure you understand why that’s true.
  • So using this method of representing knowledge, we can write an AI agent that can gather knowledge about the Minesweeper board, and hopefully select cells it knows to be safe!
  • The following figure shows examples of inference rules and resolution by inference relevant for this problem:
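The subset inference rule described above is easy to state in code; here is a minimal sketch (assuming the Sentence(cells, count) representation discussed below, with cells stored as a Python set):

def infer_from_subset(s1, s2):
    # if s1.cells is a proper subset of s2.cells, we may infer the
    # new sentence  set2 - set1 = count2 - count1
    if s1.cells and s1.cells < s2.cells:
        return Sentence(s2.cells - s1.cells, s2.count - s1.count)
    return None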

A few implementation tips

  • Represent a logical sentence by a python class Sentence again.
  • Each sentence has a set of cells within it and a count of how many of those cells are mines.
  • Let the class also contain functions known_mines and known_safes for determining if any of the cells in the sentence are known to be mines or known to be safe.
  • It should also have member functions mark_mine() and mark_safe() to update a sentence in response to new information about a cell.
  • Each cell is a pair (i, j) where i is the row number (ranging from 0 to height - 1 ) and j is the column number (ranging from 0 to width - 1 ).
  • Implement a known_mines() method that should return a set of all of the cells in self.cells that are known to be mines.
  • Implement a known_safes() method that should return a set of all the cells in self.cells that are known to be safe.
  • Implement a mark_mine() method that should first check if cell is one of the cells included in the sentence. If cell is in the sentence, the function should update the sentence so that cell is no longer in the sentence, but still represents a logically correct sentence given that cell is known to be a mine.
  • Likewise mark_safe() method should first check if cell is one of the cells included in the sentence. If yes, then it should update the sentence so that cell is no longer in the sentence, but still represents a logically correct sentence given that cell is known to be safe.
  • Implement a method add_knowledge() that should accept a cell and its corresponding count value.
  • The function should mark the cell as one of the moves made in the game.
  • The function should mark the cell as a safe cell, updating any sentences that contain the cell as well.
  • The function should add a new sentence to the AI’s knowledge base, based on the value of cell and count , to indicate that count of the cell ’s neighbors are mines. Be sure to only include cells whose state is still undetermined in the sentence.
  • If, based on any of the sentences in self.knowledge , new cells can be marked as safe or as mines, then the function should do so.
  • If, based on any of the sentences in self.knowledge , new sentences can be inferred , then those sentences should be added to the knowledge base as well.
  • Implement a method make_safe_move() that should return a move (i, j) that is known to be safe.
  • The move returned must be known to be safe, and not a move already made.
  • If no safe move can be guaranteed, the function should return None .
  • Implement a method make_random_move() that should return a random move (i, j) .
  • This function will be called if a safe move is not possible: if the AI agent doesn’t know where to move, it will choose to move randomly instead.
  • The move must not be a move that has already been made.
  • The move must not be a move that is known to be a mine.
  • If no such moves are possible, the function should return None .

The following python code snippet shows the implementation of a few of the above functions:

class MinesweeperAI():
    """
    Minesweeper game player
    """    
    def mark_mine(self, cell):
        """
        Marks a cell as a mine, and updates all knowledge
        to mark that cell as a mine as well.
        """
        self.mines.add(cell)
        for sentence in self.knowledge:
            sentence.mark_mine(cell)

    # helper used to ensure that no duplicate sentences are added
    def knowledge_contains(self, sentence):
        for s in self.knowledge:
            if s == sentence:
               return True
        return False

    def add_knowledge(self, cell, count):
        """
        Called when the Minesweeper board tells us, for a given
        safe cell, how many neighboring cells have mines in them.        
        This function should:
            1) mark the cell as a move that has been made
            2) mark the cell as safe
            3) add a new sentence to the AI's knowledge base
               based on the value of `cell` and `count`
            4) mark any additional cells as safe or as mines
               if it can be concluded based on the AI's KB
            5) add any new sentences to the AI's knowledge base
               if they can be inferred from existing knowledge
        """
        # mark the cell as a move that has been made
        self.moves_made.add(cell)  
        # mark the cell as safe
        self.mark_safe(cell)
        # add a new sentence to the AI's knowledge base, 
        # based on the value of `cell` and `count`
        i, j = cell
        cells = []
        for row in range(max(i-1,0), min(i+2,self.height)):
            for col in range(max(j-1,0), min(j+2,self.width)):
                # if some mines in the neighbors are already known, 
                # make sure to decrement the count
                if (row, col) in self.mines:
                   count -= 1 
                if (not (row, col) in self.safes) and \
                   (not (row, col) in self.mines):
                   cells.append((row, col))
        sentence = Sentence(cells, count)
        # add the new sentence to the KB and a few more inference rules here
        
    def make_safe_move(self):
        """
        Returns a safe cell to choose on the Minesweeper board.
        The move must be known to be safe, and not already a move
        that has been made.
        This function may use the knowledge in self.mines,  
        self.safes and self.moves_made, but should not modify any of 
        those values.
        """
        safe_moves = self.safes - self.moves_made
        if len(safe_moves) > 0:
           return safe_moves.pop()
        else:
           return None

The following animations show how the Minesweeper AI agent updates safe / mine cells and updates its knowledge-base, for a few example games (the light green and red cells represent known safe and mine cells, respectively, in the 2nd subplot):

PageRank

Write an AI agent to rank web pages by importance.

  • When search engines like Google display search results, they do so by placing more “important” and higher-quality pages higher in the search results than less important pages. But how does the search engine know which pages are more important than other pages?
  • One heuristic might be that an “important” page is one that many other pages link to, since it’s reasonable to imagine that more sites will link to a higher-quality webpage than a lower-quality webpage.
  • We could therefore imagine a system where each page is given a rank according to the number of incoming links it has from other pages, and higher ranks would signal higher importance.
  • But this definition isn’t perfect: if someone wants to make their page seem more important, then under this system, they could simply create many other pages that link to their desired page to artificially inflate its rank.
  • For that reason, the PageRank algorithm was created by Google’s co-founders (including Larry Page, for whom the algorithm was named).
  • In PageRank’s algorithm, a website is more important if it is linked to by other important websites, and links from less important websites are weighted less.
  • This definition seems a bit circular, but it turns out that there are multiple strategies for calculating these rankings.

Random Surfer Model

One way to think about PageRank is with the random surfer model, which considers the behavior of a hypothetical surfer on the internet who clicks on links at random.

  • The random surfer model imagines a surfer who starts with a web page at random, and then randomly chooses links to follow.
  • A page’s PageRank, then, can be described as the probability that a random surfer is on that page at any given time. After all, if there are more links to a particular page, then it’s more likely that a random surfer will end up on that page.
  • Moreover, a link from a more important site is more likely to be clicked on than a link from a less important site that fewer pages link to, so this model handles weighting links by their importance as well.
  • One way to interpret this model is as a Markov Chain, where each page represents a state, and each page has a transition model that chooses among its links at random. At each time step, the state switches to one of the pages linked to by the current state.
  • By sampling states randomly from the Markov Chain, we can get an estimate for each page’s PageRank. We can start by choosing a page at random, then keep following links at random, keeping track of how many times we’ve visited each page.
  • After we’ve gathered all of our samples (based on a number we choose in advance), the proportion of the time we were on each page might be an estimate for that page’s rank.
  • To ensure we can always get to somewhere else in the corpus of web pages, we’ll introduce to our model a damping factor d .
  • With probability d (where d is usually set around 0.85 ), the random surfer will choose from one of the links on the current page at random. But otherwise (with probability 1 - d ), the random surfer chooses one out of all of the pages in the corpus at random (including the one they are currently on).
  • Our random surfer now starts by choosing a page at random, and then, for each additional sample we’d like to generate, chooses a link from the current page at random with probability d , and chooses any page at random with probability 1 - d (a sketch of this transition model is given right after this list).
  • If we keep track of how many times each page has shown up as a sample, we can treat the proportion of states that were on a given page as its PageRank.
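A minimal sketch of the transition model just described (following the specification above; per the course’s convention, a page with no outgoing links is treated as linking to every page):

def transition_model(corpus, page, damping_factor):
    # corpus maps a page name to the set of pages it links to
    n = len(corpus)
    links = corpus[page]
    if not links:
        # no outgoing links: choose uniformly among all pages
        return {p: 1 / n for p in corpus}
    probs = {p: (1 - damping_factor) / n for p in corpus}
    for p in links:
        probs[p] += damping_factor / len(links)
    return probs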

Iterative Algorithm

  • We can also define a page’s PageRank using a recursive mathematical expression. Let PR(p) be the PageRank of a given page p : the probability that a random surfer ends up on that page.
  • How do we define PR(p)? Well, we know there are two ways that a random surfer could end up on the page: 1. With probability 1 - d , the surfer chose a page at random and ended up on page p . 2. With probability d , the surfer followed a link from a page i to page p .
  • The first condition is fairly straightforward to express mathematically: it’s 1 - d divided by N , where N is the total number of pages across the entire corpus. This is because the 1 - d probability of choosing a page at random is split evenly among all N possible pages.
  • For the second condition, we need to consider each possible page i that links to page p . For each of those incoming pages, let NumLinks(i) be the number of links on page i .
  • Each page i that links to p has its own PageRank, PR(i) , representing the probability that we are on page i at any given time.
  • And since from page i we travel to any of that page’s links with equal probability, we divide PR(i) by the number of links NumLinks(i) to get the probability that we were on page i and chose the link to page p.
  • This gives us the following definition for the PageRank of a page p (written out after this list, since the original figure is not reproduced here).
  • In this formula, d is the damping factor, N is the total number of pages in the corpus, i ranges over all pages that link to page p , and NumLinks(i) is the number of links present on page i .
  • How would we go about calculating PageRank values for each page, then? We can do so via iteration: start by assuming the PageRank of every page is 1 / N (i.e., equally likely to be on any page).
  • Then, use the above formula to calculate new PageRank values for each page, based on the previous PageRank values.
  • If we keep repeating this process, calculating a new set of PageRank values for each page based on the previous set of PageRank values, eventually the PageRank values will converge (i.e., not change by more than a small threshold with each iteration).
  • Now, let’s implement both such approaches for calculating PageRank — calculating both by sampling pages from a Markov Chain random surfer and by iteratively applying the PageRank formula.
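For reference, the PageRank formula described above (reconstructed here from that description) is

$$PR(p) = \frac{1 - d}{N} + d \sum_{i \in \text{Links}(p)} \frac{PR(i)}{\text{NumLinks}(i)}$$

where Links(p) denotes the set of pages that link to p.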

The following python code snippet represents the implementation of PageRank computation with Monte-Carlo sampling.

import random

def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.
    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """
    pages = [page for page in corpus]
    pageranks = {page: 0 for page in pages}
    # the first sample is a page chosen uniformly at random
    page = random.choice(pages)
    pageranks[page] = 1
    for i in range(n - 1):  # n - 1 further samples, n in total
        probs = transition_model(corpus, page, damping_factor)
        # keep the weights aligned with the order of `pages`
        probs = [probs[p] for p in pages]
        page = random.choices(pages, weights=probs, k=1)[0]
        pageranks[page] += 1
    return {page: pageranks[page]/n for page in pageranks}
  • The following animation shows how PageRank changes with sampling-based implementation.
  • Again, the following python code snippet represents the iterative PageRank implementation.
def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.
    Return a dictionary where keys are page names, and values are 
    their estimated PageRank value (a value between 0 and 1). All 
    PageRank values should sum to 1.
    """
    # precompute, for every page, the set of pages that link to it
    incoming = {p: set() for p in corpus}
    for p in corpus:
        for q in corpus[p]:
            incoming[q].add(p)
    pages = [page for page in corpus]
    n = len(pages)
    pageranks = {page: 1/n for page in pages}
    diff = float('inf')
    while diff > 0.001:
        diff = 0
        for page in pages:
            # probability of landing on `page` via a uniform random jump
            p = (1 - damping_factor) / n
            # plus the probability of following a link from a page q
            for q in incoming[page]:
                p += damping_factor * (pageranks[q] / len(corpus[q])
                                       if len(corpus[q]) > 0 else pageranks[q] / n)
            diff = max(diff, abs(pageranks[page] - p))
            pageranks[page] = p
    return {p: pageranks[p] / sum(pageranks.values())
            for p in pageranks}

The following animation shows how the iterative PageRank formula is computed for a sample very small web graph:

Heredity

Write an AI agent to assess the likelihood that a person will have a particular genetic trait.

  • Mutated versions of the GJB2 gene are one of the leading causes of hearing impairment in newborns.
  • Each person carries two versions of the gene, so each person has the potential to possess either 0, 1, or 2 copies of the hearing impairment version of GJB2.
  • Unless a person undergoes genetic testing, though, it’s not so easy to know how many copies of mutated GJB2 a person has.
  • This is some “hidden state”: information that has an effect that we can observe (hearing impairment), but that we don’t necessarily directly know.
  • After all, some people might have 1 or 2 copies of mutated GJB2 but not exhibit hearing impairment, while others might have no copies of mutated GJB2 yet still exhibit hearing impairment.
  • Every child inherits one copy of the GJB2 gene from each of their parents. If a parent has two copies of the mutated gene, then they will pass the mutated gene on to the child; if a parent has no copies of the mutated gene, then they will not pass the mutated gene on to the child; and if a parent has one copy of the mutated gene, then the gene is passed on to the child with probability 0.5.
  • After a gene is passed on, though, it has some probability of undergoing additional mutation: changing from a version of the gene that causes hearing impairment to a version that doesn’t, or vice versa.
  • We can attempt to model all of these relationships by forming a Bayesian Network of all the relevant variables, as in the one below, which considers a family of two parents and a single child.
  • Each person in the family has a Gene random variable representing how many copies of a particular gene (e.g., the hearing impairment version of GJB2) a person has: a value that is 0, 1, or 2.
  • Each person in the family also has a Trait random variable, which is yes or no depending on whether that person expresses a trait (e.g., hearing impairment) based on that gene.
  • There’s an arrow from each person’s Gene variable to their Trait variable to encode the idea that a person’s genes affect the probability that they have a particular trait.
  • Meanwhile, there’s also an arrow from both the mother and father’s Gene random variable to their child’s Gene random variable: the child’s genes are dependent on the genes of their parents.
  • The task in this problem is to use this model to make inferences about a population. Given information about people, who their parents are, and whether they have a particular observable trait (e.g. hearing loss) caused by a given gene, the AI agent will infer the probability distribution for each person’s genes, as well as the probability distribution for whether any person will exhibit the trait in question.
  • The next figure shows the concepts required for running inference on a Bayes Network
  • Recall that to compute a joint probability of multiple events, you can do so by multiplying those probabilities together. But remember that for any child, the probability of them having a certain number of genes is conditional on what genes their parents have.
  • The following figure shows the given data files.
  • Implement a function joint_probability() that should take as input a dictionary of people, along with data about who has how many copies of each of the genes, and who exhibits the trait. The function should return the joint probability of all of those events taking place.
  • Implement a function update() that adds a new joint distribution probability to the existing probability distributions in probabilities .
  • Implement the function normalize() that updates a dictionary of probabilities such that each probability distribution is normalized.
  • PROBS is a dictionary containing a number of constants representing probabilities of various different events. All of these events have to do with how many copies of a particular gene a person has , and whether a person exhibits a particular trait based on that gene.
  • PROBS[“gene”] represents the unconditional probability distribution over the gene (i.e., the probability if we know nothing about that person’s parents).
  • PROBS[“trait”] represents the conditional probability that a person exhibits a trait.
  • PROBS[“mutation”] is the probability that a gene mutates from being the gene in question to not being that gene, and vice versa (an illustrative sketch of the PROBS layout is given after this list).
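The dictionary has the following shape (a sketch of the layout only; the numeric values below are illustrative rather than quoted from the assignment’s distribution code):

PROBS = {
    # unconditional distribution over the number of copies of the gene
    "gene": {2: 0.01, 1: 0.03, 0: 0.96},
    # P(trait | number of copies of the gene)
    "trait": {
        2: {True: 0.65, False: 0.35},
        1: {True: 0.56, False: 0.44},
        0: {True: 0.01, False: 0.99}
    },
    # probability that a gene flips to the other version when passed on
    "mutation": 0.01
}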

The following python code snippet shows how to compute the joint probabilities:

import numpy as np

def joint_probability(people, one_gene, two_genes, have_trait):
    """
    Compute and return a joint probability.
    The probability returned should be the probability that
    * everyone in set `one_gene` has one copy of the gene, and
    * everyone in set `two_genes` has two copies of the gene, and
    * everyone not in `one_gene` / `two_genes` doesn't have the gene
    * everyone in set `have_trait` has the trait, and
    * everyone not in set `have_trait` does not have the trait.
    """
    prob_gene = PROBS['gene']
    prob_trait_gene = PROBS['trait']
    prob_mutation = PROBS['mutation']
    prob_gene_pgenes = { # CPT: P(child's gene count | parents' gene counts)
        2: {
            0: {
                 0: (1-prob_mutation)*prob_mutation,
                 1: (1-prob_mutation)*(1-prob_mutation) + \
                                   prob_mutation*prob_mutation,
                 2: prob_mutation*(1-prob_mutation)
            },
            1: {
                 0: prob_mutation/2,
                 1: 0.5,
                 2: (1-prob_mutation)/2
            },
            2: {
                 0: prob_mutation*prob_mutation,
                 1: 2*(1-prob_mutation)*prob_mutation,
                 2: (1-prob_mutation)*(1-prob_mutation)
            }
        },
        # compute the probabilities for the other cases...
    }

    all_probs = {}
    for name in people:
        has_trait = name in have_trait
        num_gene = 1 if name in one_gene else \
                   2 if name in two_genes else 0
        # no parents listed: use the unconditional gene distribution
        prob = prob_trait_gene[num_gene][has_trait] * \
               prob_gene[num_gene]
        if people[name]['mother'] != None:
           mother, father = people[name]['mother'], \
                            people[name]['father']
           num_gene_mother = 1 if mother in one_gene else \
                             2 if mother in two_genes else 0
           num_gene_father = 1 if father in one_gene else \
                             2 if father in two_genes else 0
           # parents listed: condition on the parents' gene counts
           prob = prob_trait_gene[num_gene][has_trait] * \
                  prob_gene_pgenes[num_gene_mother]\
                                  [num_gene_father][num_gene]
        all_probs[name] = prob
    probability = np.prod(list(all_probs.values()))
    return probability

The following animations show how the CPTs (conditional probability tables) are computed for the Bayesian Networks corresponding to the three families, respectively.

The following figure shows the conditional probabilities (computed with simulation) that a child does / doesn’t have the gene, given whether the parents do / don’t have the gene.

Crossword

Write an AI agent to generate crossword puzzles.

  • Given the structure of a crossword puzzle (i.e., which squares of the grid are meant to be filled in with a letter), and a list of words to use, the problem becomes one of choosing which words should go in each vertical or horizontal sequence of squares.
  • We can model this sort of problem as a constraint satisfaction problem.
  • Each sequence of squares is one variable, for which we need to decide on its value (which word in the domain of possible words will fill in that sequence). Consider the following crossword puzzle structure.
  • In this structure, we have four variables, representing the four words we need to fill into this crossword puzzle (each indicated by a number in the above image). Each variable is defined by four values: the row it begins on (its i value), the column it begins on (its j value), the direction of the word (either down or across), and the length of the word.
  • Variable 1, for example, would be a variable represented by a row of 1 (assuming 0 indexed counting from the top), a column of 1 (also assuming 0 indexed counting from the left), a direction of across, and a length of 4.
  • As with many constraint satisfaction problems, these variables have both unary and binary constraints. The unary constraint on a variable is given by its length.
  • For Variable 1, for instance, the value BYTE would satisfy the unary constraint, but the value BIT would not (it has the wrong number of letters).
  • Any values that don’t satisfy a variable’s unary constraints can therefore be removed from the variable’s domain immediately.
  • The binary constraints on a variable are given by its overlap with neighboring variables. Variable 1 has a single neighbor: Variable 2. Variable 2 has two neighbors: Variable 1 and Variable 3.
  • For each pair of neighboring variables, those variables share an overlap: a single square that is common to them both.
  • We can represent that overlap as the character index in each variable’s word that must be the same character.
  • For example, the overlap between Variable 1 and Variable 2 might be represented as the pair (1, 0), meaning that Variable 1’s character at index 1 necessarily must be the same as Variable 2’s character at index 0 (assuming 0-indexing, again).
  • The overlap between Variable 2 and Variable 3 would therefore be represented as the pair (3, 1): character 3 of Variable 2’s value must be the same as character 1 of Variable 3’s value.
  • For this problem, we’ll add the additional constraint that all words must be different: the same word should not be repeated multiple times in the puzzle.
  • The challenge ahead, then, is to write a program to find a satisfying assignment: a different word (from a given vocabulary list) for each variable such that all of the unary and binary constraints are met.
  • The following figure shows the theory required to solve the problem:
  • Implement a function enforce_node_consistency() that should update the domains such that each variable is node consistent.
  • Implement a function ac3() that should, using the AC3 algorithm, enforce arc consistency on the problem. Recall that arc consistency is achieved when all the values in each variable’s domain satisfy that variable’s binary constraints.
  • Recall that the AC3 algorithm maintains a queue of arcs to process. This function takes an optional argument called arcs, representing an initial list of arcs to process. If arcs is None, the function should start with an initial queue of all of the arcs in the problem. Otherwise, the algorithm should begin with an initial queue of only the arcs that are in the list arcs (a sketch of this function is given after this list).
  • Implement a function consistent() that should check to see if a given assignment is consistent and a function assignment_complete() that should check if a given assignment is complete.
  • Implement a function select_unassigned_variable() that should return a single variable in the crossword puzzle that is not yet assigned by assignment, according to the minimum remaining values heuristic (i.e., the variable with the fewest remaining values in its domain) and then, to break ties, the degree heuristic.
  • Implement a function backtrack() that should accept a partial assignment as input and, using backtracking search, return a complete satisfactory assignment of variables to values if it is possible to do so.
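Before those, here is a minimal sketch of the revise() / ac3() pair described above, written as methods of the same solver class (assuming, as in the assignment’s helper class, that crossword.overlaps is a dict keyed by variable pairs and that crossword.neighbors() returns a set of variables):

from collections import deque

def revise(self, x, y):
    # make x arc consistent with y; return True if x's domain shrank
    overlap = self.crossword.overlaps[x, y]
    if overlap is None:
        return False
    i, j = overlap
    revised = False
    for word_x in set(self.domains[x]):
        # keep word_x only if some word in y's domain matches at the overlap
        if not any(word_x[i] == word_y[j] for word_y in self.domains[y]):
            self.domains[x].remove(word_x)
            revised = True
    return revised

def ac3(self, arcs=None):
    queue = deque(arcs if arcs is not None else
                  [(x, y) for x in self.crossword.variables
                          for y in self.crossword.neighbors(x)])
    while queue:
        x, y = queue.popleft()
        if self.revise(x, y):
            if len(self.domains[x]) == 0:
                return False  # a domain was wiped out: no solution
            for z in self.crossword.neighbors(x) - {y}:
                queue.append((z, x))
    return True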

The following python code snippet shows a couple of the above function implementations:

def select_unassigned_variable(self, assignment):
    """
    Return an unassigned variable not already part of `assignment`.
    Choose the variable with the minimum number of remaining values
    in its domain. If there is a tie, choose the variable with the 
    highest degree. If there is a tie, any of the tied variables 
    are acceptable return values.
    """
    unassigned_vars = list(self.crossword.variables - \
                               set(assignment.keys()))
    mrv, svar = float('inf'), None
    for var in unassigned_vars:
        if len(self.domains[var]) < mrv:
           mrv, svar = len(self.domains[var]), var
        elif len(self.domains[var]) == mrv:
           # break ties with the degree heuristic (more neighbors first)
           if len(self.crossword.neighbors(var)) > \
              len(self.crossword.neighbors(svar)):
              svar = var
    return svar

def backtrack(self, assignment):
    """
    Using Backtracking Search, take as input a partial assignment 
    for the crossword and return a complete assignment if possible 
    to do so.
    `assignment` is a mapping from variables (keys) to words 
    (values).
    If no assignment is possible, return None.
    """
    # check if the assignment is already complete
    if len(assignment) == len(self.crossword.variables):
        return assignment
    # try a new variable
    var = self.select_unassigned_variable(assignment)
    for value in self.domains[var]:
        new_assignment = assignment.copy()
        new_assignment[var] = value
        if self.consistent(new_assignment):
           result = self.backtrack(new_assignment)
           if result is not None:
              return result
    return None

The following animation shows how the backtracking algorithm with AC3 generates the crossword with the given structure.

######_
____##_
_##____
_##_##_
_##_##_
#___##_

Now let’s compute the speedup obtained with backtracking with and without AC3 (on average), i.e., compare the time to generate a crossword puzzle (with the above structure), averaged over 1000 runs (each time the puzzle generated may be different though). The next figure shows the result of the runtime experiment: backtracking with AC3 is considerably faster on average than the backtracking implementation without AC3.

Shopping

Write an AI agent to predict whether online shopping customers will complete a purchase.

  • When users are shopping online, not all will end up purchasing something. Most visitors to an online shopping website, in fact, likely don’t end up going through with a purchase during that web browsing session.
  • It might be useful, though, for a shopping website to be able to predict whether a user intends to make a purchase or not: perhaps displaying different content to the user, like showing the user a discount offer if the website believes the user isn’t planning to complete the purchase.
  • How could a website determine a user’s purchasing intent? That’s where machine learning will come in.
  • The task in this problem is to build a nearest-neighbor classifier to solve this problem. Given information about a user — how many pages they’ve visited, whether they’re shopping on a weekend, what web browser they’re using, etc. — the classifier will predict whether or not the user will make a purchase.
  • The classifier won’t be perfectly accurate — but it should be better than guessing randomly.
  • To train the classifier, a dataset of about 12,000 user sessions from a shopping website is provided.
  • How do we measure the accuracy of a system like this? If we have a testing data set, we could run our classifier and compute what proportion of the time we correctly classify the user’s intent.
  • This would give us a single accuracy percentage. But that number might be a little misleading. Imagine, for example, that about 15% of all users end up going through with a purchase. A classifier that always predicted that the user would not go through with a purchase would then measure as 85% accurate: the only users it classifies incorrectly are the 15% of users who do go through with a purchase. And while 85% accuracy sounds pretty good, that doesn’t seem like a very useful classifier.
  • Instead, we’ll measure two values: sensitivity (also known as the “true positive rate”) and specificity (also known as the “true negative rate”).
  • Sensitivity refers to the proportion of positive examples that were correctly identified: in other words, the proportion of users who did go through with a purchase who were correctly identified.
  • Specificity refers to the proportion of negative examples that were correctly identified: in this case, the proportion of users who did not go through with a purchase who were correctly identified.
  • So our “always guess no” classifier from before would have perfect specificity (1.0) but no sensitivity (0.0). Our goal is to build a classifier that performs reasonably on both metrics.
  • First few rows from the csv dataset are shown in the following figure:
  • There are about 12,000 user sessions represented in this spreadsheet, one row per user session.
  • The first six columns measure the different types of pages users have visited in the session: the Administrative , Informational , and ProductRelated columns measure how many of those types of pages the user visited, and their corresponding _Duration columns measure how much time the user spent on any of those pages.
  • The BounceRates , ExitRates , and PageValues columns measure information from Google Analytics about the page the user visited. SpecialDay is a value that measures how close the date of the user’s session is to a special day (like Valentine’s Day or Mother’s Day).
  • Month is an abbreviation of the month the user visited. OperatingSystems , Browser , Region, and TrafficType are all integers describing information about the user themself.
  • VisitorType will take on the value Returning_Visitor for returning visitors and New_Visitor for new visitors to the site.
  • Weekend is TRUE or FALSE depending on whether or not the user is visiting on a weekend.
  • Perhaps the most important column, though, is the last one: the Revenue column. This is the column that indicates whether the user ultimately made a purchase or not: TRUE if they did, FALSE if they didn’t. This is the column that we’d like to learn to predict (the “label”), based on the values for all of the other columns (the “evidence”).
  • Implement a function load_data() that should accept a CSV filename as its argument, open that file, and return a tuple (evidence, labels). evidence should be a list of all of the evidence for each of the data points, and labels should be a list of all of the labels for each data point (a sketch of load_data() is given right after this list).
  • Implement a function train_model() that should accept a list of evidence and a list of labels, and return a scikit-learn nearest-neighbor classifier (a k-nearest-neighbor classifier where k = 1) fitted on that training data.
  • The relevant concepts in supervised machine learning (classification) needed for this problem are described in the following figure.
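Here is a minimal sketch of how load_data() could be implemented. The numeric conversions follow the column descriptions above, but this is an illustrative version (the month list and the exact string encodings are assumptions based on the dataset description), not necessarily the exact project solution.

import csv

MONTHS = ['Jan', 'Feb', 'Mar', 'Apr', 'May', 'June',
          'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec']

def load_data(filename):
    # Return (evidence, labels): one evidence list and one 0/1 label
    # per user session in the CSV file
    evidence, labels = [], []
    with open(filename) as f:
        for row in csv.DictReader(f):
            evidence.append([
                int(row['Administrative']),
                float(row['Administrative_Duration']),
                int(row['Informational']),
                float(row['Informational_Duration']),
                int(row['ProductRelated']),
                float(row['ProductRelated_Duration']),
                float(row['BounceRates']), float(row['ExitRates']),
                float(row['PageValues']), float(row['SpecialDay']),
                MONTHS.index(row['Month']),        # month as index 0-11
                int(row['OperatingSystems']), int(row['Browser']),
                int(row['Region']), int(row['TrafficType']),
                1 if row['VisitorType'] == 'Returning_Visitor' else 0,
                1 if row['Weekend'] == 'TRUE' else 0,
            ])
            labels.append(1 if row['Revenue'] == 'TRUE' else 0)
    return evidence, labels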

Use the following python code snippet to divide the dataset into two parts: the first one to train the 1-NN model on and the second one being a held-out test dataset to evaluate the model using the performance metrics.

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
TEST_SIZE = 0.4
# Load data from spreadsheet and split into train and test sets
evidence, labels = load_data('shopping.csv')
X_train, X_test, y_train, y_test = train_test_split(
   evidence, labels, test_size=TEST_SIZE
)
# Train model and make predictions
model = train_model(X_train, y_train)
predictions = model.predict(X_test)
sensitivity, specificity = evaluate(y_test, predictions)

def train_model(evidence, labels):
    """
    Given a list of evidence lists and a list of labels, return a
    fitted k-nearest neighbor model (k=1) trained on the data.
    """
    model = KNeighborsClassifier(n_neighbors=1)
    model.fit(evidence, labels)
    return model
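
The evaluate() function called in the snippet above returns the sensitivity and specificity defined earlier; a minimal sketch, assuming the positive label is 1 and the negative label is 0:

def evaluate(labels, predictions):
    # sensitivity = true positive rate, specificity = true negative rate
    true_pos = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    true_neg = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 0)
    positives = sum(1 for y in labels if y == 1)
    negatives = sum(1 for y in labels if y == 0)
    return true_pos / positives, true_neg / negatives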

With 60–40 validation, the following result is obtained on a sample run: the model is trained on the 60% training split, used to predict on the 40% held-out test split, and the test predictions are compared with the ground truth:

Correct: 3881
Incorrect: 1051
True Positive Rate: 31.14%
True Negative Rate: 87.37%

Now, let’s repeat the experiment 25 times (with different random seeds for train_test_split()) and plot the sensitivity against the specificity for every random partition, along with color-coded accuracy, to obtain a plot like the following one.

with the mean performance metrics reported as follows (to reduce the variability in the result):

Sensitivity: 0.3044799546457174 
Specificity: 0.8747533289505658 
Accuracy: 0.7867234387672343

As can be seen from the above result, the specificity is quite good, but the sensitivity obtained is quite poor.

Let’s now plot the decision boundary obtained with the 1-NN classifier for different 60–40 train-test splits, with the data projected onto the first two principal components. The output is shown in the following animation (here the red points represent the sessions where a purchase happened and the blue points the sessions where it did not).

Nim

Implement an AI agent that teaches itself to play Nim through reinforcement learning.

  • Recall that in the game Nim, we begin with some number of piles, each with some number of objects. Players take turns: on a player’s turn, the player removes one or more objects from any one non-empty pile. Whoever removes the last object loses.
  • There’s some simple strategy you might imagine for this game: if there’s only one pile and three objects left in it, and it’s your turn, your best bet is to remove two of those objects, leaving your opponent with the third and final object to remove.
  • But if there are more piles, the strategy gets considerably more complicated. In this problem, we’ll build an AI agent to learn the strategy for this game through reinforcement learning. By playing against itself repeatedly and learning from experience, eventually our AI agent will learn which actions to take and which actions to avoid.
  • In particular, we’ll use Q-learning for this project. Recall that in Q-learning, we try to learn a reward value (a number) for every (state,action) pair. An action that loses the game will have a reward of -1, an action that results in the other player losing the game will have a reward of 1, and an action that results in the game continuing has an immediate reward of 0, but will also have some future reward.
  • How will we represent the states and actions inside of a Python program? A “state” of the Nim game is just the current size of all of the piles. A state, for example, might be [1, 1, 3, 5] , representing the state with 1 object in pile 0, 1 object in pile 1, 3 objects in pile 2, and 5 objects in pile 3. An “action” in the Nim game will be a pair of integers (i, j) , representing the action of taking j objects from pile i . So the action (3, 5) represents the action “from pile 3, take away 5 objects.” Applying that action to the state [1, 1, 3, 5] would result in the new state [1, 1, 3, 0] (the same state, but with pile 3 now empty).
  • Recall that the key formula for Q-learning is below. Every time we are in a state s and take an action a , we can update the Q-value Q(s, a) according to: Q(s, a) <- Q(s, a) + alpha * (new value estimate - old value estimate)
  • In the above formula, alpha is the learning rate (how much we value new information compared to information we already have).
  • The new value estimate represents the sum of the reward received for the current action and the estimate of all the future rewards that the player will receive.
  • The old value estimate is just the existing value for Q(s, a) . By applying this formula every time our AI agent takes a new action, over time our AI agent will start to learn which actions are better in any state.
  • The concepts required for reinforcement learning and Q-learning are shown in the following figure.

The next python code snippet shows how to implement the q-value update for Q-learning with the update() method from the NimAI class.

class NimAI():    
    def update(self, old_state, action, new_state, reward):
        """
        Update Q-learning model, given an old state, an action taken
        in that state, a new resulting state, and the reward 
        received from taking that action.
        """
        old = self.get_q_value(old_state, action)
        best_future = self.best_future_reward(new_state)
        self.update_q_value(old_state, action, old, reward, 
                                                    best_future)    
    def update_q_value(self, state, action, old_q, reward, 
                                                   future_rewards):
        """
        Update the Q-value for the state `state` and the action 
        `action` given the previous Q-value `old_q`, a current 
        reward `reward`, and an estimate of future rewards   
        `future_rewards`. 
        Use the formula:        Q(s, a) <- old value estimate
        + alpha * (new value estimate - old value estimate)        
        where `old value estimate` is the previous Q-value,
        `alpha` is the learning rate, and `new value estimate`
        is the sum of the current reward and estimated future 
        rewards.
        """
        self.q[(tuple(state), action)] = old_q + self.alpha * (
                              (reward + future_rewards) - old_q)
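
The update() method above relies on two helpers. Minimal sketches are given below, assuming self.q is a dictionary keyed by (state, action) tuples (as in update_q_value() above) and that Nim.available_actions(state) returns the set of legal actions in a state; both assumptions are illustrative, not necessarily the exact project code.

    def get_q_value(self, state, action):
        # Return Q(s, a), or 0 if the pair has not been seen yet
        return self.q.get((tuple(state), action), 0)

    def best_future_reward(self, state):
        # Return max Q(s, a') over the actions a' available in `state`,
        # or 0 if there are no available actions
        actions = Nim.available_actions(state)
        if not actions:
            return 0
        return max(self.get_q_value(state, a) for a in actions)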

The AI agent plays 10000 training games to update the q-values; the following animation shows how the q-values are updated over the last few games (for a few selected state-action pairs):

The following animation shows an example game played between the AI agent (the computer) and the human (the green, red and white boxes in each row of the first subplot represent the non-empty, to be removed and empty piles for the corresponding row, respectively, for the AI agent and the human player).

As can be seen from the above figure, one of the actions with the highest q-value for the AI agent’s first turn is (3,6), which is why it removes 6 objects from pile 3. Similarly, on the AI agent’s second turn, one of the best actions is (2,5), which is why it removes 5 objects from pile 2, and so on.

The following animation shows another example game the AI agent plays against a human:

Traffic

Write an AI agent to identify which traffic sign appears in a photograph.

  • As research continues in the development of self-driving cars, one of the key challenges is computer vision, allowing these cars to develop an understanding of their environment from digital images.
  • In particular, this involves the ability to recognize and distinguish road signs — stop signs, speed limit signs, yield signs, and more.
  • The following figure contains the theory required to perform the multi-class classification task using a deep neural network.
  • In this problem, we shall use TensorFlow (https://www.tensorflow.org/) to build a neural network to classify road signs based on an image of those signs.
  • To do so, we shall need a labeled dataset: a collection of images that have already been categorized by the road sign represented in them.
  • Several such data sets exist, but for this project, we’ll use the German Traffic Sign Recognition Benchmark (http://benchmark.ini.rub.de/?section=gtsrb&subsection=news) (GTSRB) dataset, which contains thousands of images of 43 different kinds of road signs.
  • The dataset contains 43 sub-directories, numbered 0 through 42. Each numbered sub-directory represents a different category (a different type of road sign). Within each traffic sign’s directory is a collection of images of that type of traffic sign.
  • Implement the load_data() function which should accept as an argument data_dir, representing the path to a directory where the data is stored, and return image arrays and labels for each image in the data set (a sketch is given after the model code below).
  • While loading the dataset, the images have to be resized to a fixed size; if resized correctly, each image will have shape (30, 30, 3).
  • Also, implement the function get_model() which should return a compiled neural network model.
  • The following python code snippet shows how to implement the classification model with keras, along with the output produced when the model is run.
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

EPOCHS = 10
IMG_WIDTH = 30
IMG_HEIGHT = 30
NUM_CATEGORIES = 43
TEST_SIZE = 0.4

# Get image arrays and labels for all image files
images, labels = load_data('gtsrb')

# Split data into training and testing sets
labels = tf.keras.utils.to_categorical(labels)
x_train, x_test, y_train, y_test = train_test_split(
     np.array(images), np.array(labels), test_size=TEST_SIZE
)

# Get a compiled neural network
model = get_model()

# Fit model on training data
model.fit(x_train, y_train, epochs=EPOCHS)

# Evaluate neural network performance
model.evaluate(x_test,  y_test, verbose=2)

def get_model():
    """
    Returns a compiled convolutional neural network model. 
    Assume that the `input_shape` of the first layer is 
    `(IMG_WIDTH, IMG_HEIGHT, 3)`.
    The output layer should have `NUM_CATEGORIES` units, one for 
    each category.
    """
    # Create a convolutional neural network
    model = tf.keras.models.Sequential([       
       # Convolutional layer. Learn 32 filters using a 3x3 kernel
       tf.keras.layers.Conv2D(
         32, (3, 3), activation="relu", input_shape=(30, 30, 3)
       ),       
       # Max-pooling layer, using 2x2 pool size
       tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),       
       # Convolutional layer. Learn 64 filters using a 3x3 kernel
       tf.keras.layers.Conv2D(
         64, (3, 3), activation="relu"
       ),       
       # Max-pooling layer, using 2x2 pool size
       tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),
       # Convolutional layer. Learn 128 filters using a 3x3 kernel
       tf.keras.layers.Conv2D(
         128, (3, 3), activation="relu"
       ),       
       # Max-pooling layer, using 2x2 pool size
       tf.keras.layers.MaxPooling2D(pool_size=(2, 2)),       
       # Flatten units
       tf.keras.layers.Flatten(),       
       # Add a hidden layer with dropout
       tf.keras.layers.Dense(256, activation="relu"),
       tf.keras.layers.Dropout(0.5),       
       # Add an output layer with NUM_CATEGORIES output units
       tf.keras.layers.Dense(NUM_CATEGORIES, activation="softmax")
    ])
    # Print a summary of the model architecture
    model.summary()
    # Compile the neural network
    model.compile(
       optimizer="adam",
       loss="categorical_crossentropy",
       metrics=["accuracy"]
    )
    return model
    
# Model: "sequential"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #
# =================================================================
# conv2d (Conv2D)              (None, 28, 28, 32)        896
# _________________________________________________________________
# max_pooling2d (MaxPooling2D) (None, 14, 14, 32)        0
# _________________________________________________________________
# conv2d_1 (Conv2D)            (None, 12, 12, 64)        18496
# _________________________________________________________________
# max_pooling2d_1 (MaxPooling2 (None, 6, 6, 64)          0
# _________________________________________________________________
# conv2d_2 (Conv2D)            (None, 4, 4, 128)         73856
# _________________________________________________________________
# max_pooling2d_2 (MaxPooling2 (None, 2, 2, 128)         0
# _________________________________________________________________
# flatten (Flatten)            (None, 512)               0
# _________________________________________________________________
# dense (Dense)                (None, 256)               131328
# _________________________________________________________________
# dropout (Dropout)            (None, 256)               0
# _________________________________________________________________
# dense_1 (Dense)              (None, 43)                11051
# =================================================================
# Total params: 235,627
# Trainable params: 235,627
# Non-trainable params: 0
# _________________________________________________________________
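  • The load_data() function used in the snippet above can be sketched as follows, assuming OpenCV (cv2) is used for reading and resizing the images; this is an illustrative version, not necessarily the exact project solution.

import os
import cv2

def load_data(data_dir):
    # Return (images, labels) with every image resized to
    # IMG_WIDTH x IMG_HEIGHT x 3
    images, labels = [], []
    for category in range(NUM_CATEGORIES):
        folder = os.path.join(data_dir, str(category))
        for filename in os.listdir(folder):
            img = cv2.imread(os.path.join(folder, filename))
            img = cv2.resize(img, (IMG_WIDTH, IMG_HEIGHT))
            images.append(img)
            labels.append(category)
    return images, labels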
  • The architecture of the deep network is shown in the next figure:
  • The following output shows how the cross-entropy loss on the training dataset decreases and the accuracy of classification increases over epochs on the training dataset.
Epoch 1/10
15864/15864 [==============================] - 17s 1ms/sample - loss: 2.3697 - acc: 0.4399
Epoch 2/10
15864/15864 [==============================] - 15s 972us/sample - loss: 0.6557 - acc: 0.8133
Epoch 3/10
15864/15864 [==============================] - 16s 1ms/sample - loss: 0.3390 - acc: 0.9038
Epoch 4/10
15864/15864 [==============================] - 16s 995us/sample - loss: 0.2526 - acc: 0.9285
Epoch 5/10
15864/15864 [==============================] - 16s 1ms/sample - loss: 0.1992 - acc: 0.9440
Epoch 6/10
15864/15864 [==============================] - 16s 985us/sample - loss: 0.1705 - acc: 0.9512
Epoch 7/10
15864/15864 [==============================] - 15s 956us/sample - loss: 0.1748 - acc: 0.9536
Epoch 8/10
15864/15864 [==============================] - 15s 957us/sample - loss: 0.1551 - acc: 0.9578
Epoch 9/10
15864/15864 [==============================] - 16s 988us/sample - loss: 0.1410 - acc: 0.9630
Epoch 10/10
15864/15864 [==============================] - 14s 878us/sample - loss: 0.1303 - acc: 0.9667
10577/10577 - 3s - loss: 0.1214 - acc: 0.9707
  • The following animation again shows how the cross-entropy loss decreases and the classification accuracy increases over epochs on the training dataset.
  • The following figure shows a few sample images from the traffic dataset with different labels.
  • The next animations show the features learnt at various convolution and pooling layers.
  • The next figure shows the evaluation result on test dataset — the corresponding confusion matrix.

Parser

Write an AI agent to parse sentences and extract noun phrases.

  • A common task in natural language processing is parsing, the process of determining the structure of a sentence.
  • This is useful for a number of reasons: knowing the structure of a sentence can help a computer to better understand the meaning of the sentence, and it can also help the computer extract information out of a sentence.
  • In particular, it’s often useful to extract noun phrases out of a sentence to get an understanding of what the sentence is about.
  • In this problem, we’ll use the context-free grammar formalism to parse English sentences to determine their structure.
  • Recall that in a context-free grammar, we repeatedly apply rewriting rules to transform symbols into other symbols. The objective is to start with a non-terminal symbol S (representing a sentence) and repeatedly apply context-free grammar rules until we generate a complete sentence of terminal symbols (i.e., words).
  • The rule S -> N V , for example, means that the S symbol can be rewritten as N V (a noun followed by a verb). If we also have the rule N -> “Holmes” and the rule V -> “sat” , we can generate the complete sentence “Holmes sat.” .
  • Of course, noun phrases might not always be as simple as a single word like “Holmes” . We might have noun phrases like “my companion” or “a country walk” or “the day before Thursday” , which require more complex rules to account for.
  • To account for the phrase “my companion” , for example, we might imagine a rule like: NP -> N | Det N. In this rule, we say that an NP (a “noun phrase”) could be either just a noun ( N ) or a determiner ( Det ) followed by a noun, where determiners include words like “a” , “the” , and “my” . The vertical bar ( | ) just indicates that there are multiple possible ways to rewrite an NP , with each possible rewrite separated by a bar.
  • To incorporate this rule into how we parse a sentence ( S ), we’ll also need to modify our S -> N V rule to allow for noun phrases ( NP s) as the subject of our sentence. See how? And to account for more complex types of noun phrases, we may need to modify our grammar even further.
  • The sentences that need to be parsed by the CFG (to be constructed) are shown below:
Holmes sat.
Holmes lit a pipe.
We arrived the day before Thursday.
Holmes sat in the red armchair and he chuckled.
My companion smiled an enigmatical smile. 
Holmes chuckled to himself.
She never said a word until we were at the door here.
Holmes sat down and lit his pipe.
I had a country walk on Thursday and came home in a dreadful mess.
I had a little moist red paint in the palm of my hand.

Implementation tips

  • The NONTERMINALS global variable should be replaced with a set of context-free grammar rules that, when combined with the rules in TERMINALS , allow the parsing of all sentences in the sentences/ directory.
  • Each rule must be on its own line. Each rule must include the -> characters to denote which symbol is being replaced, and may optionally include | symbols if there are multiple ways to rewrite a symbol.
  • You do not need to keep the existing rule S -> N V in your solution, but the first rule must begin with S -> since S (representing a sentence) is the starting symbol.
  • May add as many nonterminal symbols as needed.
  • Use the nonterminal symbol NP to represent a “noun phrase”, such as the subject of a sentence.
  • A few rules needed to parse the given sentences are shown below (we need to add a few more rules):
NONTERMINALS = """
S -> NP VP | S Conj VP | S Conj S | S P S
NP -> N | Det N | AP NP | Det AP NP | N PP | Det N PP
"""
  • A few rules for the TERMINALS from the given grammar are shown below:
TERMINALS = """
Adj -> "country" | "dreadful" | "enigmatical" | "little" | "moist" | "red"
Conj -> "and"
Det -> "a" | "an" | "his" | "my" | "the"
N -> "armchair" | "companion" | "day" | "door" | "hand" | "he" | "himself"
P -> "at" | "before" | "in" | "of" | "on" | "to" | "until"
V -> "smiled" | "tell" | "were"
"""
  • The next python code snippet implements the function np_chunk() that should accept an nltk tree representing the syntax of a sentence, and return a list of all of the noun phrase chunks in that sentence.
  • For this problem, a “noun phrase chunk” is defined as a noun phrase that doesn’t contain other noun phrases within it. Put more formally, a noun phrase chunk is a subtree of the original tree whose label is NP and that does not itself contain other noun phrases as subtrees.
  • Assume that the input will be a nltk.tree object whose label is S (that is to say, the input will be a tree representing a sentence).
  • The function should return a list of nltk.tree objects, where each element has the label NP.
  • The documentation for nltk.tree will be helpful for identifying how to manipulate a nltk.tree object.
import nltk

def np_chunk(tree):
    """
    Return a list of all noun phrase chunks in the sentence tree.
    A noun phrase chunk is defined as any subtree of the sentence
    whose label is "NP" that does not itself contain any other
    noun phrases as subtrees.
    """
    chunks = []
    for s in tree.subtrees(lambda t: t.label() == 'NP'):
        # keep s only if it contains no other NP subtree
        if len(list(s.subtrees(lambda t: t.label() == 'NP' and
                                         t != s))) == 0:
            chunks.append(s)
    return chunks

The following animation shows how the given sentences are parsed with the grammar rules constructed.

Questions

Write an AI agent to answer questions from a given corpus.

  • Question Answering (QA) is a field within natural language processing focused on designing systems that can answer questions.
  • Among the more famous question answering systems is Watson , the IBM computer that competed (and won) on Jeopardy!.
  • A question answering system of Watson’s accuracy requires enormous complexity and vast amounts of data, but in this problem, we’ll design a very simple question answering system based on inverse document frequency.
  • Our question answering system will perform two tasks: document retrieval and passage retrieval.
  • Our system will have access to a corpus of text documents.
  • When presented with a query (a question in English asked by the user), document retrieval will first identify which document(s) are most relevant to the query.
  • Once the top documents are found, the top document(s) will be subdivided into passages (in this case, sentences) so that the most relevant passage to the question can be determined.
  • How do we find the most relevant documents and passages? To find the most relevant documents, we’ll use tf-idf to rank documents based both on term frequency for words in the query as well as inverse document frequency for words in the query. The required concepts are summarized in the following figure:
  • Once we’ve found the most relevant documents, there are many possible metrics for scoring passages, but we’ll use a combination of inverse document frequency and a query term density measure.
  • More sophisticated question answering systems might employ other strategies (analyzing the type of question word used, looking for synonyms of query words, lemmatizing to handle different forms of the same word, etc.), but that comes only after the baseline version is implemented.

Implementation tips

  • Implement a function compute_idfs() that should accept a dictionary of documents and return a new dictionary mapping words to their IDF (inverse document frequency) values.
  • Assume that documents will be a dictionary mapping names of documents to a list of words in that document.
  • The returned dictionary should map every word that appears in at least one of the documents to its inverse document frequency value.
  • Recall that the inverse document frequency of a word is defined by taking the natural logarithm of the number of documents divided by the number of documents in which the word appears.
  • Implement the top_files() function that should, given a query (a set of words), files (a dictionary mapping names of files to a list of their words), and idfs (a dictionary mapping words to their IDF values), return a list of the filenames of the n top files that match the query, ranked according to tf-idf.
  • The returned list should be of length n and should be ordered with the best match first.
  • Files should be ranked according to the sum of tf-idf values for any word in the query that also appears in the file. Words in the query that do not appear in the file should not contribute to the file’s score.
  • Recall that tf-idf for a term is computed by multiplying the number of times the term appears in the document by the IDF value for that term.
  • Implement the function top_sentences() that should, given a query (a set of words), sentences (a dictionary mapping sentences to a list of their words), and idfs (a dictionary mapping words to their IDF values), return a list of the n top sentences that match the query, ranked according to IDF.
  • The returned list of sentences should be of length n and should be ordered with the best match first.
  • Sentences should be ranked according to “matching word measure”: namely, the sum of IDF values for any word in the query that also appears in the sentence. Note that term frequency should not be taken into account here, only inverse document frequency.
  • If two sentences have the same value according to the matching word measure, then sentences with a higher “query term density” should be preferred. Query term density is defined as the proportion of words in the sentence that are also words in the query.
  • The next python code snippet shows how compute_idfs() and top_files() are implemented (a sketch of top_sentences() follows the snippet):
import math

def compute_idfs(documents):
    """
    Given a dictionary of `documents` that maps names of documents 
    to a list of words, return a dictionary that maps words to their 
    IDF values.
    Any word that appears in at least one of the documents should 
    be in the resulting dictionary.
    """
    words = set()
    for document in documents:
        words.update(documents[document])
    idfs = dict()
    for word in words:
        f = sum(word in documents[document] for document in  
                                                documents)
        idfs[word] = math.log(len(documents) / f)
    return idfs

def top_files(query, files, idfs, n):
    """
    Given a `query` (a set of words), `files` (a dictionary mapping 
    names of files to a list of their words), and `idfs` (a 
    dictionary mapping words to their IDF values), return a list of 
    the filenames of the `n` top files that match the query, 
    ranked according to tf-idf.
    """
    scores = []
    for filename in files:
        score = 0
        for word in query:
            tf = len(list(filter(lambda w: w == word, 
                                           files[filename])))
            score += tf * idfs.get(word, 0)
        scores.append((score, filename))
    scores.sort(key = lambda x: -x[0])
    return [item[1] for item in scores][:n]
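
top_sentences() can be implemented along the same lines; the sketch below ranks sentences by the sum of the IDF values of matching query words and breaks ties with query term density, as described above (an illustrative version, not necessarily the exact project solution).

def top_sentences(query, sentences, idfs, n):
    # rank by matching word measure, tie-break by query term density
    scores = []
    for sentence, words in sentences.items():
        idf = sum(idfs.get(word, 0) for word in query if word in words)
        density = sum(1 for word in words if word in query) / len(words)
        scores.append((idf, density, sentence))
    scores.sort(key=lambda x: (-x[0], -x[1]))
    return [item[2] for item in scores][:n]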

The following animation shows the answers to the questions found from the given corpus: for each question, the best candidate text (document) likely containing the answer is first found inside the corpus, and then the top 5 candidate answer sentences are found inside that document.

Artificial Intelligence is a vast area of research containing many different sub-areas that are themselves huge. In this blog, we discussed some problems from Search, Knowledge (Logic) and Uncertainty, Optimization, Learning, Neural Networks and Language. We discussed how to implement AI agents for an adversarial search game (tic-tac-toe), a logic-based game (minesweeper) and a reinforcement learning-based game (Nim). We have also seen how to generate crossword puzzles by representing the problem as a CSP. Finally, we have seen applications of AI in Computer Vision (traffic sign classification needed for self-driving cars) and NLP (a question-answering system with information retrieval).

In wix: https://sandipandey.wixsite.com/simplydatascience/post/solving-a-few-ai-problems-with-python

Gaussian Process Regression With Python

In this blog, we shall discuss Gaussian Process Regression: the basic concepts, how it can be implemented with python from scratch and also using the GPy library. Then we shall demonstrate an application of GPR in Bayesian optimization with the GPyOpt library. The problems appeared in this coursera course on Bayesian methods for Machine Learning by UCSanDiego HSE and also in this Machine learning course provided at UBC.

Gaussian Process

A GP is a Gaussian distribution over functions, specified by two parameters: the mean function m and the kernel (covariance) function K, which controls smoothness. In this article, we shall implement non-linear regression with a GP.

Given training data points (X,y) we want to learn a (non-linear) function f:R^d -> R (here X is d-dimensional), s.t., y = f(x).

Then use the function f to predict the value of y for unseen data points Xtest, along with the confidence of prediction.

The following figure describes the basic concepts of a GP and how it can be used for regression. We need to use the conditional expectation and variance formula (given the data) to compute the posterior distribution for the GP.


Here, we shall first discuss Gaussian Process Regression. Let’s follow the steps below to get some intuition.

  • Generate 10 data points (these points will serve as training datapoints) with negligible noise (corresponds to noiseless GP regression). Use the following python function with default noise variance.

import numpy as np
def generate_noisy_points(n=10, noise_variance=1e-6):
    np.random.seed(777)
    X = np.random.uniform(-3., 3., (n, 1))
    y = np.sin(X) + np.random.randn(n, 1) * noise_variance**0.5
    return X, y
  • Plot the points with the following code snippet.

import matplotlib.pylab as plt
X, y = generate_noisy_points()
plt.plot(X, y, 'x')
plt.show()
  • Now, let’s implement the algorithm for GP regression, the one shown in the above figure. First let’s generate 100 test data points.
Xtest, ytest = generate_noisy_points(100)
Xtest.sort(axis=0)
  • Draw 10 function samples from the GP prior distribution using the following python code.
n_samples, noise_var = 10, 1e-6   # draw 10 samples, (almost) noiseless
n = len(Xtest)
K = kernel(Xtest, Xtest)
L = np.linalg.cholesky(K + noise_var*np.eye(n))
f_prior = np.dot(L, np.random.normal(size=(n, n_samples)))
  • The following animation shows the sample functions drawn from the GP prior distribution.
  • Next, let’s compute the GP posterior distribution given the original (training) 10 data points, using the following python code snippet.
N, n = len(X), len(Xtest)
K = kernel(X, X)
L = np.linalg.cholesky(K + noise_var*np.eye(N))
K_ = kernel(Xtest, Xtest)
Lk = np.linalg.solve(L, kernel(X, Xtest))
mu = np.dot(Lk.T, np.linalg.solve(L, y))
L = np.linalg.cholesky(K_ + noise_var*np.eye(n) - np.dot(Lk.T, Lk))
f_post = mu.reshape(-1,1) + np.dot(L, np.random.normal(size=(n,n_samples)))
  • The following animation shows 10 function samples drawn from the GP posterior distribution. As expected, we get nearly zero uncertainty in the prediction of the points that are present in the training dataset, and the variance increases as we move further from the points.
  • The kernel function used here is the RBF kernel; it can be implemented with the following python code snippet.
def kernel(x, y, l2=0.1):
    # RBF (squared exponential) kernel; l2 is the squared length-scale
    # (given a default so that the earlier calls without l2 also work)
    sqdist = np.sum(x**2,1).reshape(-1,1) + \
             np.sum(y**2,1) - 2*np.dot(x, y.T)
    return np.exp(-.5 * (1/l2) * sqdist)
  • Now, let’s predict with the Gaussian Process Regression model, using the following python function:
def posterior(X, Xtest, l2=0.1, noise_var=1e-6):
    # compute the mean at our test points.
    N, n = len(X), len(Xtest)
    K = kernel(X, X, l2)
    L = np.linalg.cholesky(K + noise_var*np.eye(N))
    Lk = np.linalg.solve(L, kernel(X, Xtest, l2))
    mu = np.dot(Lk.T, np.linalg.solve(L, y))
    # compute the variance at our test points.
    K_ = kernel(Xtest, Xtest, l2)
    sd = np.sqrt(np.diag(K_) - np.sum(Lk**2, axis=0))
    return (mu, sd)
  • Use the above function to predict the mean and standard deviation at x=1.
mu, sd = posterior(X, np.array([[1]]))
print(mu, sd)
# [[0.85437633]] [0.26444361]
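  • The figure below can be reproduced with a snippet along the following lines (a sketch: it plots the posterior mean with a ±3 s.d. band against the true sin function, reusing the training data X, y and the posterior() function from above).

Xtest = np.linspace(-3, 3, 100).reshape(-1, 1)
mu, sd = posterior(X, Xtest)
plt.plot(Xtest, np.sin(Xtest), 'b', label='true function')
plt.plot(Xtest, mu, 'r', label='GP posterior mean')
plt.fill_between(Xtest.ravel(), mu.ravel() - 3*sd,
                 mu.ravel() + 3*sd, color='gray', alpha=0.3)
plt.plot(X, y, 'r+', markersize=12)   # training points
plt.legend()
plt.show()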
  • The following figure shows the predicted values along with the associated 3 s.d. confidence.
  • As can be seen, the highest confidence (corresponding to a zero-width confidence interval) is again at the training data points. The blue curve represents the original function, the red one the predicted function with the GP, and the red “+” points are the training data points.

Noisy Gaussian Process Regression

  • Now let’s increase the noise variance to implement the noisy version of GP.
  • The following animation shows how the predictions and the confidence interval change as the noise variance is increased: the predictions become less and less certain, as expected.
  • Next, let’s see how varying the RBF kernel parameter l changes the confidence interval, in the following animation.

Gaussian processes and Bayesian optimization

Now, let’s learn how to use the GPy and GPyOpt libraries to deal with Gaussian processes. These libraries provide quite simple and intuitive interfaces for training and inference, and we will try to get familiar with them in a few tasks. The following figure shows the basic concepts required for GP regression again.


Gaussian Process Regression with GPy (documentation)

  • Again, let’s start with a simple regression problem, for which we will try to fit a Gaussian Process with RBF kernel.
  • Create RBF kernel with variance sigma_f and length-scale parameter l for 1D samples and compute value of the kernel between points, using the following code snippet.
import GPy
import GPyOpt
import seaborn as sns
sigma_f, l = 1.5, 2
kernel = GPy.kern.RBF(1, sigma_f, l)
sns.heatmap(kernel.K(X, X))
plt.show()
  • The following figure shows what the kernel heatmap looks like (we have 10 points in the training data, so the computed kernel is a 10×10 matrix).
  • Let’s fit a GP on the training data points. Use the kernel from the previous task. As shown in the code below, use the GPy.models.GPRegression class to predict the mean and variance at position 𝑥=1, e.g.
X, y = generate_noisy_points(noise_variance=0.01)
model = GPy.models.GPRegression(X,y,kernel) 
mean, variance = model.predict(np.array([[1]]))
print(mean, variance)
# 0.47161301004863576 1.1778512693257484
  • Let’s see the parameters of the model and plot the model:
print(model)
model.plot()
plt.show()
# Name : GP regression
# Objective : 11.945995014694255
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Updates : True
# Parameters:
# GP_regression.             |               value  |  constraints  |  priors
# rbf.variance               |  0.5884024388364221  |      +ve      |
# rbf.lengthscale            |   1.565659066396689  |      +ve      |
# Gaussian_noise.variance    |                 1.0  |      +ve      |
  • Observe that the model didn’t fit the data very well. Let’s try to fit the kernel and noise parameters automatically.
model.optimize()
print(model)
# Name : GP regression
# Objective : 0.691713117288967
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Updates : True
# Parameters:
# GP_regression.             |                  value  |  constraints  |  priors
# rbf.variance               |     0.5088590289246014  |      +ve      |
# rbf.lengthscale            |     1.1019439281553658  |      +ve      |
# Gaussian_noise.variance    |  0.0030183424485056066  |      +ve      |
  • Now plot the model to obtain a figure like the following one.
  • As can be seen from the above figure, the optimized model now fits the data quite well.

Differentiating noise and signal with GP

  • Generate two datasets: a sinusoid without noise (with the function generate_noisy_points() and noise variance 0) and samples from Gaussian noise (with the function generate_noise() defined below).
def generate_noise(n=10, noise_variance=0.01):
    np.random.seed(777)
    X = np.random.uniform(-3., 3., (n, 1))
    y = np.random.randn(n, 1) * noise_variance**0.5
    return X, y
    
X, y = generate_noise(noise_variance=10)
  • Optimize the kernel parameters and compute the optimal value of the noise component for the pure-noise data.
model = GPy.models.GPRegression(X,y,kernel) 
model.optimize()
print(model)
# Name : GP regression
# Objective : 26.895319516885678
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Updates : True
# Parameters:
# GP_regression.             |              value  |  constraints  |  priors
# rbf.variance               |  4.326712527380182  |      +ve      |
# rbf.lengthscale            |  0.613701590417825  |      +ve      |
# Gaussian_noise.variance    |  9.006031601676087  |      +ve      |
  • Now optimize the kernel parameters and compute the optimal value of the noise component for the signal without noise.
X, y = generate_noisy_points(noise_variance=0)
model = GPy.models.GPRegression(X,y,kernel) 
model.optimize()
print(model)
# Name : GP regression
# Objective : -22.60291145140328
# Number of Parameters : 3
# Number of Optimization Parameters : 3
# Updates : True
# Parameters:
# GP_regression.             |                  value  |  constraints  |  priors
# rbf.variance               |      2.498516381000242  |      +ve      |
# rbf.lengthscale            |     2.4529513517426444  |      +ve      |
# Gaussian_noise.variance    |  5.634716625480888e-16  |      +ve      |
  • As can be seen from the above outputs, the GP distinguishes noise from signal correctly: the Gaussian_noise.variance parameter is high for the pure-noise data and essentially zero for the noiseless signal.

Sparse GP

  • Now let’s consider the speed of GP. Let’s generate a dataset of 1000 points and measure the time consumed for the prediction of mean and variance.
  • Then let’s try to use inducing inputs and find the optimal number of points according to quality-time tradeoff.
  • For the sparse model with inducing points, you should use GPy.models.SparseGPRegression class.
  • The number of inducing inputs can be set with parameter num_inducing and optimize their positions and values with .optimize() call.
  • Let’s first create a dataset of 1000 points and fit GPRegression. Measure time for predicting mean and variance at position 𝑥=1.
import time
X, y = generate_noisy_points(1000)
start = time.time()
model = GPy.models.GPRegression(X,y,kernel)
mean, variance = model.predict(np.array([[1]]))
time_gp = time.time()-start
print(mean, variance, time_gp)
# [[0.84157973]] [[1.08298092e-06]] 7.353320360183716
  • Then fit SparseGPRegression with 10 inducing inputs and repeat the experiment. Let’s find speedup as a ratio between consumed time without and with inducing inputs.
start = time.time()
model = GPy.models.SparseGPRegression(X,y,kernel,num_inducing=10) 
model.optimize()
mean, variance = model.predict(np.array([[1]]))
time_sgp = time.time()-start
print(mean, variance, time_sgp)
# [[0.84159203]] [[1.11154212e-06]] 0.8615052700042725
  • As can be seen, there is a speedup of more than 8x with the sparse GP, using only 10 inducing points.
time_gp / time_sgp
# 8.535432824627119
  • The model is shown in the next figure.

Bayesian optimization

  • Bayesian Optimization is used when the objective function has no explicit (closed) form and is expensive to evaluate.
  • As shown in the next figure, a GP is used along with an acquisition (utility) function to choose the next point to sample, where it’s more likely to find the maximum value in an unknown objective function.
  • A GP is constructed from the points already sampled and the next point is sampled from the region where the GP posterior has higher mean (to exploit) and larger variance (to explore), which is determined by the maximum value of the acquisition function (which is a function of GP posterior mean and variance).
  • To choose the next point to be sampled, the above process is repeated.
  • The next couple of figures show the basic concepts of Bayesian optimization using GP, the algorithm, how it works, along with a few popular acquisition functions.
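As an illustration, the following sketch computes one popular acquisition function, the probability of improvement, from the GP posterior mean and standard deviation at a set of candidate points (standard formula Φ((μ(x) − f_best − ξ)/σ(x)) for maximization; the helper name and the xi parameter below are illustrative).

import numpy as np
from scipy.stats import norm

def probability_of_improvement(mu, sigma, f_best, xi=0.1):
    # mu, sigma: GP posterior mean and s.d. at the candidate points
    # f_best: best objective value observed so far (maximization)
    # xi: exploration-exploitation trade-off parameter
    return norm.cdf((mu - f_best - xi) / sigma)

# the next point to sample is the candidate maximizing the acquisition:
# x_next = candidates[np.argmax(probability_of_improvement(mu, sigma, f_best))]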

Bayesian Optimization with GPyOpt (documentation)

Parameter Tuning for XGBoost Regressor

  • Let’s now try to find the optimal hyperparameters for an XGBoost model using Bayesian optimization with a GP, with the diabetes dataset (from sklearn) as input. Let’s first load the dataset with the following python code snippet:
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor

dataset = datasets.load_diabetes()
X = dataset['data']
y = dataset['target']
  • We will use the cross-validation score to estimate accuracy, and our goal will be to tune the learning_rate, gamma, max_depth, n_estimators and min_child_weight parameters. First, we have to define the optimization function and the domains, as shown in the code below.
# Optimizer will try to find minimum, so let's add a "-" sign.
def f(parameters):
    parameters = parameters[0]
    score = -cross_val_score(
        XGBRegressor(learning_rate=parameters[0],
                     max_depth=int(parameters[2]),
                     n_estimators=int(parameters[3]),
                     gamma=int(parameters[1]),
                     min_child_weight = parameters[4]), 
        X, y, scoring='neg_root_mean_squared_error'
    ).mean()
    score = np.array(score)
    return score
    
# Bounds (define continuous variables first, then discrete!)
bounds = [
    {'name': 'learning_rate',
     'type': 'continuous',
     'domain': (0, 1)},
    {'name': 'gamma',
     'type': 'continuous',
     'domain': (0, 5)},
    {'name': 'max_depth',
     'type': 'discrete',
     'domain': (1, 50)},
    {'name': 'n_estimators',
     'type': 'discrete',
     'domain': (1, 300)},
    {'name': 'min_child_weight',
     'type': 'discrete',
     'domain': (1, 10)}
]
  • Let’s find the baseline RMSE with the default XGBoost parameters. Let’s see if we can do better.
baseline = -cross_val_score(
    XGBRegressor(), X, y, scoring='neg_root_mean_squared_error'
).mean()
baseline
# 64.90693011829266
  • Now, run the Bayesian optimization with GPyOpt and plot convergence, as in the next code snippet:
optimizer = GPyOpt.methods.BayesianOptimization(
    f=f, domain=bounds,
    acquisition_type ='MPI',
    acquisition_par = 0.1,
    exact_eval=True
)
max_iter = 50
max_time = 60
optimizer.run_optimization(max_iter, max_time)
optimizer.plot_convergence()
  • Extract the best values of the parameters and compute the RMSE / gain obtained with Bayesian Optimization, using the following code.
optimizer.X[np.argmin(optimizer.Y)]
# array([2.01515532e-01, 1.35401092e+00, 1.00000000e+00,
# 3.00000000e+02, 1.00000000e+00])
print('RMSE:', np.min(optimizer.Y),
'Gain:', baseline/np.min(optimizer.Y)*100)
# RMSE: 57.6844355488563 Gain: 112.52069904249859
  • As can be seen, we were able to get a 12% boost without tuning the parameters by hand.

Parameter Tuning for SVR

  • Now, let’s tune a Support Vector Regressor model with Bayesian Optimization and find the optimal values for three parameters: C, epsilon and gamma.
  • Let’s use the range (1e-5, 1000) for C, and (1e-5, 10) for both epsilon and gamma.
  • Let’s use MPI as an acquisition function with weight 0.1.
from sklearn.svm import SVR

# Bounds (define continuous variables first, then discrete!)
bounds = [
    {'name': 'C',
     'type': 'continuous',
     'domain': (1e-5, 1000)},
    {'name': 'epsilon',
     'type': 'continuous',
     'domain': (1e-5, 10)},
    {'name': 'gamma',
     'type': 'continuous',
     'domain': (1e-5, 10)}
]

# Score. Optimizer will try to find minimum, so we will add a "-" sign.
def f(parameters):
    parameters = parameters[0]
    score = -cross_val_score(
        SVR(C = parameters[0],
            epsilon = parameters[1],
            gamma = parameters[2]), 
        X, y, scoring='neg_root_mean_squared_error'
    ).mean()
    score = np.array(score)
    return score

optimizer = GPyOpt.methods.BayesianOptimization(
    f=f, domain=bounds,
    acquisition_type ='MPI',
    acquisition_par = 0.1,
    exact_eval=True
)
max_iter = 50*4
max_time = 60*4
optimizer.run_optimization(max_iter, max_time)

baseline = -cross_val_score(
    SVR(), X, y, scoring='neg_root_mean_squared_error'
).mean()
print(baseline)
# 70.44352670586173

print(optimizer.X[np.argmin(optimizer.Y)])
# [126.64337652   8.49323372   8.59189135]

print('RMSE:', np.min(optimizer.Y),
      'Gain:', baseline/np.min(optimizer.Y)*100)
# RMSE: 54.02576574389976 Gain: 130.38876124364006

best_epsilon = optimizer.X[np.argmin(optimizer.Y)][1] 
optimizer.plot_convergence()
  • For the SVR model above, the boost in performance obtained after tuning the hyperparameters was about 30%.

In medium: https://medium.com/swlh/gaussian-process-regression-with-python-edd11bebdce8

Bayesian Machine Learning: MCMC, Latent Dirichlet Allocation and Probabilistic Programming with Python

In this blog we shall focus on sampling and approximate inference by Markov chain Monte Carlo (MCMC). This class of methods can be used to obtain samples from a probability distribution, e.g. a posterior distribution. The samples approximately represent the distribution, and can be used to approximate expectations. The problems we shall work on appeared in an assignment here. Next, we shall learn how to use a library for probabilistic programming and inference called PyMC3. These problems appeared in an assignment in the coursera course Bayesian Methods for Machine Learning by UCSanDiego HSE. Some of the problem statements are taken from the course.


The Metropolis-Hastings algorithm is useful for approximate computation of the posterior distribution, since the exact computation of posterior distribution is often infeasible, the partition function being computationally intractable.

The following figure describes how to sample from a distribution known up to a (normalizing) constant, using the Metropolis-Hastings algorithm.

  • First let’s implement the above MCMC algorithm and use it to perform inference. In particular, we shall first approximate the joint density of a standard bivariate normal distribution: p(x, y)=N(x;0,1)N(y;0,1).
  • We use a proposal distribution Q and accept / reject the next drawn sample using the detailed balance equation.
  • Under the irreducibility assumption, the Markov chain eventually converges to the desired distribution, known up to a constant.
  • When used with the Gaussian proposal distribution, the Metropolis Hastings algorithm is more specifically called Random Walk Metropolis-Hastings algorithm.
  • We shall work with a Gaussian proposal distribution q(θ; θ(t)) here, whose mean is the previous sample in the Markov chain, and whose variance is ε² . That is, at iteration t of our Metropolis-Hastings algorithm, q(θ; θ(t−1)) = N (θ; θ(t−1), ε² ).

Let’s first define a function mh() implementing the Metropolis Hasting algorithm, using the Gaussian proposal distribution above. The function should take as arguments

  • p_star: a function on θ that is proportional to the density of interest p∗(θ)
  • param_init: the initial sample — a value for θ from where the Markov chain starts
  • num_samples: the number S of samples to generate
  • stepsize: a hyper-parameter specifying the variance of the Gaussian proposal distribution q

and return [θ(1) , . . . , θ(S) ] — a list of S samples from p(θ) ∝ p∗(θ).

import numpy as np
from scipy.stats import multivariate_normal

def mh(p_star, param_init, num_samples=5000, stepsize=1.0):
    
    theta = param_init
    samples = np.zeros((num_samples+1, param_init.shape[0]))
    samples[0] = theta
    for i in range(num_samples):
        theta_cand = multivariate_normal(mean=theta, \
                                  cov=[stepsize**2]*len(theta))\
                                  .rvs(size=1)
        a = min(1, p_star(theta_cand)/p_star(theta))
        u = np.random.uniform()
        if u < a:
            theta = theta_cand
        samples[i+1] = theta
    
    return samples

Let’s use the following function to plot the samples generated.

import matplotlib.pylab as plt

def plot_samples(samples,  
                 p_star=multivariate_normal(mean=np.zeros(2),   
                 cov=np.identity(2)).pdf):
    plt.figure(figsize=(10,5))
    X, Y = np.meshgrid(np.linspace(-3,3,100),   
                       np.linspace(-3,3,100))
    Z = p_star(np.stack([X, Y], axis=-1))
    plt.subplot(121), plt.grid()
    plt.contourf(X, Y, Z)
    plt.subplot(122), plt.grid()
    plt.scatter(samples[:,0], samples[:,1], alpha=0.25)
    plt.show()
  • Let’s test the function by drawing 5,000 samples from the distribution p(x, y)=N(x; 0,1)N(y; 0,1).
  • Let’s initialize at x=0, y=0 and use ε=1.
  • Generate a scatter plot of the obtained samples, using the following code snippet, to obtain the following output.
samples = mh(p_star=multivariate_normal(mean=np.zeros(2)).pdf, \
                                        param_init=np.zeros(2))
plot_samples(samples)

The following animation shows how the Metropolis-Hasting Algorithm works for stepsize=1. The left subplot shows the contours of the original bivariate Gaussian. On the right subplot, the green samples correspond to the ones accepted and red ones correspond to the ones rejected.

  • Let’s sample another 5,000 points from p(x, y)=N(x; 0, 1)N(y; 0, 1) using the function mh() with ε=1, but this time initialize at x=7, y=7.
  • Let’s generate a scatter plot of the drawn samples and notice the first 20 samples.
  • As expected, the plot shows a “trail” of samples, starting at x=7, y=7 and slowly approaching the region of space where most of the probability mass is (i.e., around θ=(0,0)).
  • The following plot shows how the Markov chains converge to the stationary distribution (the value of each parameter component is shown along the vertical axis) for different initial values of the parameter θ=(x,y), for stepsize=0.1
  • The following plot shows how the Markov chains converge to the stationary distribution (the value of each parameter component is shown along the vertical axis) for different initial values of the parameter θ=(x,y), for stepsize=0.05
  • In practice, we don’t know where the distribution we wish to sample from has high density, so we typically initialize the Markov Chain somewhat arbitrarily, or at the maximum a-posterior sample if available.
  • The samples obtained in the beginning of the chain are typically discarded, as they are not considered to be representative of the target distribution. This initial period between initialization and starting to collect samples is called “warm-up”, or also “burn-in”.
  • Now, let’s try to sample from a bivariate Gaussian distribution with covariance matrix having non-zero off-diagonal entries.
mh(p_star=lambda x: multivariate_normal(mean=np.zeros(2),  \
                    cov=np.array([[1,-1/2], [-1/2,1]])).pdf(x)/10, \
                    param_init=np.zeros(2))

The following animation shows how the algorithm works.

  • Finally, let’s sample from a bimodal distribution (a mixture of 2 Gaussians) using the above algorithm, with the following code.
def bimodal_gaussian_pdf(x): 
    return multivariate_normal(mean=np.zeros(2)).pdf(x)/2 + \
           multivariate_normal(mean=5*np.ones(2)).pdf(x)/2

samples = mh(p_star=bimodal_gaussian_pdf, param_init=2*np.ones(2))

The next animation shows how the samples are generated.


Bayesian Logistic Regression: Approximating the Posterior Distribution for Logistic Regression with MCMC using Metropolis-Hastings

  • Let’s use the following dataset (first few records shown below) for training a logistic regression model, where the response variable is the binary variable income_more_50K and the predictor variables are age and educ, which the model will use to predict the response.
import pandas as pd
data = pd.read_csv("adult_us_postprocessed.csv")
len(data)
#32561
data.head(20)
  • First of all let’s set up a Bayesian logistic regression model (i.e. define priors on the parameters 𝛼 and 𝛽 of the model) that predicts the value of income_more_50K based on person’s age and education:
  • Here 𝑥1 is a person’s age, 𝑥2 is his/her level of education, y indicates his/her level of income, 𝛼, 𝛽1 and 𝛽2 are parameters of the model which can be represented as the parameter vector θ.
  • Let’s compute the posterior (upto a constant, i.e., P(θ|D) ∝ P(D|θ)P(θ)) as a product of the likelihood P(D|θ) and the prior P(θ) and use the Metropolis-Hastings algorithm to sample from the approximate (true) posterior, as shown in the following figure.
  • In practice we compute log-likelihood, log-prior and log-p* to prevent numerical underflow.
  • The following python code snippet shows how to compute the log-p* and also slightly modified version of the Metropolis-Hastings algorithm in order to be able to work on the log domain.
eps = 1e-12

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def log(x):
    # numerically safe element-wise log: entries below eps are set to 0
    # and skipped (a crude guard against log(0) / numerical underflow)
    x = x.copy()
    x[x <= eps] = 0
    x[x > 0] = np.log(x[x > 0])  
    return x

def log_p_star(data, theta, s=100):
    y = np.array(data['income_more_50K'].tolist())
    p = sigmoid(np.dot(np.hstack((np.ones((len(data), 1)), 
                 data[['age', 'educ']].to_numpy())), theta))
    return np.sum(y*log(p)+(1-y)*log(1-p)) \
           - np.dot(theta.T, theta)/2/s**2 - 3*np.log(s)

def mh(log_p_star, param_init, num_samples=5000, stepsize=1.0):
    
    theta = param_init
    samples = np.zeros((num_samples+1, param_init.shape[0]))
    samples[0] = theta
    for i in range(num_samples):
        # Gaussian random-walk proposal with independent coordinates
        theta_cand = multivariate_normal(mean=theta, 
                     cov=[stepsize**2]*len(theta)).rvs(size=1)
        ll_cand = log_p_star(theta_cand)
        ll_old = log_p_star(theta)
        # accept with probability min(1, exp(ll_cand - ll_old))
        if ll_cand > ll_old:
            theta = theta_cand
        else:
            u = np.random.uniform()
            if u < np.exp(ll_cand - ll_old):
                theta = theta_cand
        samples[i+1] = theta
    
    return samples

samples = mh(log_p_star=lambda x: log_p_star(data, x),
             param_init=np.random.rand(3), stepsize=0.1)
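As a quick illustration of why the log domain is necessary here: the likelihood is a product of 32561 Bernoulli terms, which underflows to exactly 0.0 in floating point, while its log stays perfectly representable (the 0.7 below is an arbitrary stand-in for the per-example probabilities):

import numpy as np

p = np.full(32561, 0.7)   # 32561 = number of records in the dataset
print(np.prod(p))         # 0.0 -- the product underflows
print(np.sum(np.log(p)))  # ~ -11613.7 -- the log-likelihood is fine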

The following trace-plot animation shows how the posterior distribution is approximated for each of the parameters α, β1 and β2, using 3 Markov chains corresponding to 3 different initializations of the parameters (with a diffuse N(0, 10000) prior on each of them).

Image for post
Image by Author

Bayesian Poisson Regression

  • Consider a Bayesian Poisson regression model, where outputs y_n are generated from a Poisson distribution of rate exp(α.x_n + β), where the x_n are the inputs (covariates), and α and β the parameters of the regression model for which we assume a Gaussian prior:
Image for post
  • We are interested in computing the posterior density of the parameters (α, β) given the data D above.
  • Let’s derive and implement a function p* of the parameters α and β that is proportional to the posterior density p(α, β | D), and which can thus be used as the target density in the Metropolis-Hastings algorithm. For numerical accuracy, let’s perform the computation in the log domain, as shown in the next figure:
Image for post
Image by Author
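For reference, writing s_n = θᵀ(1, x_n) for the linear predictor (so that the rate is λ_n = exp(s_n)), the log target that the code below computes works out to the following (up to an additive constant, with a N(0, s²) prior on both parameters):

log p*(α, β) = Σ_n [ y_n·s_n − log(y_n!) − exp(s_n) ] − (α² + β²)/(2s²) − 2·log s.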
  • Let’s use the Metropolis Hastings algorithm to draw 5,000 samples from the posterior density p(α,β|D).
  • Let’s set the (hyper-) parameters of the Metropolis-Hastings algorithm to:
    • param_init = (α_init, β_init) = (0, 0),
    • stepsize = 1,
    • number of warm-up steps W = 1000.
  • Let’s plot the drawn samples with x-axis α and y-axis β and report the posterior mean of α and β, as well as their correlation coefficient under the posterior.
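The inputs x and the counts y for this problem are assumed to have been loaded earlier in the post; as a self-contained stand-in, one could generate a synthetic dataset like the following (the “true” intercept 0.5 and slope 0.3 are made up, purely for illustration):

import numpy as np
from scipy.stats import poisson

np.random.seed(0)
x = np.random.uniform(-2, 2, size=100)   # covariates
y = poisson.rvs(mu=np.exp(0.5 + 0.3*x))  # counts with rate exp(s_n)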

The following python code snippet implements the algorithm:


from scipy.special import factorial

def log_p_star(x, y, theta, s=10):
    # Poisson log-likelihood plus the log of a N(0, s^2) prior
    # on both parameters (up to an additive constant)
    alpha, beta = theta[0], theta[1]
    x1 = np.hstack((np.ones((len(x), 1)), np.reshape(x, (-1,1))))
    sx = np.dot(x1, theta)   # linear predictor s_n
    return np.sum(y*sx - np.log(factorial(y)) - np.exp(sx)) - \
                 (alpha**2 + beta**2)/2/s**2 - 2*np.log(s)

def mh(log_p_star, param_init, num_samples=5000, stepsize=1.0, 
       warm_up=1000):
    
    theta = param_init
    samples = np.zeros((num_samples+1, param_init.shape[0]))
    samples[0] = theta
    for i in range(num_samples):
        # Gaussian random-walk proposal
        theta_cand = multivariate_normal(mean=theta, 
                     cov=[stepsize**2]*len(theta)).rvs(size=1)
        ll_cand = log_p_star(theta_cand)
        ll_old = log_p_star(theta)
        # accept with probability min(1, exp(ll_cand - ll_old))
        if ll_cand > ll_old:
            theta = theta_cand
        else:
            u = np.random.uniform()
            if u < np.exp(ll_cand - ll_old):
                theta = theta_cand
        samples[i+1] = theta
    # discard the warm-up (burn-in) samples
    return samples[warm_up+1:]

samples = mh(log_p_star=lambda theta: log_p_star(x, y, theta),
             param_init=np.zeros(2), stepsize=1)

np.mean(samples, axis=0)
# [-0.25056742,  0.91365977]

np.corrcoef(samples[:,0], samples[:,1])
# [[ 1.        , -0.62391813],
#  [-0.62391813,  1.        ]]
  • As can be seen above, the two parameters are negatively correlated, which can also be seen in the following animation.
Image for post
Image by Author
  • The following trace plot animation shows how 3 Markov chains initialized with different parameter values converge to the stationary posteriors for the parameters.
Image for post

Gibbs Sampling

Suppose x is a 2D multivariate Gaussian random variable, i.e., x ∼ N(μ, Σ), where μ = (0, 0) and Σ = (1, −0.5; −0.5, 1).

  • Let’s implement Gibbs sampling to estimate the 1D marginals, p(x1) and p(x2), and plot the estimated marginals as histograms, superimposing a plot of the exact marginals.
  • In order to do Gibbs sampling, we need the two conditional distributions, namely p(x1|x2) and p(x2|x1). We assume that we know how to sample from these 1D conditional (Gaussian) distributions, while pretending that we cannot sample from the original 2D Gaussian directly.
  • The following figure shows the conditional and marginal distributions for a given 2D Gaussian:
Image for post
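Concretely, the two 1D conditionals of a bivariate Gaussian that the sampler needs (and that the code below implements) are:

p(x1 | x2) = N( μ1 + Σ12·(x2 − μ2)/Σ22,  Σ11 − Σ12·Σ21/Σ22 )
p(x2 | x1) = N( μ2 + Σ21·(x1 − μ1)/Σ11,  Σ22 − Σ21·Σ12/Σ11 )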
  • We shall use the numpy function numpy.random.normal() to sample from the (conditional) 1D Gaussians, but we must not sample from the 2D Gaussian directly.
  • The following figure describes the Gibbs sampling algorithm.
Image for post

The following python code snippet shows a straightforward implementation of the above algorithm:

import numpy as np

μ = np.array([0, 0])
Σ = np.array([[1, -0.5], [-0.5, 1]])
μx, μy = μ[0], μ[1]
Σxx, Σyy = Σ[0,0], Σ[1,1]
Σxy = Σyx = Σ[0,1]
x0, y0 = np.random.rand(), np.random.rand()
N = 500
xs, ys = [], []   # collected Gibbs samples
for i in range(N):
    # draw a sample from the 1D conditional p(x | y=y0)
    x1 = np.random.normal(loc = μx + Σxy * (y0 - μy) / Σyy, 
                          scale = np.sqrt(Σxx - Σxy*Σyx/Σyy)) 
    # draw a sample from the 1D conditional p(y | x=x1)
    y1 = np.random.normal(loc = μy + Σyx * (x1 - μx) / Σxx, 
                          scale = np.sqrt(Σyy - Σyx*Σxy/Σxx)) 
    x0, y0 = x1, y1
    xs.append(x1); ys.append(y1)   # store the samples

The following animation shows how the Gibbs sampling algorithm approximates the 2D Gaussian distribution. In this case, since the two coordinates are correlated, the Gibbs sampler is slow to explore the space.

Image for post
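The next snippet is a minimal sketch of how the estimated marginals can be plotted from the collected samples (assuming the xs, ys lists from the Gibbs snippet above), superimposing the exact marginals, which are simply N(μx, Σxx) and N(μy, Σyy):

import matplotlib.pyplot as plt
from scipy.stats import norm

grid = np.linspace(-4, 4, 200)
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, s, mu, var, name in [(axes[0], xs, μx, Σxx, 'x'),
                             (axes[1], ys, μy, Σyy, 'y')]:
    ax.hist(s, bins=30, density=True, alpha=0.5, label='Gibbs samples')
    ax.plot(grid, norm.pdf(grid, loc=mu, scale=np.sqrt(var)), label='exact marginal')
    ax.set_title('p(%s)' % name)
    ax.legend()
plt.show()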

The original and the estimated marginal distributions for the random variables x and y are shown below (obtained with only 250 iterations of the Gibbs sampling algorithm):

Image for post
Image by Author

Application in NLP: Implementing Topic Modeling with Latent Dirichlet Allocation (LDA) using collapsed Gibbs Sampling

Collapsed Gibbs sampling can be used to implement topic modeling with Latent Dirichlet allocation (LDA). Let’s first understand the Dirichlet distribution (which is a distribution over distributions) and its properties (e.g., the conjugate-prior property, as shown in the following figure).

Image for post

The following animation shows the Dirichlet simplex for m=3. As can be seen, for α < 1 the density gets concentrated towards the corners of the simplex, for α = 1 it becomes uniform over the simplex, and for α > 1 it gets concentrated towards the center of the simplex. Here we consider the symmetric Dirichlet, i.e., concentration parameters with α1 = α2 = α3 (e.g., α = (2,2,2) is denoted by α = 2 in the animation). θ1, θ2 and θ3 represent the 3 corners of the simplex.

Image for post
Image by Author
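This concentration effect is easy to verify numerically with numpy’s Dirichlet sampler (the α values below are arbitrary):

import numpy as np

np.random.seed(1)
for a in [0.1, 1.0, 10.0]:
    # small α: samples near the corners; large α: samples near (1/3, 1/3, 1/3)
    print(a, np.random.dirichlet([a, a, a], size=3).round(2))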

The following animation shows a few examples of asymmetric Dirichlet densities over the simplex:

Image for post

Latent Dirichlet Allocation (LDA)

  • Each document is a (different) mixture of topics.
  • LDA lets each word be on a different topic.
  • For each document d:
    • Choose a multinomial distribution θ_d over topics for that document.
    • For each of the N words w_n in the document:
      • Choose a topic z_n with p(topic) = θ_d.
      • Choose a word w_n from a multinomial conditioned on z_n, with p(w = w_n | topic = z_n).
  • Note: each topic has a different probability of generating each word.

The following figure shows the algorithm for LDA, using collapsed Gibbs Sampling:

Image for post

For each of the M documents in the corpus:

  • Choose the topic distribution θ ~ Dirichlet(α)
  • For each of the N words w_n:
    • Choose a topic z ~ Multinomial(θ)
    • Choose a word w_n ~ Multinomial(β_z)

Implementation: to be continued…
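In the meantime, here is a minimal sketch of the per-token update at the heart of the collapsed Gibbs sampler (all names below are hypothetical: docs is a list of documents, each a list of word ids; z holds the current topic assignment of every token; n_dk, n_kw and n_k are the document-topic, topic-word and topic count arrays implied by z):

import numpy as np

def collapsed_gibbs_sweep(docs, z, n_dk, n_kw, n_k, alpha, beta, K, V):
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            k = z[d][i]
            # remove the current assignment of this token from the counts
            n_dk[d, k] -= 1; n_kw[k, w] -= 1; n_k[k] -= 1
            # p(z_i = k | z_-i, w) ∝ (n_dk + α)(n_kw + β)/(n_k + Vβ)
            p = (n_dk[d] + alpha) * (n_kw[:, w] + beta) / (n_k + V*beta)
            k = np.random.choice(K, p=p/p.sum())
            # add the new assignment back to the counts
            z[d][i] = k
            n_dk[d, k] += 1; n_kw[k, w] += 1; n_k[k] += 1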

Probabilistic Programming with PyMC3

Now, we will learn how to use the library PyMC3 for probabilistic programming and inference.

Bayesian Logistic regression with PyMC3

  • Logistic regression is a powerful model that allows us to analyze how a set of features affects some binary target label. The posterior distribution over the weights gives us an estimate of the influence of each particular feature on the probability of the target being equal to one.
  • Most importantly, the posterior distribution gives us interval estimates for each weight of the model. This is very important for data analysis when we want not only to provide a good model but also to estimate the uncertainty of our conclusions.
  • In this problem, we will learn how to use the PyMC3 library to perform approximate Bayesian inference for logistic regression (this part is based on the logistic regression tutorial by Peadar Coyle and J. Benjamin Cook).
  • We shall use the same dataset as before. The problem here is to model how the probability that a person has salary ≥ $50K is affected by his/her age, education, sex and other features.
  • Let y_i=1 if i-th person’s salary is ≥ $50K and y_i =0 otherwise. Let x_{ij} be the j-th feature of the i-th person.
  • As explained above, the Logistic regression models this probability in the following way: p(y_i=1∣β)=σ(β1.x_{i1}+β2.x_{i2}+⋯+βk.x_{ik}), where σ(t)=1/(1+exp(−t))

Odds ratio

  • Let’s try to answer the following question: does the gender of a person affect his or her salary? To do this we will use the concept of odds.
  • If we have a binary random variable y (which may indicate whether a person makes ≥ $50K) and if the probability of the positive outcome p(y=1) is, for example, 0.8, we say that the odds are 4 to 1 (or just 4 for short), because succeeding is 4 times more likely than failing: p(y=1)/p(y=0) = 0.8/0.2 = 4.
  • Now, let’s return to the effect of gender on the salary. Let’s compute the ratio between the odds of a male having salary ≥ $50K and the odds of a female (with the same level of education, experience and everything else) having salary ≥ $50K. 
  • The first feature of each person in the dataset is gender. Specifically, x_{i1}=0 if the person is female and x_{i1}=1, otherwise. Consider two people i and j having all but one features the same with the only difference in x_{i1}≠x_{j1}.
  • If the logistic regression model above estimates the probabilities exactly, then (since under this model the odds p/(1−p) equal the exp of the linear predictor) the odds for the male will be: exp(β1·1 + β2·x_{i2} + ⋯ + βk·x_{ik}).
  • Now the ratio of the male and female odds will be: exp(β1·1 + β2·x_{i2} + ⋯)/exp(β1·0 + β2·x_{j2} + ⋯) = exp(β1), since all the other features are the same.
  • So given the correct logistic regression model, we can estimate the odds ratio for some feature (gender in this example) by just looking at the corresponding coefficient (see the quick numeric check after this list).
  • But of course, even if all the logistic regression assumptions are met we cannot estimate the coefficient exactly from real-world data, it’s just too noisy. 
  • So it would be really nice to build an interval estimate, which would tell us something along the lines of “with probability 0.95 the odds ratio is greater than 0.8 and less than 1.2, so we cannot conclude that there is any gender discrimination in the salaries” (or vice versa, “with probability 0.95 the odds ratio is greater than 1.5 and less than 1.9, so discrimination takes place, because the odds of a male getting >$50K are at least 1.5 times those of a comparable female”). In Bayesian statistics, this interval estimate is called a credible interval.
  • Unfortunately, it’s impossible to compute this credible interval analytically. So let’s use MCMC for that!
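As a quick numeric sanity check of the exp(β1) identity above (the coefficients and feature values below are made up, purely for illustration):

import numpy as np

a, b1, b2 = -1.0, 0.7, 0.2                        # made-up coefficients
odds = lambda x1, x2: np.exp(a + b1*x1 + b2*x2)   # odds = p/(1-p)

print(odds(1, 13) / odds(0, 13))  # male vs. female odds ratio: 2.0137...
print(np.exp(b1))                 # exp(beta1) gives the same 2.0137...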

Credible interval

  • A credible interval for the value of exp(β1) is an interval [a,b] such that p(a ≤ exp(β1) ≤ b ∣ Xtrain,ytrain) is 0.95 (or some other predefined value). To compute the interval, we need access to the posterior distribution p(exp(β1) ∣ Xtrain,ytrain).
  • Let’s, for simplicity, focus on the posterior over the parameter, p(β1 ∣ Xtrain, ytrain), since if we compute it, we can always find [a, b] such that p(log a ≤ β1 ≤ log b ∣ Xtrain, ytrain) = p(a ≤ exp(β1) ≤ b ∣ Xtrain, ytrain) = 0.95.

MAP inference

Let’s read the dataset again. This is a post-processed version of the UCI Adult dataset.

data = pd.read_csv("adult_us_postprocessed.csv")

Each row of the dataset describes a person by his/her features. The last column is the target variable y; i.e., y = 1 indicates that this person’s annual salary is more than $50K.

First of all let’s set up a Bayesian logistic regression model (i.e. define priors on the parameters α and β of the model) that predicts the value of “income_more_50K” based on a person’s age and education:

α, β1, β2 ∼ N(0, 100²),   p(y_i = 1 ∣ α, β) = σ(α + β1·x_{i1} + β2·x_{i2}),

where x_1 is a person’s age, x_2 is his/her level of education, y indicates his/her level of income, and α, β1 and β2 are the parameters of the model.


with pm.Model() as manual_logistic_model:

    alpha = pm.Normal('alpha', mu=0, sigma=100)
    beta1 = pm.Normal('beta1', mu=0, sigma=100)
    beta2 = pm.Normal('beta2', mu=0, sigma=100)
    
    x1 = data.age.to_numpy()
    x2 = data.educ.to_numpy()
    z = alpha + beta1 * x1 + beta2 * x2
    
    p = pm.invlogit(z)
    target = data.income_more_50K.to_numpy()
    y = pm.Bernoulli('income_more_50K', p=p, observed=target) 
    
    map_estimate = pm.find_MAP()
    print(map_estimate)
    # {'alpha': array(-6.74809299), 'beta1': array(0.04348244), 
    #  'beta2': array(0.3621089)}
  • There’s a simpler interface for generalized linear models in PyMC3. Let’s compute the MAP estimates of the corresponding coefficients using it.

with pm.Model() as logistic_model:
    pm.glm.GLM.from_formula('income_more_50K ~ age + educ', data,
                            family=pm.glm.families.Binomial())
    map_estimate = pm.find_MAP()
    print(map_estimate)
    # {'Intercept': array(0.0579459), 'age': array(1.29479444), 
    #  'educ': array(0.67566864)}

MCMC

To find credible regions let’s perform MCMC inference.

  • Let’s use the following function to visualize the sampling process.

def plot_traces(traces, burnin=200):
       
    ax = pm.traceplot(traces[burnin:], 
                      figsize = (12,len(traces.varnames)*1.5),
                      lines={k: v['mean'] for k, v in  
                            pm.summary(traces[burnin:]).iterrows()})

    for i, mn in enumerate(pm.summary(traces[burnin:])['mean']):
        ax[i,0].annotate('{:.2f}'.format(mn), xy=(mn,0), 
                         xycoords='data', xytext=(5,10), 
                         textcoords='offset points', rotation=90,
                         va='bottom', fontsize='large', 
                         color='#AA0022')

Metropolis-Hastings

  • Let’s use the Metropolis-Hastings algorithm for finding the samples from the posterior distribution.
  • Let’s explore the hyperparameters of Metropolis-Hastings such as the proposal distribution variance to speed up the convergence. The plot_traces function can be used to visually inspect the convergence.

We may also use MAP-estimate to initialize the sampling scheme to speed things up. This will make the warmup (burn-in) period shorter since we shall start from a probable point.
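For instance, the following sketch starts the Metropolis chain from the MAP point (pm.sample accepts a start dict of parameter values; the model is the same GLM as below):

with pm.Model() as logistic_model_map_init:
    pm.glm.GLM.from_formula('income_more_50K ~ age + educ', data,
                            family=pm.glm.families.Binomial())
    start = pm.find_MAP()      # most probable point of the posterior
    step = pm.Metropolis()
    trace = pm.sample(2000, step=step, start=start)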


with pm.Model() as logistic_model:
    
    pm.glm.GLM.from_formula('income_more_50K ~ age + educ', data,
                            family=pm.glm.families.Binomial())
    step = pm.Metropolis()
    trace = pm.sample(2000, step)  # draw 2000 samples

plot_traces(trace)

The trace plot obtained is shown in the following figure (using 2 Markov chains, by default):

Image by Author
  • Since it is unlikely that the dependency between age and salary is linear, let’s now include age squared into the features, so that we can model a dependency that favors certain ages.
  • Let’s train the Bayesian logistic regression model on the following features: sex, age, age², educ, hours, and use pm.sample() to run MCMC to train this model, using the following code snippet:

with pm.Model() as logistic_model:
    
    pm.glm.GLM.from_formula(
              'income_more_50K ~ sex + age + age^2 + educ + hours', 
               data, family=pm.glm.families.Binomial())
    step = pm.Metropolis()
    trace = pm.sample(2000, step)

plot_traces(trace)

The trace plot obtained this time is shown in the following figure:

Image by Author
  • Now, let’s use the NUTS sampler to obtain quicker convergence, drawing only 400 samples and using adapt_diag initialization, as shown in the next python code snippet:

with pm.Model() as logistic_model:
    
    pm.glm.GLM.from_formula(
               'income_more_50K ~ sex + age + age^2 + educ + hours',   
               data,
               family=pm.glm.families.Binomial())
    trace = pm.sample(400, init = 'adapt_diag') 
  • We don’t need to use a large burn-in here, since we have initialized sampling from a good point (from our approximation of the most probable point (MAP), to be more precise).

burnin = 100
b = trace['sex[T. Male]'][burnin:]
plt.hist(np.exp(b), bins=20, density=True)
plt.xlabel("Odds Ratio")
plt.show()
  • Finally, we can find a credible interval (recall that credible intervals are Bayesian, whereas confidence intervals are frequentist) for this quantity. This may be the best part about Bayesian statistics: we get to interpret credible intervals the way we’ve always wanted to interpret them. We are 95% confident that the odds ratio lies within our interval!

lb, ub = np.percentile(b, 2.5), np.percentile(b, 97.5)
print("P(%.3f < Odds Ratio < %.3f) = 0.95" % (np.exp(lb), np.exp(ub)))
# P(2.983 < Odds Ratio < 3.450) = 0.95

Euler Method to solve Ordinary differential Equations with R

In this article, the simplest numerical method, namely the Euler method, for solving 1st-order ODEs will be demonstrated with examples (and then used to solve 2nd-order ODEs). Also, we shall see how to plot the phase lines (gradient fields) for an ODE and understand from examples how to qualitatively find a solution curve using the phase lines. All the problems are taken from the edx Course: MITx — 18.03x: Introduction to Differential Equations.

Let’s implement the above algorithm steps with the following R code snippet.


library(tidyverse)
library(latex2exp)
library(gridExtra)

Euler <- function(f, xn, yn, h, N, n=0) {
  df <- NULL
  while (n < N) {
    An <- f(xn, yn)   # slope at the current point (x_n, y_n)
    df <- rbind(df, data.frame(n=n, Xn=xn, Yn=yn, An=An, hAn=h*An))
    xn <- xn + h      # step forward in x
    yn <- yn + h*An   # Euler update: y_{n+1} = y_n + h.f(x_n, y_n)
    n <- n + 1
  }
  return(df)
}

Let’s define the function f next.


f <- function(x, y) {
  return(y*sin(x))
}

df <- Euler(f, -3, 1, 0.1, 100)
head(df)

The below table shows the solution curve computed by the Euler method for the first few iterations:

Let’s implement the following R functions to draw the isoclines and the phase lines for an ODE.


plot_isoclines <- function(f, rx, ry, rm, l, xgap, title) {
  lx <- rx[1]
  ux <- rx[2]
  x <- seq(lx, ux, xgap)
  idf <- NULL 
  for (m in rm) {
    # points on the m-isocline (where the slope equals m), drawn as
    # short segments of slope m and length l
    y <- f(x, m)
    dx <- l / 2 / sqrt(1 + m^2)
    idf <- rbind(idf, data.frame(x0 = x - dx, x1 = x + dx,
                                 y0 = y - m*dx, y1 = y + m*dx, m = m))
  }
  idf %>% ggplot() + 
    geom_segment(aes(x = x0, y = y0, xend = x1, yend = y1, col = m)) +
    xlim(rx[1], rx[2]) +
    ylim(ry[1], ry[2]) +
    scale_color_gradient2(low = "blue", mid = "green", high = "red") +
    ggtitle(title) +
    xlab('x') +
    ylab('y') +
    theme_bw() +
    theme(plot.title = element_text(hjust = 0.5))
}


plot_phaselines <- function(f, rx, ry, dx, dy, l, ms=-1:1, title=NULL) {

  lx <- rx[1]
  ux <- rx[2]
  ly <- ry[1]
  uy <- ry[2]
  x <- seq(lx, ux, dx)
  y <- seq(ly, uy, dy)
  pdf <- expand.grid(x = x, y = y)
  pdf$m <- f(pdf$x, pdf$y)
  pdf$dx <- l / 2 / sqrt(1 + pdf$m^2)
  pdf$x0 <- pdf$x - pdf$dx
  pdf$x1 <- pdf$x + pdf$dx
  pdf$y0 <- pdf$y - pdf$m*pdf$dx
  pdf$y1 <- pdf$y + pdf$m*pdf$dx
  pdf %>% ggplot() + 
  geom_segment(aes(x = x0, y = y0, xend = x1, yend = y1, col = m)) +
           xlim(rx[1], rx[2]) +
           ylim(ry[1], ry[2]) +
           ggtitle(title) +
           xlab('x') +
           ylab('y') +
           theme_bw() +
           theme(plot.title = element_text(hjust = 0.5))
}

Now, let’s plot the m-isoclines, including the nullcline (m = 0), for the ODE y’ = y·sin(x).


ifunc <- function(x, m) m / sin(x)

rx <- c(-4,4)
ry <- c(-3,3)
rm <- seq(-3,3,0.5) 
l <- 0.2
xgap <- 0.05 
ygap <- 0.05 
plot_isoclines(ifunc, rx, ry, rm, l, xgap, TeX('m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))


xgap <- 0.25
ygap <- 0.25
plot_phaselines(f, rx, ry, xgap, ygap, l, -1:1, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))

Now, let’s solve the ODE with Euler’s method, using step size h = 0.1 with the initial value y(-3) = 1.


l <- 0.2
xgap <- 0.05
ygap <- 0.05
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -1:1, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y.sin(x)$'))
df <- Euler(f, -3, 1, 0.1, 100)
p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)

The black curve in the following figure shows the solution path with Euler.

Let’s plot the phase lines for the ODE y’ = y² − x, solve it with Euler and plot the solution curve again, using the following code.


f <- function(x, y) {
  return(y^2-x)
}

rx <- c(-4,4)
ry <- c(-3,3)
rm <- seq(-3,3,0.5)
l <- 0.2
xgap <- 0.1
ygap <- 0.1
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -2:2, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y^2-x$'))
df <- Euler(f, -1, 0, 0.1, 100)
p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)

The black solution curve shows the trajectory of the solution obtained with Euler.

Finally, let’s plot the isoclines for the ODE y’ = y² − x² and solve it with Euler. Let’s try starting with different initial values and observe the solution curves in each case.


f <- function(x, y) {
  return(y^2-x^2)
}

rx <- c(-4,4)
ry <- c(-4,4)
l <- 0.2
xgap <- 0.1
ygap <- 0.1
p <- plot_phaselines(f, rx, ry, xgap, ygap, l, -2:2, TeX('phaselines and m-isoclines for $\\frac{dy}{dx} = y^2-x^2$, solving with Euler with diff. initial values'))
df <- Euler(f, -3.25, 3, 0.1, 100)
p <- p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)
df <- Euler(f, -1, 1, 0.1, 50)
p + geom_point(aes(Xn, Yn), data=df) + geom_path(aes(Xn, Yn), arrow = arrow(), data=df)

As can be seen, solution curves starting from different initial values lead to entirely different solution paths, depending on the phase lines and isoclines they get trapped in.

Finally, let’s note the impact of the step size on the solution obtained by the Euler method, in terms of the error and the number of iterations taken to obtain the solution curve. As can be seen from the following animation, with the bigger step size h=0.5 the algorithm traces out the solution path faster, but with much higher error values, particularly in the initial iterations. On the contrary, the smaller step size h=0.1 takes many more iterations to trace the curve, but the error is much smaller: the solution curve is quite smooth and follows the isoclines without any zigzag movement.

Spring-mass-dashpot system – the Damped Harmonic Oscillator

A spring is attached to a wall on its left and a cart on its right. A dashpot is attached to the cart on its left and the wall on its right. The cart has wheels and can roll either left or right.

The spring constant is labeled k, the mass is labeled m, and the damping coefficient of the dashpot is labeled b. The center of the cart is at rest at the location x = 0, and we take rightwards to be the positive x direction.

Consider the spring-mass-dashpot system in which the mass is m (kg) and the spring constant is k (Newtons per meter). The position x(t) (meters) of the mass at time t (seconds) is governed by the DE

m·x''(t) + b·x'(t) + k·x(t) = 0,

where m, b, k > 0. Let’s investigate the effect of the damping constant b on the solutions to the DE.

The next animation shows how the solution changes for different values of b, given m=1 and k=16, for a couple of different initial conditions (c).

Now, let’s observe how the amplitude of the system response increases with the increasing frequency of the input sinusoidal force, achieving its peak at the resonance frequency (for m=1, k=16 and b=√14, the resonance occurs at ω_r = 3).

Solving the 2nd order ODE for the damped Harmonic Oscillator with Euler method

Now, let’s see how the 2nd order ODE corresponding to the damped Harmonic oscillator system can be numerically solved using the Euler method. In order to be able to use Euler, we need to represent the 2nd order ODE as a collection of 1st order ODEs and then solve those 1st order ODEs simultaneously using Euler, as shown in the following figure:

Let’s implement a numerical solver with Euler for a 2nd order ODE using the following R code snippet:


Euler2 <- function(f, tn, xn, un, h, N, n=0) {
  df <- NULL
  while (n < N) {
    df <- rbind(df, data.frame(n=n, tn=tn, xn=xn, un=un))
    # Euler updates, with both slopes evaluated at the current point:
    # x_{n+1} = x_n + h.u_n,  u_{n+1} = u_n + h.f(t_n, x_n, u_n)
    un1 <- un + h*f(tn, xn, un)
    xn <- xn + h*un
    un <- un1
    tn <- tn + h
    n <- n + 1
  }
  return(df)
}

For the damped Harmonic Oscillator, we have the 2nd order ODE

m·x''(t) + b·x'(t) + k·x(t) = 0,

from which we get the following two 1st order ODEs:

x'(t) = u(t),   u'(t) = −(b/m)·u(t) − (k/m)·x(t).

Let’s solve the damped harmonic oscillator with the Euler method:


f <- function(t, x, u) {
    return(-b*u/m - k*x/m)
}

m <- 1
b <- 1
k <- 16
df <- Euler2(f, 0, 1, 0, 0.1, 100)
head(df)


Now, let’s plot the solution curves using the following code block:


par(mfrow=c(1,2))    
plot(df$tn, df$xn, main="x(t) vs. t", xlab='t', ylab='x(t)', col='red', pch=19, cex=0.7, 
       xaxs="i", yaxs="i")
lines(df$tn, df$xn, col='red')
grid(10, 10)
plot(df$tn, df$un, main=TeX('$\\dot{x}(t)$ vs. t'), xlab='t', ylab=TeX('$\\dot{x}(t)$'), col='blue', 
       pch=19, cex=0.7, xaxs="i", yaxs="i")
lines(df$tn, df$un, col='blue')
grid(10, 10)

The following animation shows the solution curves obtained by joining the points computed with the Euler method for the damped Harmonic Oscillator: