Non-Negative Matrix Factorization for Text Classification and Recommendation

In this blog, we shall discuss how matrix-factorization-based unsupervised machine learning techniques can be applied to solve problems such as text classification (BBC news documents) and movie recommendation (MovieLens ratings prediction). The problems appeared in the Coursera course Unsupervised Learning by the University of Colorado Boulder.
More specifically, we shall focus on the application of the NMF (Non-negative matrix factorization) algorithm, as shown in the next figure, along with TSVD (truncated SVD).

Problem 1. Kaggle Competition: BBC News Classification

This Kaggle competition is about categorizing news articles. We shall primarily use unsupervised machine learning, more precisely matrix factorization techniques (e.g., truncated SVD and NMF), to predict the category of a news article. We shall also compare the unsupervised models’ performance (trained on the training dataset) in terms of classification accuracy (on the held-out test dataset) and analyze the shortcomings of the models.

We shall execute the following steps to achieve the text classification task:

  • start with an exploratory data analysis (EDA) procedure (inspect, visualize and clean the noisy text dataset) – a mandatory step before the actual analysis can begin
  • preprocess the dataset – to extract relevant features (e.g., we shall extract bag-of-words and tf-idf features with scikit-learn)
  • formulate the text classification problem as a topic modeling (extraction) problem, build a couple of unsupervised models based on matrix factorization (namely, truncated SVD and non-negative matrix factorization) with scikit-learn, in order to extract the topics from the articles in an unsupervised manner (i.e., without using the categories provided), and train them on the training text dataset. We shall use the models to predict the categories of the test articles.
  • build a few supervised multi-class classification models (e.g., random forest, support vector machine and gradient boosting), train them on the training dataset (this time using the categories as the class labels) and use them to predict the categories of the test articles.
  • for model selection we shall use a held-out validation set from the training dataset to evaluate our models.
  • finally we shall compare the unsupervised vs. the supervised learning models in terms of their performance on the test dataset (Kaggle leaderboard score).

Exploratory Data Analysis (EDA)

Inspect, Visualize and Clean the Data

First we need to import all the python packages / functions (install any missing ones with pip) required to clean the news article texts, build the models and visualize the results. We shall use scikit-learn to build and train the models.

import numpy as np 
import pandas as pd 
import os, math

#for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

#for text cleaning / preprocessing
import string, re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

#for data analysis and modeling
from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight
from sklearn import decomposition

import warnings
warnings.filterwarnings('ignore')

Inspect

Read the train and test dataframes; the only columns that we shall use are Text (to extract the input features from) and Category (the output to predict).

df_train = pd.read_csv('learn-ai-bbc/BBC News Train.csv') #, index_col='id')
df_test = pd.read_csv('learn-ai-bbc/BBC News Test.csv') #, index_col='id')
df_train.head()
   ArticleId                                               Text  Category
0       1833  worldcom ex-boss launches defence lawyers defe…   business
1        154  german business confidence slides german busin…   business
2       1101  bbc poll indicates economic gloom citizens in …   business
3       1976  lifestyle governs mobile choice faster bett…         tech
4        917  enron bosses in $168m payout eighteen former e…   business
df_test.head()
   ArticleId                                               Text
0       1018  qpr keeper day heads for preston queens park r…
1       1319  software watching while you work software that…
2       1138  d arcy injury adds to ireland woe gordon d arc…
3        459  india s reliance family feud heats up the ongo…
4       1020  boro suffer morrison injury blow middlesbrough…

There are 1490 news articles in the training and 735 in the test dataset, respectively.

df_train.shape, df_test.shape
# ((1490, 3), (735, 2))

The maximum number of words present in an article is 3345 and 4492, in the training and test dataset, respectively.

max_len_train = max(df_train['Text'].apply(lambda x: len(x.split())).values)
max_len_test = max(df_test['Text'].apply(lambda x: len(x.split())).values)
max_len_train, max_len_test
# (3345, 4492)

Visualize

The following plot shows the histogram of class labels, i.e., the number of news articles belonging to each of the five categories in the training dataset. As can be seen, the dataset is slightly imbalanced.

freq_df = pd.DataFrame(df_train['Category'].value_counts()).reset_index()
freq_df.columns = ['Category', 'count']
ncats = len(freq_df)
print(ncats)
ax = sns.barplot(data=freq_df, x='Category', y='count', hue='Category')
ax.set_xticklabels(ax.get_xticklabels(), 
                          rotation=90, 
                          horizontalalignment='right')
plt.margins(x=0.01)
freq_df.head()
# 5
        Category  count
0          sport    346
1       business    336
2       politics    274
3  entertainment    273
4           tech    261

Clean / Preprocess

Since the news article texts are likely to contain many junk characters along with very common non-informative words (stopwords, e.g., ‘the’), it is a good idea to clean the text (with the function clean_text() shown below) and remove unnecessary content before building the models, since otherwise it can hurt performance. The function cleans the input text with the following steps:

  • remove / replace punctuations
  • replace numbers, dollar and pound values (very important – it will be particularly useful for classification of business category news articles, which are expected to contain a lot of money / dollar related words)
  • split into words (tokens) / tokenize
  • remove stopwords (very common / unimportant words that do not add any value to news article classification)
  • remove leftover punctuations

It’s important that we apply the same preprocessing on both the training and test dataset.

def clean_text(txt):
    
    #remove punctuations
    txt  = "".join([char if char not in string.punctuation else ' ' for char in txt ])
    
    #process numbers / money
    txt = txt.replace('£', '$')
    txt = re.sub("\d+", " NUM ", txt) # change numbers to word " NUM "
    txt = re.sub('\s+', ' ', txt)   
    txt = txt.replace("NUM NUM", "NUM")
    txt = txt.replace("$ NUM", "dollar")
    txt = txt.replace("NUM bn", "dollar")
    txt = txt.replace("dollar bn", "dollar")
    txt = txt.replace("dollar dollar", "dollar")
    txt = txt.replace("NUM", "")
    txt = txt.replace("said", "") # the word said is too frequent, we can see it with word cloud
    
    txt = txt.lower() # lowercase
    
    # split into words
    words = word_tokenize(txt)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if not w in stop_words]
    
    # removing leftover punctuations
    words = [word for word in words if word.isalpha()]
    
    cleaned_text = ' '.join(words)
    return cleaned_text

Clean train and test news articles, using the above function.

df_train['Text'] = df_train['Text'].apply(lambda txt: clean_text(txt))
df_test['Text'] = df_test['Text'].apply(lambda txt: clean_text(txt))

df_train.head()
   ArticleId                                               Text  Category
0       1833  worldcom ex boss launches defence lawyers defe…   business
1        154  german business confidence slides german busin…   business
2       1101  bbc poll indicates economic gloom citizens maj…   business
3       1976  lifestyle governs mobile choice faster better …       tech
4        917  enron bosses payout eighteen former enron dire…   business

Find the unique categories of news articles; there are 5 of them.

cats = df_train['Category'].unique()
k = len(cats)
print(k, cats)
# 5 ['business' 'tech' 'politics' 'sport' 'entertainment']

Visualize

Now, let’s use the wordcloud library to find the most frequent words in the news articles corresponding to the different categories, using the function plot_wordcloud() defined below.

def plot_wordcloud(text, title, k=10):
  # Create and Generate a Word Cloud Image
  wordcloud = WordCloud(width = 3000, height = 2000, random_state=1, background_color='black', colormap='Set2', collocations=False, stopwords = STOPWORDS).generate(text)
  # top k words
  plt.figure(figsize=(10,5))
  print(f'category {cat}, top {k} words: {list(wordcloud.words_.keys())[:k]}')
  ax = sns.barplot(x=0, y=1, data=pd.DataFrame(wordcloud.words_.items()).head(k))
  ax.set(xlabel = 'words', ylabel='count', title=title)
  plt.show()
  #Display the generated image
  plt.figure(figsize=(15,15))
  plt.imshow(wordcloud, interpolation="bilinear"), plt.title(title, size=20), plt.axis("off")
  plt.show()

The next plots show a few of the most frequent words for each category of news articles. The following list shows the frequent words from each category:

  • business : ‘dollar’, ‘firm’, ‘market’, ‘sale’.
  • tech : ‘mobile’, ‘phone’, ‘service’, ‘network’, ‘software’.
  • politics : ‘labour’, ‘election’, ‘government’, ‘party’.
  • sport : ‘game’, ‘player’, ‘win’, ‘team’.
  • entertainment : ‘film’, ‘show’, ‘award’, ‘music’.
for cat in df_train['Category'].unique():
    plot_wordcloud(' '.join(df_train[df_train['Category']==cat]['Text'].values), f'words from training dataset (category {cat})')
# category business, top 10 words: ['dollar', 'year', 'us', 'mr', 'firm', 'new', 'market', 'sale', 'growth', 'company']
# category tech, top 10 words: ['people', 'new', 'mobile', 'phone', 'game', 'service', 'one', 'year', 'mr', 'user']
# category politics, top 10 words: ['mr', 'labour', 'election', 'government', 'blair', 'party', 'minister', 'people', 'new', 'say']
# category sport, top 10 words: ['game', 'year', 'first', 'win', 'time', 'england', 'player', 'two', 'back', 'world']
# category entertainment, top 10 words: ['film', 'year', 'best', 'award', 'one', 'show', 'new', 'us', 'star', 'music']

Our goal will be to find the topics from the news articles without using the ground-truth categories, using the topic models, where a topic will correspond to a category.

Extracting word features

Before we can start building models, we must process raw texts to feature vectors. Here we shall extract

  • bag-of-words features (with CountVectorizer), which represent a word (term) by its frequency (TF), and
  • term-frequency / inverse-document-frequency features (with TfidfVectorizer), where a word is represented by its TF-IDF score (the IDF score captures how rare the corresponding term is across the set of all articles),

as shown in the below figure, using the sklearn.feature_extraction.text module.

  • We shall only consider the top max_features=5000 features ordered by term frequency across the news articles.
  • When building the vocabulary we shall ignore terms that appear in more than 95% of the news articles (max_df=0.95) or in fewer than 2 of them (min_df=2).

Reason for choosing the features

  • Bag-of-words features may turn out to be adequate for news classification, since they associate term frequencies (along with the presence / absence of terms) with the news categories.
  • TF-IDF features are expected to produce better classification results, since they assign an importance weight to each term (e.g., how rare it is in the collection along with how frequent it is in an article).
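
As a quick illustration of the difference between the two representations, here is a minimal sketch on a toy corpus (this toy example is not part of the actual pipeline):

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

toy = ['market growth slows', 'film wins award', 'market sales grow']
bow = CountVectorizer().fit_transform(toy)    # raw term counts per document
tfidf = TfidfVectorizer().fit_transform(toy)  # counts reweighted by term rarity
# 'market' occurs in two of the three documents, so its tf-idf weight is
# damped relative to document-specific words such as 'film' or 'award'

With that in mind, let’s instantiate the vectorizers: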
n_feats = 5000
cvectorizer = CountVectorizer(stop_words='english', max_features=n_feats) #, tokenizer=LemmaTokenizer())
tvectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=n_feats) #, tokenizer=LemmaTokenizer())

Model Building

Let’s start by creating a train dataset on which the model will be trained, and a validation dataset on which the model will be evaluated (by matching predictions with the categories provided).

Create Train and Validation Dataset

  • Let’s hold out a small portion (10%) of the given training dataset to be treated as an unseen validation dataset, to be used later for model evaluation.
  • Let’s split the given training dataset further into train and validation splits; only the train split will be used for model training.
X_train, y_train, X_test = df_train['Text'].values, df_train['Category'].values, df_test['Text'].values
X_train_, X_val_, y_train_, y_val_ = train_test_split(X_train, y_train, test_size=0.1, random_state=0)
X_train.shape, X_train_.shape
# ((1490,), (1341,))

Next let’s try the unsupervised models based on matrix factorization to automatically extract the latent topics in the dataset and then match them with the provided categories.

Also, let’s try to think about and answer the following questions:

1. When training the unsupervised model for matrix factorization, should we include texts (word features) from the test dataset or not as the input matrix? Why or why not?

Since the words that are present in the test articles but not in the training articles can’t be associated with the ground-truth training categories, it’s best to ignore (exclude) them and fit the vectorizers on the training dataset only.

2. Build a model using the matrix factorization method(s) and predict the train and test data labels. Choose any hyperparameter (e.g., number of word features) to begin with.

We shall perform hyperparameter tuning with unsupervised NMF for the following hyperparameters (see the corresponding section below)

  • the number of word features with the TF-IDF vectorizer
  • the vectorizer itself, i.e., bag-of-words vs. TF-IDF
  • the regularization hyperparameters for the NMF model

and record the results by including graphs.

Unsupervised (Matrix Factorization) Models

We shall build a couple of different matrix factorization models, namely

  1. Truncated Singular Value Decomposition (TSVD)
  2. Non-Negative Matrix Factorization (NMF)
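
Both methods factor the document-term feature matrix into low-rank factors: with n articles, m terms and k topics, the approximation is

  X ≈ WH,

where X is the n × m feature matrix, W is the n × k document-topic weight matrix and H is the k × m topic-term weight matrix (NMF additionally constrains all entries of W and H to be non-negative).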

1. Truncated SVD

Let’s fit a truncated SVD model to extract 5 latent topics; we shall use the bag-of-words features this time.

tsvd = decomposition.TruncatedSVD(n_components=k, algorithm='randomized', n_iter=20, random_state=0)

vectors = cvectorizer.fit_transform(X_train_).todense() # (documents, vocab)
vocab = np.array(cvectorizer.get_feature_names_out())
vectors.shape, vocab.shape, vocab[:10]
# ((1341, 5000),
# (5000,),
# array(['abbas', 'ability', 'able', 'abroad', 'absence', 'absolute',
#       'absolutely', 'abuse', 'abused', 'ac'], dtype=object))

The function show_words_for_topics() returns the top 10 words for a given topic.

def show_words_for_topics(topic, num_top_words = 10):
    return np.apply_along_axis(lambda x: vocab[(np.argsort(-x))[:num_top_words]], 1, topic)

tsvd.fit(vectors)
topics = tsvd.components_
print(topics.shape)

print(show_words_for_topics(topics))
# (5, 5000)
# [['mr' 'people' 'dollar' 'new' 'year' 'government' 'time' 'uk' 'labour' 'world']
# ['mr' 'labour' 'blair' 'party' 'election' 'brown' 'minister' 'kilroy' 'government' 'prime']
# ['dollar' 'wage' 'minimum' 'increase' 'pay' 'tax' 'government' 'jobs' 'business' 'paid']
# ['best' 'film' 'dollar' 'actor' 'director' 'actress' 'awards' 'year' 'win' 'aviator']
# ['roddick' 'nadal' 'game' 'england' 'break' 'set' 'point' 'serve' 'ireland' 'match']]

From the top words (with the highest weights) for the above 5 latent topics, the last 3 clearly correspond to business, entertainment and sport, but the first two are not clear. Still, assigning a category to each of the first two, we get the following topic dictionary.

topic_dict = {0: 'tech', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'sport'}
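
Rather than eyeballing the top words, the topic-to-category mapping can also be derived automatically, e.g., by a majority vote of the training labels among the documents assigned to each topic. A minimal sketch (infer_topic_dict is a hypothetical helper, not part of the original code):

def infer_topic_dict(doc_topic_weights, labels):
    # assign each training document to its strongest latent topic, then
    # label each topic with the most frequent true category among its documents
    assignments = np.argmax(doc_topic_weights, axis=1)
    return {t: pd.Series(labels[assignments == t]).mode()[0]
            for t in range(doc_topic_weights.shape[1])}

# e.g., topic_dict = infer_topic_dict(tsvd.transform(vectors), y_train_)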

Also, note that the data is still pretty noisy and TSVD can only explain a small percentage of variance in the data.

print(tsvd.explained_variance_ratio_)
print(tsvd.explained_variance_ratio_.sum())
print(tsvd.singular_values_)
#[ 0.02561559 0.02636912 0.02167606 0.01908583 0.0168965 ]
# 0.10964309909422594
# [196.85488794 120.73593918 108.83479115 101.9456552   95.21317426]

Measure the prediction performance

Now, let’s predict the categories of the articles in the training and validation datasets using the model (assigning each article the topic with the highest weight). As can be seen, both training and validation accuracies are poor, so this model performs poorly on this dataset.

def get_predictions(model, X, vectorizer, topic_dict):
    vectors = vectorizer.transform(X).todense() # (documents, vocab)
    predictions = model.transform(vectors)
    predictions = np.argmax(predictions, axis=1)
    return [topic_dict[topic] for topic in predictions]

def compute_pred_accuracy(y, pred):
    return np.mean(y == pred)

print(compute_pred_accuracy(y_train_, get_predictions(tsvd, X_train_, cvectorizer, topic_dict)))
print(compute_pred_accuracy(y_val_, get_predictions(tsvd, X_val_, cvectorizer, topic_dict)))
# 0.3616703952274422
# 0.348993288590604

2. Non-Negative Matrix Factorization (NMF)

Now let’s use the non-negative matrix factorization model to find the latent topics, we shall use the tf-idf features this time.

vectors = tvectorizer.fit_transform(X_train_).todense() # (documents, vocab)
vocab = np.array(tvectorizer.get_feature_names_out())

Let’s now train the NMF model on the train-split features to find the 5 latent topics.

nmf = decomposition.NMF(
    n_components=k, 
    random_state=0, 
    init = "nndsvda", 
    beta_loss="frobenius",
    alpha_W=0.001,
    alpha_H=0.001,
    )

W1 = nmf.fit_transform(vectors)
H1 = nmf.components_

show_words_for_topics(H1)
# array([['england', 'game', 'win', 'wales', 'ireland', 'cup', 'france', 'team', 'half', 'play'],
#           ['mr', 'labour', 'blair', 'election', 'brown', 'party', 'government', 'minister', 'tax', 'howard'],
#           ['dollar', 'growth', 'economy', 'year', 'sales', 'economic', 'bank', 'market', 'china', 'rates'],
#           ['film', 'best', 'awards', 'award', 'actor', 'oscar', 'actress', 'films', 'director', 'star'],
#           ['people', 'mobile', 'music', 'phone', 'digital', 'technology', 'users', 'software', 'microsoft', 'new']], 
#             dtype=object)


As can be seen from the above output, the topics can be clearly mapped to the following categories.

topic_dict = {0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'}
# nmf.reconstruction_err_ 
# 35.75224975364894

Measure the prediction performance

Let’s predict the categories of the articles in the training and validation datasets using the NMF model. As can be seen, both the training and validation accuracies obtained are pretty decent this time.

print(compute_pred_accuracy(y_train_, get_predictions(nmf, X_train_,  tvectorizer, topic_dict)))
print(compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_, tvectorizer, topic_dict)))
# 0.9038031319910514
# 0.8993288590604027

The confusion matrix for the classification of the validation dataset is shown below; it also indicates that the classifier does a nice job.

from sklearn.metrics import confusion_matrix
import seaborn as sns
import pandas as pd
import matplotlib.pylab as plt

def plot_confusion_matrix(m, k=5):
    df_cm = pd.DataFrame(m, range(k), range(k))
    # plt.figure(figsize=(10,7))
    sns.set(font_scale=1.4) # for label size
    sns.heatmap(df_cm, annot=True, annot_kws={"size": 16}) # font size
    plt.show()

plot_confusion_matrix(confusion_matrix(y_val_, get_predictions(nmf, X_val_, tvectorizer, topic_dict)))

Let’s train the model (with the same hyperparameters) on the entire training dataset and extract the 5 latent topics again.

vectors = tvectorizer.fit_transform(X_train).todense() # (documents, vocab)
vocab = np.array(tvectorizer.get_feature_names_out())

nmf = decomposition.NMF(
    n_components=k, 
    random_state=0, 
    init = "nndsvda", 
    beta_loss="frobenius",
    alpha_W=0.001,
    alpha_H=0.001
    )

nmf.fit(vectors)
show_words_for_topics(nmf.components_)

# array([['england', 'game', 'win', 'wales', 'ireland', 'cup', 'play', 'team', 'france', 'time'],
#           ['mr', 'labour', 'election', 'blair', 'brown', 'party', 'government', 'minister', 'howard', 'tax'],
#           ['mobile', 'people', 'music', 'phone', 'digital', 'technology', 'phones', 'users', 'software', 'microsoft'],
#           ['film', 'best', 'awards', 'award', 'actor', 'oscar', 'actress', 'films', 'director', 'festival'],
#           ['dollar', 'growth', 'economy', 'year', 'sales', 'economic', 'market', 'bank', 'oil', 'china']], 
#            dtype=object)

Map the topics extracted to categories from the words belonging to the topics.

topic_dict = {0: 'sport', 1: 'politics', 2: 'tech', 3: 'entertainment', 4: 'business'}

Predict the topics of the articles in the test dataset and print the first 5 predictions.

predictions = get_predictions(nmf, X_test, tvectorizer, topic_dict)
print(predictions[:5])
# ['sport', 'tech', 'sport', 'business', 'sport']

Finally, let’s save and submit the predictions to Kaggle (late submission!).

def predict_save(predictions, submission_file='submission_df.csv'):
    submission_df = pd.read_csv('BBC News Sample Solution.csv')
    submission_df['Category'] = predictions
    submission_df.to_csv(submission_file, index=False)
    submission_df.head()
    return submission_df

predict_save(predictions)
   ArticleId  Category
0       1018     sport
1       1319      tech
2       1138     sport
3        459  business
4       1020     sport

Kaggle submission accuracy scores on test articles classification with unsupervised NMF

The following figure shows the test accuracy scores obtained on submission. The submissions could not be selected for the leaderboard, since the submission deadline had passed.

Hyperparameter Tuning for NMF / modifications to improve performance

Let’s now try to improve the classification accuracy by tuning a few hyperparameters of the NMF model.

1. Tuning the number of word features

topic_dicts = {5000: {0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'},
               2500: {0: 'business', 1: 'politics', 2: 'sport', 3: 'entertainment', 4: 'tech'},
               1000: {0: 'politics', 1: 'sport', 2: 'business', 3: 'entertainment', 4: 'tech'},
               500: {0: 'business', 1: 'sport', 2: 'entertainment', 3: 'politics', 4: 'tech'},
               100: {0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'}} # obtained using inspection from the top words for the topics

n_featss = [5000, 2500, 1000, 500, 100]
accs = []

for n_feats in n_featss:
    tvectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=n_feats) #, tokenizer=LemmaTokenizer())
    vectors = tvectorizer.fit_transform(X_train_).todense() # (documents, vocab)
    vocab = np.array(tvectorizer.get_feature_names_out())
    #print(len(vocab))
    nmf = decomposition.NMF(
            n_components=k, 
            random_state=0, 
            init = "nndsvda", 
            beta_loss="frobenius",
            alpha_W=0.001,
            alpha_H=0.001
        )

    nmf.fit(vectors)
    #print(show_words_for_topics(nmf.components_))

    topic_dict = topic_dicts[n_feats]
    accs.append(compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_,  tvectorizer, topic_dict)))

As can be seen from the plot below, the accuracy of classification increases as we increase the number of word features.

plt.plot(n_featss, accs)
plt.grid()
plt.xlabel('number of word features', size=15)
plt.ylabel('accuracy of NMF on validation', size=15)
plt.show()

2. Tuning the feature extraction methods

  • Let’s use bag-of-words features (CountVectorizer) instead of tf-idf features (TfidfVectorizer); the validation accuracy decreases, as shown in the next code snippet.
vectors = cvectorizer.fit_transform(X_train_).todense() # (documents, vocab)
vocab = np.array(cvectorizer.get_feature_names_out())

nmf = decomposition.NMF(
        n_components=k, 
        random_state=0, 
        init = "nndsvda", 
        beta_loss="frobenius",
        alpha_W=0.001,
        alpha_H=0.001
    )

nmf.fit(vectors)
print(show_words_for_topics(nmf.components_))

topic_dict = {0: 'tech', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'sport'}
print(compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_, cvectorizer, topic_dict)))

# [['people' 'mobile' 'music' 'new' 'technology' 'digital' 'tv' 'like' 'year' 'phone']
#  ['mr' 'labour' 'party' 'blair' 'election' 'brown' 'government' 'new' 'minister' 'kilroy']
#  ['dollar' 'wage' 'minimum' 'increase' 'government' 'people' 'year' 'pay' 'tax' 'business']
#  ['film' 'best' 'actor' 'director' 'actress' 'awards' 'aviator' 'year' 'foxx' 'jamie']
#  ['game' 'roddick' 'nadal' 'england' 'set' 'time' 'world' 'break' 'year' 'win']]
# 0.7046979865771812

3. Tuning Regularization Hyperparameters for NMF

  • Let’s tune the hyperparameters for NMF (e.g., with different values of the regularization hyperparameters alpha_W and alpha_H, and different values of init and beta_loss) to obtain a few different test accuracy scores on Kaggle for the test article classification (shown above).
  • One such hyperparameter tuning result is shown below (with alpha_W=alpha_H=0), with ~91% validation accuracy (which actually improves the performance).
nmf = decomposition.NMF(
        n_components=k, 
        random_state=0, 
        init = "nndsvda", 
        beta_loss="frobenius",
        alpha_W=0,
        alpha_H=0
    )

nmf.fit(vectors)
print(show_words_for_topics(nmf.components_))

topic_dict = {0: 'sport', 1: 'politics', 2: 'business', 3: 'entertainment', 4: 'tech'}
print(compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_,  tvectorizer, topic_dict)))
# [['england' 'game' 'win' 'wales' 'cup' 'ireland' 'team' 'france' 'half' 'play']
#  ['mr' 'labour' 'blair' 'election' 'brown' 'party' 'government' 'minister' 'howard' 'prime']
#  ['dollar' 'growth' 'economy' 'sales' 'year' 'bank' 'economic' 'market' 'china' 'oil']
#  ['film' 'best' 'awards' 'award' 'actor' 'oscar' 'actress' 'films' 'director' 'star']
#  ['mobile' 'people' 'music' 'phone' 'digital' 'technology' 'users' 'broadband' 'software' 'microsoft']]
# 0.912751677852349
  • The next code snippet varies the values of the regularization hyperparameters for training the NMF model and shows the validation accuracies in tabular form.
vectors = tvectorizer.fit_transform(X_train_).todense() # (documents, vocab)
vocab = np.array(tvectorizer.get_feature_names_out())
acc_df = pd.DataFrame(columns=['alpha_W', 'alpha_H', 'accuracy'])
i = 0
for aW in [0, 0.0001, 0.001]:
    for aH in [0, 0.0001, 0.001]:
        nmf = decomposition.NMF(n_components=k, random_state=0, 
                                init = "nndsvda", beta_loss="frobenius",
                                alpha_W=aW, alpha_H=aH)

        nmf.fit(vectors)
        acc = compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_,  tvectorizer, topic_dict))
        acc_df = pd.concat([acc_df, pd.DataFrame([{'alpha_W': aW, 'alpha_H': aH, 'accuracy': acc}])], ignore_index=True)
#acc_df.head(10)
acc_df = acc_df.pivot(index='alpha_W', columns='alpha_H', values='accuracy')
acc_df.head()
alpha_H   0.0000    0.0001    0.0010
alpha_W
0.0000    0.912752  0.906040  0.906040
0.0001    0.892617  0.899329  0.899329
0.0010    0.892617  0.899329  0.899329

Visualization

Let’s now plot the result of the above hyperparameter tuning: the accuracy obtained on the validation split, using the predictions from NMF models trained on the training split with different regularization hyperparameter values.

sns.set(font_scale=1.2)
sns.heatmap(acc_df, annot=True, linewidth=.5) 
plt.show()

Supervised (Multi-class) Classification Models

Now, let’s use the categories from the training dataset as class labels to train a few supervised multi-class classification models, and use them to predict the categories of the news articles in the test dataset.

Preprocessing

To start with, let’s preprocess the labels in the training dataset with a LabelEncoder, which maps each category name to an integer label.

le = LabelEncoder()
y_train = df_train['Category'].values
y_train = le.fit_transform(y_train)
X_train, X_test = df_train['Text'].values, df_test['Text'].values

Extract (tf-idf) features prior to building the models, the same way it’s done for the unsupervised case.

from time import time

def load_data_preprocess(train, test):
    t0 = time()
    vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, min_df=5, stop_words="english")
    X_train = vectorizer.fit_transform(train.Text.values)
    y_train = train.Category.values
    feature_names = vectorizer.get_feature_names_out()
    target_names = list(set(y_train))
    duration_train = time() - t0
    print('feature extraction time (train)',duration_train)
    # Extracting features from the test data using the same vectorizer
    t0 = time()
    X_test = vectorizer.transform(test.Text.values)
    duration_test = time() - t0
    print('feature extraction time (test)', duration_test)    
    return X_train, X_test, y_train, feature_names, target_names

X_train, X_test, y_train, feature_names, target_names = load_data_preprocess(df_train, df_test)
X_train.shape, y_train.shape, X_test.shape, len(feature_names), len(target_names)
# feature extraction time (train) 0.3478715419769287
# feature extraction time (test) 0.1718735694885254
# ((1490, 6852), (1490,), (735, 6852), 6852, 5)

RidgeClassifier model

Let’s train the RidgeClassifier linear model on the training dataset.

Model Evaluation

Let’s evaluate the model with the following couple of different approaches and compute the accuracy on the training and the held-out validation dataset, also using the confusion matrix.

1. with Validation dataset

As before, we shall split the training dataset into two parts: a training split (to train the model) and a validation split (held out to evaluate the trained model). First instantiate a Ridge classifier object, train the model on the training split, and then evaluate its performance (in terms of accuracy) both on the training split and on the validation dataset.

from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.linear_model import RidgeClassifier

X_train_, X_val_, y_train_, y_val_ = train_test_split(X_train, y_train, random_state=0)

clf = RidgeClassifier(tol=1e-2, solver="sparse_cg")

clf.fit(X_train_, y_train_)
print('accuracy on training: ', clf.score(X_train_, y_train_))
print('accuracy on validation: ', clf.score(X_val_, y_val_))

fig, ax = plt.subplots(figsize=(10, 10))
plt.rcParams.update({'font.size': 16})
ConfusionMatrixDisplay.from_estimator(clf, X_val_, y_val_, cmap='inferno', ax=ax)
plt.show()
# accuracy on training:  1.0
# accuracy on validation:  0.9839142091152815

2. with 5-fold cross-validation

Now, let’s try k=5 fold cross-validation and report the accuracies on the validation folds.

from sklearn.model_selection import cross_val_score

cross_val_score(estimator=clf, X=X_train, y=y_train, cv=5)
# array([0.96979866, 0.96308725, 0.98657718, 0.98322148, 0.98322148])

Since the classifier obtains pretty good performance on the validation dataset, let’s train it on the whole training dataset and use the trained model to predict the categories of the articles from the test dataset. Write the predictions to a csv file and submit it to Kaggle.

def train_predict(clf, X_train, y_train, X_test):
    # train
    clf.fit(X_train, y_train)
    # predict on train
    pred = clf.predict(X_train)
    print(f'training accuracy: {np.mean(pred == y_train)}') # compute accuracy
    # predict on test
    pred = clf.predict(X_test)
    return pred

pred_df = predict_save(train_predict(clf, X_train, y_train, X_test), 'learn-ai-bbc/BBC News RidgeClassifier Solution.csv')
pred_df.head()
# training accuracy: 1.0
   ArticleId  Category
0       1018     sport
1       1319      tech
2       1138     sport
3        459  business
4       1020     sport

As we can see, the accuracy of the ridge classifier model on the training dataset is 100%. However, when the prediction on the test dataset was submitted to Kaggle, it obtained a 98.5% test accuracy score for classification of the articles from the unseen dataset.

Visualizing the Decision Boundaries for Classification

Let’s visualize the decision boundaries for the categories using the linear model with 2D t-SNE projection.

from sklearn.manifold import TSNE

X_train_embedding = TSNE(n_components=2).fit_transform(X_train)
resolution = 100 # 100x100 background pixels
X2d_xmin, X2d_xmax = np.min(X_train_embedding[:,0]), np.max(X_train_embedding[:,0])
X2d_ymin, X2d_ymax = np.min(X_train_embedding[:,1]), np.max(X_train_embedding[:,1])
xx, yy = np.meshgrid(np.linspace(X2d_xmin, X2d_xmax, resolution), np.linspace(X2d_ymin, X2d_ymax, resolution))
background_model = clf.fit(X_train_embedding, clf.predict(X_train)) 
vbg = background_model.predict(np.c_[xx.ravel(), yy.ravel()])
vbg = vbg.reshape((resolution, resolution))
colors = {name: color for (name, color) in zip(target_names, range(len(target_names)))}
plt.figure(figsize=(20,10))
plt.contourf(xx, yy, np.array([colors[x] for x in vbg.ravel()]).reshape(vbg.shape), cmap='jet')
plt.scatter(X_train_embedding[:,0], X_train_embedding[:,1], c=np.array([colors[y] for y in y_train]), cmap='jet')
plt.title('Decision Boundaries after projection on 2D T-SNE manifold', size=30)
plt.show()

Model Selection: Hyperparameter Tuning With GridSearchCV

Now, let’s try GridSearchCV for hyperparameter tuning and model selection, for the RidgeClassifier, SVC (support vector classifier), random forest and gradient boosting ensemble classifier models.

1. RidgeClassifier Model

from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

def tune_hyperparameters(clf, param_grid, X_train, y_train, X_val, y_val):    
    grid = GridSearchCV(clf, param_grid, refit = True, verbose = 3, n_jobs=None) 
    grid.fit(X_train, y_train) 
    print('Best params:', grid.best_params_) 
    grid_pred = grid.predict(X_val) 
    print('val score:', grid.score(X_val, y_val))    
    print(classification_report(y_val, grid_pred))     

tune_hyperparameters(RidgeClassifier(tol=1e-2, solver="sparse_cg"), 
                     {'alpha': [1, 0.1, 0.01, 0.001, 0.0001, 0]}, 
                     X_train_, y_train_, X_val_, y_val_)  

# Fitting 5 folds for each of 6 candidates, totalling 30 fits
# [CV 1/5] END ...........................alpha=1;, score=0.973 total time=   0.0s
# [CV 2/5] END ...........................alpha=1;, score=0.960 total time=   0.0s
# [CV 3/5] END ...........................alpha=1;, score=0.973 total time=   0.0s
# [CV 4/5] END ...........................alpha=1;, score=0.987 total time=   0.0s
# [CV 5/5] END ...........................alpha=1;, score=0.987 total time=   0.0s
# [CV 1/5] END .........................alpha=0.1;, score=0.969 total time=   0.0s
# [CV 2/5] END .........................alpha=0.1;, score=0.964 total time=   0.0s
# [CV 3/5] END .........................alpha=0.1;, score=0.969 total time=   0.0s
# [CV 4/5] END .........................alpha=0.1;, score=0.991 total time=   0.0s
# [CV 5/5] END .........................alpha=0.1;, score=0.987 total time=   0.0s
# [CV 1/5] END ........................alpha=0.01;, score=0.969 total time=   0.0s
# [CV 2/5] END ........................alpha=0.01;, score=0.960 total time=   0.0s
# [CV 3/5] END ........................alpha=0.01;, score=0.969 total time=   0.0s
# [CV 4/5] END ........................alpha=0.01;, score=0.978 total time=   0.0s
# [CV 5/5] END ........................alpha=0.01;, score=0.987 total time=   0.0s
# [CV 1/5] END .......................alpha=0.001;, score=0.969 total time=   0.0s
# [CV 2/5] END .......................alpha=0.001;, score=0.960 total time=   0.0s
# [CV 3/5] END .......................alpha=0.001;, score=0.969 total time=   0.0s
# [CV 4/5] END .......................alpha=0.001;, score=0.978 total time=   0.0s
# [CV 5/5] END .......................alpha=0.001;, score=0.987 total time=   0.0s
# [CV 1/5] END ......................alpha=0.0001;, score=0.969 total time=   0.0s
# [CV 2/5] END ......................alpha=0.0001;, score=0.960 total time=   0.0s
# [CV 3/5] END ......................alpha=0.0001;, score=0.969 total time=   0.0s
# [CV 4/5] END ......................alpha=0.0001;, score=0.978 total time=   0.0s
# [CV 5/5] END ......................alpha=0.0001;, score=0.987 total time=   0.0s
# [CV 1/5] END ...........................alpha=0;, score=0.969 total time=   0.0s
# [CV 2/5] END ...........................alpha=0;, score=0.960 total time=   0.0s
# [CV 3/5] END ...........................alpha=0;, score=0.969 total time=   0.0s
# [CV 4/5] END ...........................alpha=0;, score=0.978 total time=   0.0s
# [CV 5/5] END ...........................alpha=0;, score=0.987 total time=   0.0s

# Best params: {'alpha': 1}
# val score: 0.9839142091152815
#                precision    recall  f1-score   support
#
#      business       0.99      0.98      0.98        86
# entertainment       0.99      0.99      0.99        73
#      politics       0.98      0.97      0.98        63
#         sport       0.97      1.00      0.98        84
#          tech       1.00      0.99      0.99        67
#
#      accuracy                           0.98       373
#     macro avg       0.98      0.98      0.98       373
#  weighted avg       0.98      0.98      0.98       373

predict_save(train_predict(RidgeClassifier(tol=1e-2, alpha=1, solver="sparse_cg"), X_train, y_train, X_test), 
              submission_file='learn-ai-bbc/BBC RidgeClassifier Solution.csv') 
# training accuracy: 1.0
     ArticleId       Category
0         1018          sport
1         1319           tech
2         1138          sport
3          459       business
4         1020          sport
...        ...            ...
730       1923       business
731        373  entertainment
732       1704       politics
733        206       business
734        471       politics

735 rows × 2 columns

2. Support Vector Classifier (SVC) Model

from sklearn.svm import SVC    

tune_hyperparameters(SVC(), 
                     {'C': [0.1, 1, 10, 100],  
                      'gamma':['scale', 'auto'],
                      'kernel': ['linear', 'rbf']}, 
                      X_train_, y_train_, X_val_, y_val_)  

# Fitting 5 folds for each of 16 candidates, totalling 80 fits
# Best params: {'C': 10, 'gamma': 'scale', 'kernel': 'linear'}

#                precision    recall  f1-score   support
#
#      business       0.99      0.98      0.98        86
# entertainment       0.96      0.99      0.97        73
#      politics       0.97      0.95      0.96        63
#         sport       0.99      0.99      0.99        84
#          tech       0.99      0.99      0.99        67
#
#      accuracy                           0.98       373
#     macro avg       0.98      0.98      0.98       373
#  weighted avg       0.98      0.98      0.98       373

Now, let’s train the SVC model with the best hyperparameters obtained, this time on the entire training dataset and then make a prediction on the unseen test dataset, so that we shall be ready for another submission.

predict_save(train_predict(SVC(C=10, gamma='scale', kernel="linear"), X_train, y_train, X_test), 
              submission_file='learn-ai-bbc/BBC News SVC Solution.csv') 
# training accuracy: 1.0
     ArticleId       Category
0         1018          sport
1         1319           tech
2         1138          sport
3          459       business
4         1020          sport
...        ...            ...
730       1923       business
731        373  entertainment
732       1704       politics
733        206       business
734        471       politics

735 rows × 2 columns

A few more ensemble classifier models (e.g., Random Forest, Gradient Boosting and Stacked Ensemble classifier models) were trained on the training dataset and then used to predict the test article categories.
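
As an illustrative sketch (the exact estimators and hyperparameters behind those submissions are not shown here), a stacked ensemble along these lines can be built with scikit-learn:

from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression

# illustrative configuration only, not the exact one used for the submissions
stacked = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=300, random_state=0)),
                ('gbm', GradientBoostingClassifier(random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000))
# predict_save(train_predict(stacked, X_train, y_train, X_test),
#              'learn-ai-bbc/BBC News Stacked Solution.csv')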

Kaggle submission accuracy scores on test articles classification with supervised models

The following screenshot shows the accuracy scores obtained on Kaggle on the unseen test dataset for the different supervised models. The score obtained on the unseen test dataset was much lower than the training accuracy for the GBM model, which implies that this model overfitted the training dataset more than the others.

Comparing the results obtained with the supervised classifiers vs. the unsupervised NMF

  • As can be seen from the above results, the supervised multi-class classification models outperform the unsupervised models based on matrix factorization and topic modeling.
  • The unsupervised approach (NMF) obtained a maximum score of ~91%, whereas the supervised approach obtained a ~97% score on the unseen test dataset.
  • The Linear Support Vector Classifier and the Stacked Ensemble Classifier performed the best on the unseen dataset.

Now, let’s change the training data size and observe how the train / test performance changes, to note which of the methods (supervised vs. unsupervised) is more data-efficient (requires a smaller amount of data to achieve similar results) and which is less prone to overfitting.

X_train, y_train = df_train['Text'].values, df_train['Category'].values
X_train_, X_val_, y_train_, y_val_ = train_test_split(X_train, y_train, random_state=0)
n_feats = 5000
tvectorizer = TfidfVectorizer(stop_words='english', max_df=0.95, min_df=2, max_features=n_feats) #, tokenizer=LemmaTokenizer())
train_vectors = tvectorizer.fit_transform(X_train_)
vocab = np.array(tvectorizer.get_feature_names_out())
test_vectors = tvectorizer.transform(X_val_)
n_train = len(X_train_)
k = 5
topic_dicts = {
              1: {0: 'business', 1: 'politics', 2: 'sport', 3: 'tech', 4: 'entertainment'},
              0.5: {0: 'tech', 1: 'politics', 2: 'sport', 3: 'entertainment', 4: 'business'},
              0.2: {0: 'politics', 1: 'sport', 2: 'business', 3: 'entertainment', 4: 'tech'},
              0.1: {0: 'business', 1: 'tech', 2: 'sport', 3: 'politics', 4: 'entertainment'}
              }
np.random.seed(0)
for p in [1, 0.5, 0.2, 0.1]:
    train_indices = np.random.choice(n_train, int(n_train * p))
    X_train_sub, y_train_sub = train_vectors[train_indices], y_train_[train_indices]
    #print(p) #, X_train_.shape, X_train_sub.shape)
    # supervised
    print('supervised SVC')
    print('test accuracy: {}'.format(
         compute_pred_accuracy(y_val_, train_predict(SVC(), X_train_sub, y_train_sub, test_vectors))))
    # unsupervised
    print('unsupervised NMF')
    nmf = decomposition.NMF(n_components = k, random_state = 0, 
                            init = "nndsvda", beta_loss = "frobenius",
                            alpha_W = 0.001, alpha_H = 0.001)
    nmf.fit(X_train_sub)
    #print(show_words_for_topics(nmf.components_))
    topic_dict = topic_dicts[p]
    print('training accuracy: {}'.format(
          compute_pred_accuracy(y_train_sub, get_predictions(nmf, X_train_[train_indices], tvectorizer, topic_dict)))
    )
    print('test accuracy: {}'.format(
          compute_pred_accuracy(y_val_, get_predictions(nmf, X_val_, tvectorizer, topic_dict))))
  • As can be seen from the above output, the supervised SVC classification model overfits much more than the NMF-based unsupervised classification model.
  • SVC outperforms NMF when at least 20% of the original training data is used for training. But when only 10% of the data is used, NMF outperforms SVC; hence, for very small training data sizes, NMF is more data-efficient.

Problem 2. MovieLens Ratings Prediction with NMF

Let’s load the movie ratings data (MovieLens 1M) and use the sklearn.decomposition module’s implementation of the non-negative matrix factorization (NMF) technique to predict the missing ratings from the test data. Let’s import the required libraries and read the data.

import pandas as pd
import numpy as np
from sklearn import decomposition
from sklearn.metrics import mean_squared_error as mse

MV_users = pd.read_csv('data/users.csv')
MV_movies = pd.read_csv('data/movies.csv')
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')

print(MV_users.shape, MV_movies.shape, train.shape, test.shape)
train.head()
#test.head()

# (6040, 5) (3883, 21) (700146, 3) (300063, 3)
    uID   mID  rating
0   744  1210       5
1  3040  1584       4
2  1451  1293       5
3  5455  3176       2
4  2507  3074       5
#sorted(train.uID.unique())
train[(train.uID == 5) & (train.mID == 6)]
       uID  mID  rating
33445    5    6       2

If we pivot the dataset, as can be seen below, the data contains a huge number of missing (NaN) values; the rating matrix is extremely sparse.

train_ratings = train.pivot(index='uID', columns='mID', values='rating')
print(train_ratings.shape)
train_ratings.head()
# (6040, 3664)
mID   1    2    3    4    5    6    7    8    9    10  ...  3943  3944  3945  3946  3947  3948  3949  3950  3951  3952
uID
1    5.0  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
2    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
3    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
4    NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN
5    NaN  NaN  NaN  NaN  NaN  2.0  NaN  NaN  NaN  NaN  ...   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN   NaN

5 rows × 3664 columns

Limitation(s) of sklearn’s non-negative matrix factorization library

The sklearn NMF implementation does not handle missing values. Hence we need to fill (impute) the missing values (with non-negative values) before we can use the NMF implementation to generate ratings for the movies that are not rated by the users.

  • To start with, let’s fill the missing ratings in the train dataset by zeros (for example with the function impute_missing()).
  • Then train an NMF model with k components (k = 10, for example) on the train dataset.
  • Use the NMF model fitted on train dataset to predict the ratings for the test dataset (with get_NMF_pred()).
  • Compare the prediction error with RMSE (with the function get_pred_RMSE()).
missing_locs = np.isnan(train_ratings)
train_ratings[missing_locs] = 0
train_ratings.head()
mID   1    2    3    4    5    6    7    8    9    10  ...  3943  3944  3945  3946  3947  3948  3949  3950  3951  3952
uID
1    5.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
2    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
3    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
4    0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
5    0.0  0.0  0.0  0.0  0.0  2.0  0.0  0.0  0.0  0.0  ...   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0

5 rows × 3664 columns

def impute_missing(ratings, val):
    ratings[np.isnan(ratings)] = val
    return ratings

def get_NMF_pred(train, test, impute_val = 0, impute_missing = impute_missing, k = 10):
    # impute
    train_ratings = train.pivot(index='uID', columns='mID', values='rating')
    train_ratings = impute_missing(train_ratings, impute_val) # replace NaN values
    # train NMF
    nmf = decomposition.NMF(
        n_components=k, 
        random_state=0, 
        init = "nndsvda", 
        beta_loss="frobenius",
        #alpha_W=0.001,
        #alpha_H=0.001,
        )
    W1 = nmf.fit_transform(train_ratings)
    H1 = nmf.components_
    print(f'NMF reconstruction error: {nmf.reconstruction_err_}')
    # predict with NMF
    pred = W1 @ H1
    pred_df = pd.DataFrame(data = pred,  
                      index = train_ratings.index.values, 
                      columns = train_ratings.columns.values) 
    pred_df['uID'] = pred_df.index.values
    #print(pred_df.head())
    pred_df = pd.melt(pred_df, id_vars=['uID'], var_name='mID', value_name='pred_rating')
    out_df = pred_df.merge(test, on=['uID', 'mID'])
    out_df.head()
    return out_df[['uID', 'mID', 'rating', 'pred_rating']]

def get_pred_RMSE(pred_df):
    return np.sqrt(mse(pred_df['rating'].values, pred_df['pred_rating'].values))

Impute the missing values in the training dataset with zeros, train NMF and predict

pred_df = get_NMF_pred(train, test)
pred_df.head()
# NMF reconstruction error: 2692.0484632343446
   uID  mID  rating  pred_rating
0    6    1       4     0.717425
1    8    1       4     0.704151
2   21    1       3     0.253164
3   23    1       4     1.319983
4   26    1       3     1.377998

RMSE with the NMF model

print(f'RMSE: {get_pred_RMSE(pred_df)}')
# RMSE: 2.911772946856598

RMSE with the Baseline model predict_everything_to_3

pred_df['pred_rating'] = 3
print(f'RMSE: {get_pred_RMSE(pred_df)}')
# RMSE: 1.2585673019351262

As can be seen from above, the prediction with NMF is poorer than that of the baseline model in terms of RMSE.

NMF results in poor performance since the rating matrix is extremely sparse and all the missing ratings are kept as zeros in the matrix being factorized (X = WH). With the Frobenius norm loss function, the predicted ratings are pushed towards zero, which is incorrect.
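
Concretely, with zero-imputation the objective NMF minimizes is

  min_{W,H ≥ 0} ||X − WH||_F² = Σ_{(i,j) observed} (X_ij − (WH)_ij)² + Σ_{(i,j) missing} ((WH)_ij)²,

so every missing entry contributes ((WH)_ij)² to the loss, and the factorization is rewarded for predicting ratings close to zero exactly where it should be filling in plausible ratings.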

Ways to improve the prediction with the NMF model

The issue could be fixed if the missing entries of the rating matrix were masked out when the loss function is computed; however, sklearn’s NMF implementation currently does not allow changing the loss function in this way.
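
For illustration only, such a masked (weighted) NMF can be written by hand with the classical multiplicative updates, applying a binary mask M (1 where a rating is observed, 0 otherwise) inside the Frobenius loss; a minimal numpy sketch (masked_nmf is a hypothetical helper, not sklearn API):

def masked_nmf(X, M, k=10, n_iter=200, eps=1e-9, seed=0):
    # multiplicative updates for min ||M * (X - WH)||_F^2 with W, H >= 0,
    # where * is elementwise; missing entries (M == 0) do not affect the loss
    rng = np.random.default_rng(seed)
    n, m = X.shape
    W, H = rng.random((n, k)), rng.random((k, m))
    MX = M * X
    for _ in range(n_iter):
        W *= (MX @ H.T) / ((M * (W @ H)) @ H.T + eps)
        H *= (W.T @ MX) / (W.T @ (M * (W @ H)) + eps)
    return W, H

# usage sketch:
# R = train.pivot(index='uID', columns='mID', values='rating').values
# M = (~np.isnan(R)).astype(float)
# W, H = masked_nmf(np.nan_to_num(R), M)
# pred = W @ H   # dense matrix of predicted ratings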

Short of implementing a masked factorization by hand (as in the sketch above), a simpler fix is to impute the missing values differently, for example using the next two approaches, which improve the prediction RMSE a lot.

1. RMSE with the NMF model, imputing the missing values with rating value 3

pred_df = get_NMF_pred(train, test, impute_val = 3)
print(f'RMSE: {get_pred_RMSE(pred_df)}')
# NMF reconstruction error: 995.1352921379064
# RMSE: 1.1500766186214655

2. RMSE with the NMF model, imputing the missing ratings for an item by the average user rating for the item

As described here (https://stackoverflow.com/questions/39367597/how-to-deal-with-missing-values-in-python-scikit-nmf/77255743#77255743), we can impute the missing ratings for an item by the average user rating for that item.

def impute_missing_avg_item_rating(ratings, val=None):
    missing_locs = np.isnan(ratings)
    mean = ratings.apply(np.nanmean, axis=0)
    ratings.fillna(mean, inplace=True)
    return ratings

pred_df = get_NMF_pred(train, test, impute_missing = impute_missing_avg_item_rating, impute_val = None)
print(f'RMSE: {get_pred_RMSE(pred_df)}')
# NMF reconstruction error: 807.7526430141833
# RMSE: 0.9651849775012515

Deep Generative Art – Monet Style Transfer with GANs (CycleGAN)

This problem appeared as a project in the coursera course Deep Learning (by the University of Colorado Boulder) and also appeared in a Kaggle Competition.

Brief description of the problem and data

In this project, the goal is to build a GAN that generates 7,000 to 10,000 Monet-style images.

Computer vision has advanced tremendously in recent years and GANs are now capable of mimicking objects in a very convincing way. But creating museum-worthy masterpieces is thought to be, well, more art than science. So can (data) science, in the form of GANs, trick classifiers into believing that we have created a true Monet? That’s the challenge we shall take on!

A GAN consists of at least two neural networks: a generator model and a discriminator model. The generator is a neural network that creates the images; for this competition, it should generate images in the style of Monet. The generator is trained with the help of a discriminator.

The two models will work against each other, with the generator trying to trick the discriminator, and the discriminator trying to accurately classify the real vs. generated images.

Exploratory Data Analysis (EDA)

In this project we are going to use a CycleGAN for style transfer. First we need to import all the python packages / functions required for building the GAN model (install any that are missing with pip). We shall use tensorflow / keras to train the generative model.

import numpy as np
import re, os, shutil
from glob import glob
import tqdm
import matplotlib.pylab as plt

# for building the model
import tensorflow as tf
import tensorflow.keras.backend as K
#! pip install tensorflow_addons
import tensorflow_addons as tfa
import tensorflow_datasets as tfds
from tensorflow import keras
from tensorflow.keras import layers, losses

Let’s read the tfrecords and create a tensorflow ZipDataset by combining the photo and monet-style images. We can see that the number of monet-style images (300) is much smaller than the number of photo images (7038), so the images are not paired.

def load_dataset(filenames, labeled=True, ordered=False, autotune=tf.data.experimental.AUTOTUNE):
    dataset = tf.data.TFRecordDataset(filenames)
    dataset = dataset.map(read_tfrecord, num_parallel_calls=autotune)
    return dataset

def decode_image(image, img_size=[256,256,3]):
    image = tf.image.decode_jpeg(image, channels=3)
    image = (tf.cast(image, tf.float32) / 127.5) - 1        
    image = tf.reshape(image, img_size)             
    return image

def read_tfrecord(example):
    tfrecord_format = {
        "image_name": tf.io.FixedLenFeature([], tf.string),
        "image":      tf.io.FixedLenFeature([], tf.string),
        "target":     tf.io.FixedLenFeature([], tf.string)
    }
    example = tf.io.parse_single_example(example, tfrecord_format)  
    image = decode_image(example['image'])    
    return image

def count_data_items(filenames):
    n = [int(re.compile(r"-([0-9]*)\.").search(filename).group(1)) for filename in filenames]
    return np.sum(n)

data_path = 'gan-getting-started'
monet_filenames = tf.io.gfile.glob(str(os.path.join(data_path, 'monet_tfrec', '*.tfrec')))
photo_filenames = tf.io.gfile.glob(str(os.path.join(data_path, 'photo_tfrec', '*.tfrec')))

monet_ds = load_dataset(monet_filenames)
photo_ds = load_dataset(photo_filenames)

n_monet_samples = count_data_items(monet_filenames)
n_photo_samples = count_data_items(photo_filenames)
dataset = tf.data.Dataset.zip((monet_ds, photo_ds))

n_monet_samples, n_photo_samples
# (300, 7038)

Let’s plot a few sample images from monet style and photo images.

def plot_images(images, title):
    plt.figure(figsize=(15,15))
    plt.subplots_adjust(0,0,1,0.95,0.05,0.05)
    j = 1
    for i in np.random.choice(len(images), 100, replace=False):
        plt.subplot(10,10,j), plt.imshow(images[i] / images[i].max()), plt.axis('off')
        j += 1
    plt.suptitle(title, size=25)
    plt.show()

monet_numpy = list(monet_ds.as_numpy_iterator())
plot_images(monet_numpy, 'Monet images')

Monet Style input images

plot_images(list(photo_ds.as_numpy_iterator()), 'Photo images')

Photo input images

Preprocessing

As can be seen, the images are transformed to have values in [-1, 1], as implemented in the function decode_image() above.

Model Architecture

Since paired data is harder to find in most domains (and not even possible in some), the unsupervised training capability of CycleGAN is quite useful: it does not require paired training data, which we don’t have in this case (300 monet images vs. ~7k photos). Hence, the problem can be formulated as unpaired image-to-image translation, and CycleGAN is an ideal model to use here. We shall train the CycleGAN model on the image dataset provided (to translate photos to monet-style images) and then use the generator to generate monet-style images. The next figure shows the architecture; a sketch of the loss functions the model is compiled with is shown below, followed by the CycleGAN implementation:
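
The train_step() below references gen_loss_fn, disc_loss_fn, cycle_loss_fn and identity_loss_fn, which are supplied via compile(). A minimal sketch of these losses, assuming the standard CycleGAN recipe with binary cross-entropy adversarial losses (the exact definitions used are not shown in this snippet):

bce = losses.BinaryCrossentropy(from_logits=True, reduction=losses.Reduction.NONE)

def discriminator_loss(real, generated):
    # real patches should be scored as 1, generated patches as 0
    return 0.5 * (tf.reduce_mean(bce(tf.ones_like(real), real)) +
                  tf.reduce_mean(bce(tf.zeros_like(generated), generated)))

def generator_loss(generated):
    # the generator wants the discriminator to score its output as real
    return tf.reduce_mean(bce(tf.ones_like(generated), generated))

def cycle_loss(real_image, cycled_image, LAMBDA):
    # L1 penalty for failing to recover the original image after a full cycle
    return LAMBDA * tf.reduce_mean(tf.abs(real_image - cycled_image))

def identity_loss(real_image, same_image, LAMBDA):
    # translating an image into its own domain should change it very little
    return LAMBDA * 0.5 * tf.reduce_mean(tf.abs(real_image - same_image))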

class CycleGAN(keras.Model):
    def __init__(
        self,
        monet_generator,
        photo_generator,
        monet_discriminator,
        photo_discriminator,
        lambda_cycle=10,
    ):
        super(CycleGAN, self).__init__()
        self.m_gen = monet_generator
        self.p_gen = photo_generator
        self.m_disc = monet_discriminator
        self.p_disc = photo_discriminator
        self.lambda_cycle = lambda_cycle
        
    def compile(
        self,
        m_gen_optimizer,
        p_gen_optimizer,
        m_disc_optimizer,
        p_disc_optimizer,
        gen_loss_fn,
        disc_loss_fn,
        cycle_loss_fn,
        identity_loss_fn
    ):
        super(CycleGAN, self).compile()
        self.m_gen_optimizer = m_gen_optimizer
        self.p_gen_optimizer = p_gen_optimizer
        self.m_disc_optimizer = m_disc_optimizer
        self.p_disc_optimizer = p_disc_optimizer
        self.gen_loss_fn = gen_loss_fn
        self.disc_loss_fn = disc_loss_fn
        self.cycle_loss_fn = cycle_loss_fn
        self.identity_loss_fn = identity_loss_fn

    def generate(self, image):
        return self.m_gen(tf.expand_dims(image, axis=0), training=False)

    def load(
        self, 
        filepath
    ):
        self.m_gen.load_weights(filepath.replace('model_name', 'm_gen'), by_name=True)
        self.p_gen.load_weights(filepath.replace('model_name', 'p_gen'), by_name=True)
        self.m_disc.load_weights(filepath.replace('model_name', 'm_disc'), by_name=True)
        self.p_disc.load_weights(filepath.replace('model_name', 'p_disc'), by_name=True)

    def save(
        self, 
        filepath
    ):
        self.m_gen.save(filepath.replace('model_name', 'm_gen'))
        self.p_gen.save(filepath.replace('model_name', 'p_gen'))
        self.m_disc.save(filepath.replace('model_name', 'm_disc'))
        self.p_disc.save(filepath.replace('model_name', 'p_disc'))

        
    def train_step(self, batch_data):
        real_monet, real_photo = batch_data
        
        with tf.GradientTape(persistent=True) as tape:
            # photo to monet back to photo
            real_photo = tf.expand_dims(real_photo, axis=0)
            real_monet = tf.expand_dims(real_monet, axis=0)

            fake_monet = self.m_gen(real_photo, training=True)
            cycled_photo = self.p_gen(fake_monet, training=True)

            # monet to photo back to monet
            fake_photo = self.p_gen(real_monet, training=True)
            cycled_monet = self.m_gen(fake_photo, training=True)

            # generating itself
            same_monet = self.m_gen(real_monet, training=True)
            same_photo = self.p_gen(real_photo, training=True)

            # discriminator used to check, feeding in real images
            disc_real_monet = self.m_disc(real_monet, training=True)
            disc_real_photo = self.p_disc(real_photo, training=True)

            # discriminator used to check, feeding in fake images
            disc_fake_monet = self.m_disc(fake_monet, training=True)
            disc_fake_photo = self.p_disc(fake_photo, training=True)

            # evaluates generator loss
            monet_gen_loss = self.gen_loss_fn(disc_fake_monet)
            photo_gen_loss = self.gen_loss_fn(disc_fake_photo)

            # evaluates total cycle consistency loss
            total_cycle_loss = self.cycle_loss_fn(real_monet, cycled_monet, self.lambda_cycle) + self.cycle_loss_fn(real_photo, cycled_photo, self.lambda_cycle)

            # evaluates total generator loss
            total_monet_gen_loss = monet_gen_loss + total_cycle_loss + self.identity_loss_fn(real_monet, same_monet, self.lambda_cycle)
            total_photo_gen_loss = photo_gen_loss + total_cycle_loss + self.identity_loss_fn(real_photo, same_photo, self.lambda_cycle)

            # evaluates discriminator loss
            monet_disc_loss = self.disc_loss_fn(disc_real_monet, disc_fake_monet)
            photo_disc_loss = self.disc_loss_fn(disc_real_photo, disc_fake_photo)

        # Calculate the gradients for generator and discriminator
        monet_generator_gradients = tape.gradient(total_monet_gen_loss, self.m_gen.trainable_variables)
        photo_generator_gradients = tape.gradient(total_photo_gen_loss, self.p_gen.trainable_variables)
        monet_discriminator_gradients = tape.gradient(monet_disc_loss, self.m_disc.trainable_variables)
        photo_discriminator_gradients = tape.gradient(photo_disc_loss, self.p_disc.trainable_variables)

        # Apply the gradients to the optimizer
        self.m_gen_optimizer.apply_gradients(zip(monet_generator_gradients, self.m_gen.trainable_variables))
        self.p_gen_optimizer.apply_gradients(zip(photo_generator_gradients, self.p_gen.trainable_variables))
        self.m_disc_optimizer.apply_gradients(zip(monet_discriminator_gradients, self.m_disc.trainable_variables))
        self.p_disc_optimizer.apply_gradients(zip(photo_discriminator_gradients, self.p_disc.trainable_variables))
        
        total_loss = total_monet_gen_loss + total_photo_gen_loss + monet_disc_loss + photo_disc_loss

        return {
            "total_loss": total_loss,
            "monet_gen_loss": total_monet_gen_loss,
            "photo_gen_loss": total_photo_gen_loss,
            "monet_disc_loss": monet_disc_loss,
            "photo_disc_loss": photo_disc_loss
        }

The CycleGAN is an extension of the GAN architecture that involves the simultaneous training of two generator models and two discriminator models. The CycleGAN uses an additional extension to the architecture called cycle consistency. This is the idea that an image output by the first generator could be used as input to the second generator and the output of the second generator should match the original image.
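
Concretely, writing G for the monet generator (photo → monet) and F for the photo generator (monet → photo), the cycle-consistency term computed in train_step() via calc_cycle_loss() (defined further below) is an L1 reconstruction error scaled by lambda_cycle:

L_cyc = lambda_cycle * ( mean|monet - G(F(monet))| + mean|photo - F(G(photo))| )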

The Discriminator is a deep convolutional neural network that performs image classification. It takes an image as input and predicts the likelihood of that image being real or fake. Two discriminator models are used, one for Domain-A (photos) and one for Domain-B (monet paintings).

The generator is an encoder-decoder model architecture. The discriminator models are trained directly on real and generated images, whereas the generator models are not. The model takes a source image (e.g. a photo) and generates a target image (e.g. a monet image). It does this by first downsampling or encoding the input image down to a bottleneck layer, then upsampling or decoding the representation back to the size of the output image through a series of layers, with skip connections between matching encoder and decoder layers (a U-Net-style design, as in the implementation below).

def Generator(img_shape=[256, 256, 3]):
    inputs = layers.Input(shape=img_shape)
    down_stack = [
        downsample(64, 4, apply_instancenorm=False),
        downsample(128, 4),
        downsample(256, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),
        downsample(512, 4),
    ]
    up_stack = [
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4, apply_dropout=True),
        upsample(512, 4),
        upsample(256, 4),
        upsample(128, 4),
        upsample(64, 4),
    ]

    initializer = tf.random_normal_initializer(0., 0.02)
    last = layers.Conv2DTranspose(3, 4, strides=2, padding='same', kernel_initializer=initializer, activation='tanh')

    x = inputs
    skips = []
    for down in down_stack:
        x = down(x)
        skips.append(x)
    skips = reversed(skips[:-1])

    for up, skip in zip(up_stack, skips):
        x = up(x)
        x = layers.Concatenate()([x, skip])
    x = last(x)
    return keras.Model(inputs=inputs, outputs=x)


def Discriminator(img_shape=[256, 256, 3]):
    initializer = tf.random_normal_initializer(0., 0.02)
    gamma_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)

    inp = layers.Input(shape=img_shape, name='input_image')
    x = inp

    x = downsample(64, 4, False)(x) 
    x = downsample(128, 4)(x) 
    x = downsample(256, 4)(x) 

    x = layers.ZeroPadding2D()(x) 
    x = layers.Conv2D(512, 4, strides=1, kernel_initializer=initializer, use_bias=False)(x) 
    x = tfa.layers.InstanceNormalization(gamma_initializer=gamma_init)(x)
    x = layers.LeakyReLU()(x)
    x = layers.ZeroPadding2D()(x) 
    x = layers.Conv2D(1, 4, strides=1, kernel_initializer=initializer)(x) 

    return tf.keras.Model(inputs=inp, outputs=x)

def downsample(filters, size, apply_instancenorm=True):
    initializer = tf.random_normal_initializer(0., 0.02)
    gamma_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)

    result = keras.Sequential()
    result.add(layers.Conv2D(filters, size, strides=2, padding='same', kernel_initializer=initializer, use_bias=False))
    if apply_instancenorm:
        result.add(tfa.layers.InstanceNormalization(gamma_initializer=gamma_init))
    result.add(layers.LeakyReLU())
    return result


def upsample(filters, size, apply_dropout=False):
    initializer = tf.random_normal_initializer(0., 0.02)
    gamma_init = keras.initializers.RandomNormal(mean=0.0, stddev=0.02)
    result = keras.Sequential()
    result.add(layers.Conv2DTranspose(filters, size, strides=2, padding='same', kernel_initializer=initializer, use_bias=False))
    result.add(tfa.layers.InstanceNormalization(gamma_initializer=gamma_init))
    if apply_dropout:
        result.add(layers.Dropout(0.5))
    result.add(layers.ReLU())
    return result

def discriminator_loss(real, generated):
    real_loss = losses.BinaryCrossentropy(from_logits=True, reduction=losses.Reduction.NONE)(tf.ones_like(real), real)
    generated_loss = losses.BinaryCrossentropy(from_logits=True, reduction=losses.Reduction.NONE)(tf.zeros_like(generated), generated)
    total_disc_loss = real_loss + generated_loss
    return total_disc_loss * 0.5
    
def generator_loss(generated):
    return losses.BinaryCrossentropy(from_logits=True, reduction=losses.Reduction.NONE)(tf.ones_like(generated), generated)
        
def calc_cycle_loss(real_image, cycled_image, LAMBDA):
    return LAMBDA * tf.reduce_mean(tf.abs(real_image - cycled_image))

def identity_loss(real_image, same_image, LAMBDA):
    return LAMBDA * 0.5 * tf.reduce_mean(tf.abs(real_image - same_image))

img_shape = [256, 256, 3]

model = CycleGAN(
    monet_generator=Generator(img_shape),
    photo_generator=Generator(img_shape),
    monet_discriminator=Discriminator(img_shape),
    photo_discriminator=Discriminator(img_shape),
    lambda_cycle=10
)

model.compile(
    m_gen_optimizer=tf.keras.optimizers.Adam(1e-4, beta_1=0.5),
    p_gen_optimizer=tf.keras.optimizers.Adam(1e-4, beta_1=0.5),
    m_disc_optimizer=tf.keras.optimizers.Adam(1e-4, beta_1=0.5),
    p_disc_optimizer=tf.keras.optimizers.Adam(1e-4, beta_1=0.5),
    gen_loss_fn=generator_loss,
    disc_loss_fn=discriminator_loss,
    cycle_loss_fn=calc_cycle_loss,
    identity_loss_fn=identity_loss
)

Results and Analysis

Let’s train the GAN model (both the generators and discriminators simultaneously) for 50 epochs.

# Train the model
#tf.config.run_functions_eagerly(True)
#tf.get_logger().setLevel('INFO')
epochs = 50
# the zipped dataset is unbatched, so each training step processes a single (monet, photo) pair
history = model.fit(dataset, epochs=epochs)

Epoch 1/50
2023-03-27 19:18:41.116027: E tensorflow/core/grappler/optimizers/meta_optimizer.cc:954] layout failed: INVALID_ARGUMENT: Size of values 0 does not match size of permutation 4 @ fanin shape inmodel/sequential_8/dropout/dropout_2/SelectV2-2-TransposeNHWCToNCHW-LayoutOptimizer
300/300 [==============================] - 178s 442ms/step - total_loss: 13.5143 - monet_gen_loss: 6.0788 - photo_gen_loss: 6.2427 - monet_disc_loss: 0.6130 - photo_disc_loss: 0.5799
Epoch 2/50
300/300 [==============================] - 132s 440ms/step - total_loss: 9.6236 - monet_gen_loss: 4.1237 - photo_gen_loss: 4.3565 - monet_disc_loss: 0.6405 - photo_disc_loss: 0.5029
Epoch 3/50
300/300 [==============================] - 132s 440ms/step - total_loss: 9.1091 - monet_gen_loss: 3.8061 - photo_gen_loss: 4.2105 - monet_disc_loss: 0.6381 - photo_disc_loss: 0.4545
Epoch 4/50
300/300 [==============================] - 132s 440ms/step - total_loss: 9.0574 - monet_gen_loss: 3.8176 - photo_gen_loss: 4.1377 - monet_disc_loss: 0.5952 - photo_disc_loss: 0.5070
Epoch 5/50
300/300 [==============================] - 132s 440ms/step - total_loss: 9.0721 - monet_gen_loss: 3.8688 - photo_gen_loss: 4.1645 - monet_disc_loss: 0.5462 - photo_disc_loss: 0.4925
Epoch 6/50
300/300 [==============================] - 132s 440ms/step - total_loss: 9.1115 - monet_gen_loss: 3.8860 - photo_gen_loss: 4.1123 - monet_disc_loss: 0.5778 - photo_disc_loss: 0.5354
Epoch 7/50
300/300 [==============================] - 132s 439ms/step - total_loss: 8.9233 - monet_gen_loss: 3.7933 - photo_gen_loss: 3.9551 - monet_disc_loss: 0.5987 - photo_disc_loss: 0.5761
Epoch 8/50
300/300 [==============================] - 132s 440ms/step - total_loss: 8.7080 - monet_gen_loss: 3.6934 - photo_gen_loss: 3.8289 - monet_disc_loss: 0.6077 - photo_disc_loss: 0.5780
Epoch 9/50
300/300 [==============================] - 132s 440ms/step - total_loss: 8.5513 - monet_gen_loss: 3.6243 - photo_gen_loss: 3.7549 - monet_disc_loss: 0.5994 - photo_disc_loss: 0.5727
Epoch 10/50
300/300 [==============================] - 132s 440ms/step - total_loss: 8.4467 - monet_gen_loss: 3.5701 - photo_gen_loss: 3.7059 - monet_disc_loss: 0.6021 - photo_disc_loss: 0.5686
Epoch 11/50
300/300 [==============================] - 132s 440ms/step - total_loss: 8.3335 - monet_gen_loss: 3.5100 - photo_gen_loss: 3.6443 - monet_disc_loss: 0.6047 - photo_disc_loss: 0.5744
Epoch 12/50
300/300 [==============================] - 132s 440ms/step - total_loss: 8.1454 - monet_gen_loss: 3.3979 - photo_gen_loss: 3.5508 - monet_disc_loss: 0.6197 - photo_disc_loss: 0.5770
Epoch 13/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.9934 - monet_gen_loss: 3.3216 - photo_gen_loss: 3.4639 - monet_disc_loss: 0.6189 - photo_disc_loss: 0.5889
Epoch 14/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.8940 - monet_gen_loss: 3.2744 - photo_gen_loss: 3.4177 - monet_disc_loss: 0.6144 - photo_disc_loss: 0.5875
Epoch 15/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.8149 - monet_gen_loss: 3.2327 - photo_gen_loss: 3.3816 - monet_disc_loss: 0.6144 - photo_disc_loss: 0.5862
Epoch 16/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.7611 - monet_gen_loss: 3.2000 - photo_gen_loss: 3.3537 - monet_disc_loss: 0.6176 - photo_disc_loss: 0.5898
Epoch 17/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.7349 - monet_gen_loss: 3.1919 - photo_gen_loss: 3.3390 - monet_disc_loss: 0.6143 - photo_disc_loss: 0.5897
Epoch 18/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.6678 - monet_gen_loss: 3.1610 - photo_gen_loss: 3.2970 - monet_disc_loss: 0.6160 - photo_disc_loss: 0.5938
Epoch 19/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.6531 - monet_gen_loss: 3.1560 - photo_gen_loss: 3.2872 - monet_disc_loss: 0.6158 - photo_disc_loss: 0.5941
Epoch 20/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.5918 - monet_gen_loss: 3.1216 - photo_gen_loss: 3.2546 - monet_disc_loss: 0.6182 - photo_disc_loss: 0.5974
Epoch 21/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.5678 - monet_gen_loss: 3.1125 - photo_gen_loss: 3.2480 - monet_disc_loss: 0.6141 - photo_disc_loss: 0.5932
Epoch 22/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.5094 - monet_gen_loss: 3.0878 - photo_gen_loss: 3.2083 - monet_disc_loss: 0.6158 - photo_disc_loss: 0.5975
Epoch 23/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.5019 - monet_gen_loss: 3.0877 - photo_gen_loss: 3.2045 - monet_disc_loss: 0.6138 - photo_disc_loss: 0.5958
Epoch 24/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.4443 - monet_gen_loss: 3.0563 - photo_gen_loss: 3.1699 - monet_disc_loss: 0.6180 - photo_disc_loss: 0.6001
Epoch 25/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.4156 - monet_gen_loss: 3.0429 - photo_gen_loss: 3.1655 - monet_disc_loss: 0.6165 - photo_disc_loss: 0.5907
Epoch 26/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.3863 - monet_gen_loss: 3.0236 - photo_gen_loss: 3.1437 - monet_disc_loss: 0.6213 - photo_disc_loss: 0.5977
Epoch 27/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.3447 - monet_gen_loss: 3.0054 - photo_gen_loss: 3.1185 - monet_disc_loss: 0.6231 - photo_disc_loss: 0.5977
Epoch 28/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.3085 - monet_gen_loss: 2.9852 - photo_gen_loss: 3.0985 - monet_disc_loss: 0.6254 - photo_disc_loss: 0.5994
Epoch 29/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.2748 - monet_gen_loss: 2.9651 - photo_gen_loss: 3.0832 - monet_disc_loss: 0.6306 - photo_disc_loss: 0.5960
Epoch 30/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.2610 - monet_gen_loss: 2.9589 - photo_gen_loss: 3.0837 - monet_disc_loss: 0.6276 - photo_disc_loss: 0.5908
Epoch 31/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.1940 - monet_gen_loss: 2.9234 - photo_gen_loss: 3.0399 - monet_disc_loss: 0.6340 - photo_disc_loss: 0.5967
Epoch 32/50
300/300 [==============================] - 132s 440ms/step - total_loss: 7.1518 - monet_gen_loss: 2.9051 - photo_gen_loss: 3.0160 - monet_disc_loss: 0.6356 - photo_disc_loss: 0.5952
Epoch 33/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.1283 - monet_gen_loss: 2.8942 - photo_gen_loss: 3.0005 - monet_disc_loss: 0.6368 - photo_disc_loss: 0.5969
Epoch 34/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.0966 - monet_gen_loss: 2.8813 - photo_gen_loss: 2.9824 - monet_disc_loss: 0.6379 - photo_disc_loss: 0.5950
Epoch 35/50
300/300 [==============================] - 132s 439ms/step - total_loss: 7.0460 - monet_gen_loss: 2.8441 - photo_gen_loss: 2.9676 - monet_disc_loss: 0.6409 - photo_disc_loss: 0.5934
Epoch 36/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.9889 - monet_gen_loss: 2.8174 - photo_gen_loss: 2.9290 - monet_disc_loss: 0.6417 - photo_disc_loss: 0.6008
Epoch 37/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.9392 - monet_gen_loss: 2.7863 - photo_gen_loss: 2.9042 - monet_disc_loss: 0.6431 - photo_disc_loss: 0.6055
Epoch 38/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.9482 - monet_gen_loss: 2.8005 - photo_gen_loss: 2.9125 - monet_disc_loss: 0.6349 - photo_disc_loss: 0.6004
Epoch 39/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.9012 - monet_gen_loss: 2.7681 - photo_gen_loss: 2.8925 - monet_disc_loss: 0.6363 - photo_disc_loss: 0.6043
Epoch 40/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.7845 - monet_gen_loss: 2.7000 - photo_gen_loss: 2.8238 - monet_disc_loss: 0.6453 - photo_disc_loss: 0.6154
Epoch 41/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.7913 - monet_gen_loss: 2.6840 - photo_gen_loss: 2.8536 - monet_disc_loss: 0.6443 - photo_disc_loss: 0.6094
Epoch 42/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.6722 - monet_gen_loss: 2.6384 - photo_gen_loss: 2.7718 - monet_disc_loss: 0.6458 - photo_disc_loss: 0.6163
Epoch 43/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.7128 - monet_gen_loss: 2.6721 - photo_gen_loss: 2.7917 - monet_disc_loss: 0.6393 - photo_disc_loss: 0.6097
Epoch 44/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.6979 - monet_gen_loss: 2.6490 - photo_gen_loss: 2.8041 - monet_disc_loss: 0.6406 - photo_disc_loss: 0.6041
Epoch 45/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.9362 - monet_gen_loss: 2.8560 - photo_gen_loss: 2.9042 - monet_disc_loss: 0.5885 - photo_disc_loss: 0.5875
Epoch 46/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.8214 - monet_gen_loss: 2.7380 - photo_gen_loss: 2.8306 - monet_disc_loss: 0.6469 - photo_disc_loss: 0.6059
Epoch 47/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.6745 - monet_gen_loss: 2.6516 - photo_gen_loss: 2.7894 - monet_disc_loss: 0.6332 - photo_disc_loss: 0.6002
Epoch 48/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.7603 - monet_gen_loss: 2.7036 - photo_gen_loss: 2.8374 - monet_disc_loss: 0.6253 - photo_disc_loss: 0.5939
Epoch 49/50
300/300 [==============================] - 132s 440ms/step - total_loss: 6.8899 - monet_gen_loss: 2.8052 - photo_gen_loss: 2.9074 - monet_disc_loss: 0.6024 - photo_disc_loss: 0.5748
Epoch 50/50
300/300 [==============================] - 132s 439ms/step - total_loss: 6.7139 - monet_gen_loss: 2.6874 - photo_gen_loss: 2.8110 - monet_disc_loss: 0.6241 - photo_disc_loss: 0.5913

Let’s save the model. Note that the save() method defined above replaces the substring 'model_name' in the filepath with each sub-model’s name, so the filepath must contain that placeholder (otherwise all four sub-models would overwrite the same file).

model.save('cyclegan_model_name_100.h5')
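
The saved weights can be restored later with the load() method defined above; a minimal usage sketch, assuming the same filepath pattern:

model.load('cyclegan_model_name_100.h5')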

Let’s plot the losses for the generators and the discriminators; for each we plot the epoch-wise mean loss.

history.history.keys()
# dict_keys(['total_loss', 'monet_gen_loss', 'photo_gen_loss', 'monet_disc_loss', 'photo_disc_loss'])
def plot_hist(hist):
    # each logged loss may be a per-batch array, so plot the epoch-wise mean
    plt.figure(figsize=(10,5))
    plt.subplot(121)
    plt.plot([np.mean(l) for l in hist.history['monet_gen_loss']])
    plt.plot([np.mean(l) for l in hist.history['photo_gen_loss']])
    plt.legend(["monet","photo"])
    plt.title('Generator Loss')
    plt.ylabel("Loss")
    plt.xlabel("Epoch")
    plt.grid()
    plt.subplot(122)
    plt.plot([np.mean(l) for l in hist.history['monet_disc_loss']])
    plt.plot([np.mean(l) for l in hist.history['photo_disc_loss']])
    plt.title("Discriminator Loss")
    plt.ylabel("Loss")
    plt.xlabel("Epoch")
    plt.grid()
    plt.legend(["monet","photo"])
    plt.tight_layout()
    plt.show()

plot_hist(history)

Finally, let’s use the trained generator to generate ~7k monet-style images from the photos and save / submit the notebook to kaggle.

! mkdir ../images

import tqdm, shutil
from glob import glob

def generate(dataset):
    dataset_iter = iter(dataset)
    out_dir = '../images/'
    for i in tqdm.tqdm(range(n_photo_samples)):
        # Get the image from the dataset iterator
        img = next(dataset_iter)
        prediction = model.generate(img)
        prediction = tf.squeeze(prediction).numpy()
        prediction = (prediction * 127.5 + 127.5).astype(np.uint8)   
        plt.imsave(os.path.join(out_dir, 'image_{:04d}.jpg'.format(i)), prediction)

generate(photo_ds)

shutil.make_archive("/kaggle/working/images", 'zip', "/kaggle/images")

The next figure shows a few of the generated images.

plot_images([plt.imread(f) for f in glob('../images/*.jpg')[:200]], 'Generated Images')

Git Repository

https://github.com/sandipan/Monet-Style-Transfer-with-GANs-Kaggle-Mini-Project

Kaggle Notebook

https://www.kaggle.com/code/sandipanumbc/monet-style-transfer-with-cyclegan

Conclusion

As we can see from the above results, the CycleGAN does a pretty good job of generating monet-style images from photos. The model was trained for 50 epochs; since the losses were still decreasing, training for more epochs (e.g., 100) would likely yield a better model.

Kaggle uses an evaluation metric called MiFID (Memorization-informed Fréchet Inception Distance) score to evaluate the quality of generated images. The score obtained on Kaggle is ~54.18 and the leaderboard position is 49, as shown in the following screenshots.
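
For reference, the FID term compares Gaussian fits to the Inception-feature statistics of real and generated images (MiFID additionally applies a memorization penalty based on similarity to the training set). Below is a minimal numpy sketch of the plain FID computation, assuming the Inception activations feat_real and feat_gen have already been extracted (the function name and inputs are illustrative, not Kaggle’s actual implementation):

import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feat_real, feat_gen):
    # feat_real, feat_gen: (n_samples, n_features) Inception activations (assumed precomputed)
    mu_r, mu_g = feat_real.mean(axis=0), feat_gen.mean(axis=0)
    cov_r = np.cov(feat_real, rowvar=False)
    cov_g = np.cov(feat_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):  # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2 * covmean)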

Disaster Tweets Classification with RNN – GRU/ BiLSTM/ BERT/ USE

This problem appeared in a project in the coursera course Deep Learning (by the University of Colorado Boulder) and also as a past Kaggle competition.

Brief description of the problem and data

In this project, we shall build a machine learning model that predicts which tweets are about real disasters and which ones aren’t. We shall have access to a dataset of 10,000 tweets that were hand-classified a priori and hence contain ground-truth labels. We shall train a binary text classification model on these tweets and then use it to predict the class labels for the unseen test data.

Given a train and a test csv file, where each sample in the train and test set has the following information:

  • The text of a tweet
  • A keyword from that tweet (although this may be blank!)
  • The location the tweet was sent from (may also be blank)

We shall predict whether a given tweet is about a real disaster or not. If so, predict a 1. If not, predict a 0.

Twitter has become an important communication channel in times of emergency. The ubiquitousness of smartphones enables people to announce an emergency they’re observing in real-time. Because of this, more agencies (e.g., disaster relief organizations and news agencies) are interested in programmatically monitoring Twitter. But it’s not always clear whether a person’s words are actually announcing a disaster. That’s where the classifier will be useful.

Exploratory Data Analysis (EDA) — Inspect, Visualize and Clean the Data

First we need to import all python packages / functions (installing with pip any that are not already installed) that are required to clean the texts (from the tweets), to build the RNN models and for visualization. We shall use tensorflow / keras to train the deep learning models.

import numpy as np 
import pandas as pd 
import os, math

#for visualization
import matplotlib.pyplot as plt
import seaborn as sns
from wordcloud import WordCloud, STOPWORDS

#for text cleaning
import string, re
import nltk
nltk.download('punkt')
nltk.download('stopwords')
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

#for data analysis and modeling
import tensorflow as tf
import tensorflow_hub as hub
# !pip install tensorflow_text
import tensorflow_text 
from tensorflow.keras.preprocessing import text, sequence 
from tensorflow.keras.metrics import Recall
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import LSTM, GRU, Bidirectional, Dense, Embedding, Dropout
from tensorflow.keras.layers import TextVectorization
tf.__version__
# 2.12.0

from sklearn.model_selection import train_test_split
from sklearn.utils.class_weight import compute_class_weight

Read the train and test dataframes; the only columns that we shall use are text (to extract input features) and target (the output to predict).

df_train = pd.read_csv('nlp-getting-started/train.csv', index_col='id')
df_test = pd.read_csv('nlp-getting-started/test.csv', index_col='id')
df_train.head()
   keyword location                                               text  target
id
1      NaN      NaN  Our Deeds are the Reason of this #earthquake M…       1
4      NaN      NaN             Forest fire near La Ronge Sask. Canada       1
5      NaN      NaN    All residents asked to ‘shelter in place’ are …       1
6      NaN      NaN    13,000 people receive #wildfires evacuation or…       1
7      NaN      NaN    Just got sent this photo from Ruby #Alaska as …       1

There are around 7.6k tweets in the training and 3.2k tweets in the test dataset, respectively.

df_train.shape, df_test.shape
# ((7613, 4), (3263, 3))

The maximum number of words in a tweet is 31, for both the training and the test dataset.

max_len_train = max(df_train['text'].apply(lambda x: len(x.split())).values)
max_len_test = max(df_test['text'].apply(lambda x: len(x.split())).values)
max_len_train, max_len_test
# (31, 31)

The following plot shows the histogram of class labels, i.e., the number of positive (disaster) and negative (no disaster) tweets in the training dataset. As can be seen, the dataset is slightly imbalanced.

#df_train['target'] = df_train['target'].astype(str)
sns.displot(data=df_train, x='target', hue='target')
df_train['target'].value_counts()

0    4342
1    3271
Name: target, dtype: int64
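
To account for the slight imbalance, one option (a sketch, not used by default in the models below) is to compute balanced class weights with the compute_class_weight function imported earlier and pass them to model.fit():

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# balanced weights are inversely proportional to the class frequencies
class_weights = compute_class_weight(class_weight='balanced',
                                     classes=np.unique(df_train['target']),
                                     y=df_train['target'].values)
class_weight_dict = dict(enumerate(class_weights))
# class_weight_dict can then be passed to model.fit(..., class_weight=class_weight_dict)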

Now, let’s use the wordcloud library to find the most frequent words in disaster tweets and normal tweets. As we can see,

  • the top 10 most frequent words in disaster tweets (with class label 1) are: ‘fire’, ‘New’, ‘via’, ‘disaster’, ‘California’, ‘suicide’, ‘U’, ‘police’, ‘amp’, ‘people’
  • the top 10 most frequent words in the normal tweets (with class label 0) are: ‘new’, ‘amp’, ‘u’, ‘one’, ‘body’, ‘time’, ‘video’, ‘via’, ‘day’, ‘love’

def plot_wordcloud(text, title, k=10):
  # Create and Generate a Word Cloud Image
  wordcloud = WordCloud(width = 3000, height = 2000, random_state=1, background_color='black', colormap='Set2', collocations=False, stopwords = STOPWORDS).generate(text)
  # top k words
  plt.figure(figsize=(10,5))
  print(f'top {k} words: {list(wordcloud.words_.keys())[:k]}')
  ax = sns.barplot(x=0, y=1, data=pd.DataFrame(wordcloud.words_.items()).head(k))
  ax.set(xlabel = 'words', ylabel='count', title=title)
  plt.show()
  #Display the generated image
  plt.figure(figsize=(15,15))
  plt.imshow(wordcloud, interpolation="bilinear"), plt.title(title, size=20), plt.axis("off")
  plt.show()

plot_wordcloud(' '.join(df_train[df_train['target']==1]['text'].values), 'train words (label 1)')
plot_wordcloud(' '.join(df_train[df_train['target']==0]['text'].values), 'train words (label 0)')

top 10 words: ['fire', 'New', 'via', 'disaster', 'California', 'suicide', 'U', 'police', 'amp', 'people']

Preprocessing / Cleaning

Since the tweet texts are likely to contain many junk characters and very common, non-informative words (stopwords, e.g., ‘the’), it is a good idea to clean the text (with the function clean_text() shown below) and remove this unnecessary content before building the models, since otherwise it can hurt performance. It’s important that we apply the same preprocessing to both the training and the test tweets.

def clean_text(txt):
    """
    cleans the input text by applying the following steps:
    * replace contractions
    * remove punctuation
    * split into words
    * remove stopwords
    * remove leftover punctuations
    """
    contraction_dict = {"ain't": "is not", "aren't": "are not","can't": "cannot", "'cause": "because", 
                        "could've": "could have", "couldn't": "could not", "didn't": "did not",  
                        "doesn't": "does not", "don't": "do not", "hadn't": "had not", "hasn't": "has not", 
                        "haven't": "have not", "he'd": "he would","he'll": "he will", "he's": "he is", 
                        "you'd've": "you would have", "you'll": "you will", "you'll've": "you will have"}
    def _get_contractions(contraction_dict):
        contraction_re = re.compile('(%s)' % '|'.join(contraction_dict.keys()))
        return contraction_dict, contraction_re

    def replace_contractions(text):
        contractions, contractions_re = _get_contractions(contraction_dict)
        def replace(match):
            return contractions[match.group(0)]
        return contractions_re.sub(replace, text)

    # replace contractions
    txt = replace_contractions(txt)
    
    #remove punctuations
    txt  = "".join([char for char in txt if char not in string.punctuation])
    #remove numbers
    txt = re.sub('[0-9]+', '', txt)
    #txt = re.sub(r"[^A-Za-z0-9()!?\'\`\"]", " ", txt)
    txt = txt.lower() # lowercase
    txt = re.sub(r"\#", "", txt) # remove hashtag signs
    txt = re.sub(r"http\S+", "URL", txt) # replace URL addresses
    txt = re.sub(r"@", "", txt)
    txt = re.sub(r"\s{2,}", " ", txt) # collapse multiple contiguous spaces
    
    # split into words
    words = word_tokenize(txt)
    
    # remove stopwords
    stop_words = set(stopwords.words('english'))
    words = [w for w in words if not w in stop_words]
    
    # removing leftover punctuations
    words = [word for word in words if word.isalpha()]
    
    cleaned_text = ' '.join(words)
    return cleaned_text

# clean train and test tweets
df_train['text'] = df_train['text'].apply(lambda txt: clean_text(txt))
df_test['text'] = df_test['text'].apply(lambda txt: clean_text(txt))

df_train.head()

# CPU times: user 2.05 s, sys: 101 ms, total: 2.15 s
# Wall time: 2.16 s
   keyword location                                               text  target
id
1      NaN      NaN   Our Deeds Reason earthquake May ALLAH Forgive us       1
4      NaN      NaN              Forest fire near La Ronge Sask Canada       1
5      NaN      NaN    All residents asked shelter place notified off…       1
6      NaN      NaN    people receive wildfires evacuation orders Cal…       1
7      NaN      NaN    Just got sent photo Ruby Alaska smoke wildfire…       1

Model Architecture

We shall use multiple models, starting from LSTM/GRU/BiLSTM to BERT and USE.

LSTM / GRU

Let’s start with vanilla LSTM / GRU models. We need to start by tokenizing the texts, followed by padding the token sequences appropriately (so that all sequences have a fixed length, e.g., equal to max_len).

xtrain, xtest, ytrain, ytest = train_test_split(df_train['text'].values, df_train['target'].values, shuffle=True, test_size=0.2)

max_len = max(df_train['text'].apply(lambda x: len(x.split())).values)
max_words = 20000
tokenizer = text.Tokenizer(num_words = max_words)
# create the vocabulary by fitting on x_train text
tokenizer.fit_on_texts(xtrain)
# generate the sequence of tokens
xtrain_seq = tokenizer.texts_to_sequences(xtrain)
xtest_seq = tokenizer.texts_to_sequences(xtest)

# pad the sequences
xtrain_pad = sequence.pad_sequences(xtrain_seq, maxlen=max_len)
xtest_pad = sequence.pad_sequences(xtest_seq, maxlen=max_len)
word_index = tokenizer.word_index

print('text example:', xtrain[0])
print('sequence of indices(before padding):', xtrain_seq[0])
print('sequence of indices(after padding):', xtrain_pad[0])

# text example: Witness video shows car explode behind burning buildings nd St afternoon Manchester httptcocgmJlSEYLo via 
# MikeCroninWMUR
# sequence of indices(before padding): [17, 29, 37, 9]
# sequence of indices(after padding): [ 0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  0  17 29 37  9]

We shall first use a pretrained (semantic) embedding from the Global Vectors for Word Representation (GloVe) model (download the pretrained weights) and create a word-level embedding matrix, as shown below. Later we shall use the LSTM to train the embedding on our own.

#https://nlp.stanford.edu/projects/glove/
!wget https://nlp.stanford.edu/data/glove.6B.zip
!unzip g*zip
%%time
embedding_vectors = {}
with open('glove.6B.300d.txt','r',encoding='utf-8') as file: #glove.42B.300d.txt
    for row in file:
        values = row.split(' ')
        word = values[0]
        weights = np.asarray([float(val) for val in values[1:]])
        embedding_vectors[word] = weights
print(f"Size of vocabulary in GloVe: {len(embedding_vectors)}")  

# Size of vocabulary in GloVe: 400000
# CPU times: user 33.1 s, sys: 1.55 s, total: 34.7 s
# Wall time: 33.4 s
#initialize the embedding_matrix with zeros
emb_dim = 300
vocab_len = max_words if max_words is not None else len(word_index)+1
embedding_matrix = np.zeros((vocab_len, emb_dim))
oov_count = 0
oov_words = []
for word, idx in word_index.items():
    if idx < vocab_len:
        embedding_vector = embedding_vectors.get(word)
        if embedding_vector is not None:
            embedding_matrix[idx] = embedding_vector
        else:
            oov_count += 1 
            oov_words.append(word)
#print some of the out of vocabulary words
print(f'Some out of vocabulary words: {oov_words[0:5]}')
print(f'{oov_count} out of {vocab_len} words were OOV.')

# Some out of vocabulary words: []
# 0 out of 50 words were OOV.

Let’s create the model with an Embedding layer followed by an LSTM layer, and add a bunch of Dense layers on top. We shall first use the pretrained GloVe embeddings and then later build another model to train the embeddings from the data provided.

model_lstm = Sequential(name='model_lstm')
model_lstm.add(Embedding(vocab_len, emb_dim, trainable = False, weights=[embedding_matrix]))
#model_lstm.add(Embedding(vocab_len, emb_dim, trainable = True))
model_lstm.add(LSTM(64, activation='tanh', return_sequences=False))
model_lstm.add(Dense(128, activation='relu'))
#model_lstm.add(tf.keras.layers.BatchNormalization())
model_lstm.add(Dropout(0.2)) # Adding Dropout layer with rate of 0.2
model_lstm.add(Dense(256, activation='relu'))
model_lstm.add(Dense(128, activation='relu'))
model_lstm.add(Dense(64, activation='relu'))
model_lstm.add(Dense(1, activation='sigmoid'))
model_lstm.compile(loss='binary_crossentropy', optimizer='adam', metrics=[tf.keras.metrics.Recall(), tf.keras.metrics.AUC()])
model_lstm.summary()

Model: "model_lstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_3 (Embedding)     (None, None, 300)         6000000   
                                                                 
 lstm_2 (LSTM)               (None, 64)                93440     
                                                                 
 dense_7 (Dense)             (None, 128)               8320      
                                                                 
 dropout_3 (Dropout)         (None, 128)               0         
                                                                 
 dense_8 (Dense)             (None, 256)               33024     
                                                                 
 dense_9 (Dense)             (None, 128)               32896     
                                                                 
 dense_10 (Dense)            (None, 64)                8256      
                                                                 
 dense_11 (Dense)            (None, 1)                 65        
                                                                 
=================================================================
Total params: 6,176,001
Trainable params: 6,176,001
Non-trainable params: 0
_________________________________________________________________
None

Now, let’s create the model using GRU layer instead of LSTM, as shown in the following code snippet.

emb_dim = embedding_matrix.shape[1]
model_gru = Sequential(name='model_gru')
model_gru.add(Embedding(vocab_len, emb_dim, trainable = False, weights=[embedding_matrix]))
model_gru.add(GRU(128, return_sequences=False))
model_gru.add(Dropout(0.5))
model_gru.add(Dense(1, activation = 'sigmoid'))
model_gru.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model_gru.summary()

Model: "model_gru"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_4 (Embedding)     (None, None, 300)         6000000   
                                                                 
 gru_1 (GRU)                 (None, 128)               165120    
                                                                 
 dropout_4 (Dropout)         (None, 128)               0         
                                                                 
 dense_12 (Dense)            (None, 1)                 129       
                                                                 
=================================================================
Total params: 6,165,249
Trainable params: 165,249
Non-trainable params: 6,000,000
_________________________________________________________________
None

BiLSTM

Now, let’s create a Bidirectional LSTM model instead, this time using TextVectorization: a preprocessing layer which maps text features to integer sequences. Let’s create training and validation datasets for model evaluation, by applying the vectorizer on the tweet texts.

# Define Embedding layer as pre-processing layer for tokenization
max_features = 20000 #  20000 most frequent words in the input text data.

vectorizer = TextVectorization(max_tokens=max_features, output_sequence_length=200, output_mode='int') 
vectorizer.adapt(np.hstack((xtrain, xtest))) 
vectorized_text = vectorizer(xtrain)

dataset = tf.data.Dataset.from_tensor_slices((vectorized_text, ytrain))
dataset = dataset.cache()
dataset = dataset.shuffle(160000)
dataset = dataset.batch(32) 
dataset = dataset.prefetch(8)
batch_X, batch_y = dataset.as_numpy_iterator().next()

train = dataset.take(int(len(dataset)*.8))
val = dataset.skip(int(len(dataset)*.8)).take(int(len(dataset)*.2))

model_bilstm = Sequential(name='model_bilstm')
model_bilstm.add(Embedding(max_features + 1, 64))
model_bilstm.add(Bidirectional(LSTM(64, activation='tanh')))
model_bilstm.add(Dense(128, activation='relu'))
model_bilstm.add(Dropout(0.2)) # Adding Dropout layer with dropout rate of 0.2
model_bilstm.add(Dense(256, activation='relu'))
model_bilstm.add(Dense(128, activation='relu'))
model_bilstm.add(Dense(64, activation='relu'))
model_bilstm.add(Dense(1, activation='sigmoid'))
model_bilstm.compile(loss='BinaryCrossentropy', optimizer='Adam', metrics=[Recall()])
model_bilstm.summary()

Model: "model_bilstm"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding_1 (Embedding)     (None, None, 64)          1280064   
                                                                 
 bidirectional_1 (Bidirectio  (None, 128)              66048     
 nal)                                                            
                                                                 
 dense_5 (Dense)             (None, 128)               16512     
                                                                 
 dropout_1 (Dropout)         (None, 128)               0         
                                                                 
 dense_6 (Dense)             (None, 256)               33024     
                                                                 
 dense_7 (Dense)             (None, 128)               32896     
                                                                 
 dense_8 (Dense)             (None, 64)                8256      
                                                                 
 dense_9 (Dense)             (None, 1)                 65        
                                                                 
=================================================================
Total params: 1,436,865
Trainable params: 1,436,865
Non-trainable params: 0
_________________________________________________________________

BERT

Next, let’s use the Bidirectional Encoder Representations from Transformers (BERT) model for the text classification. The function get_BERT_model() uses the BERT model as a backbone, extracts the pooled_output layer and adds a couple of Dense layers (with a Dropout regularizer) on top of it, as shown in the next code snippet.

def get_BERT_model():
    # Preprocessing
    tfhub_handle_preprocess = 'https://tfhub.dev/tensorflow/bert_en_uncased_preprocess/3'
    # Bert encoder
    tfhub_handle_encoder = 'https://tfhub.dev/tensorflow/small_bert/bert_en_uncased_L-2_H-128_A-2/2'
    bert_preprocess_model = hub.KerasLayer(tfhub_handle_preprocess)
    bert_model = hub.KerasLayer(tfhub_handle_encoder)
    input_layer = tf.keras.layers.Input(shape=(), dtype=tf.string, name='tweets')
    x = bert_preprocess_model(input_layer)
    x = bert_model(x)['pooled_output']
    x = tf.keras.layers.Dropout(0.5)(x) #Optional, to eliminate overfitting
    x = tf.keras.layers.Dense(256, activation='relu')(x)
    classification_out = tf.keras.layers.Dense(1, activation='sigmoid', name='classifier')(x)
    bert_preprocess_model._name = "preprocess"
    bert_model._name = "bert_encoder"
    model_bert = tf.keras.Model(input_layer, classification_out)
    model_bert._name = "model_bert"
    return model_bert

model_bert = get_BERT_model()
model_bert.summary()

Model: "model_bert"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 tweets (InputLayer)            [(None,)]            0           []                               
                                                                                                  
 preprocess (KerasLayer)        {'input_type_ids':   0           ['tweets[0][0]']                 
                                (None, 128),                                                      
                                 'input_mask': (Non                                               
                                e, 128),                                                          
                                 'input_word_ids':                                                
                                (None, 128)}                                                      
                                                                                                  
 bert_encoder (KerasLayer)      {'pooled_output': (  4385921     ['preprocess[0][0]',             
                                None, 128),                       'preprocess[0][1]',             
                                 'sequence_output':               'preprocess[0][2]']             
                                 (None, 128, 128),                                                
                                 'encoder_outputs':                                               
                                 [(None, 128, 128),                                               
                                 (None, 128, 128)],                                               
                                 'default': (None,                                                
                                128)}                                                             
                                                                                                  
 dropout_1 (Dropout)            (None, 128)          0           ['bert_encoder[0][3]']           
                                                                                                  
 dense_1 (Dense)                (None, 256)          33024       ['dropout_1[0][0]']              
                                                                                                  
 classifier (Dense)             (None, 1)            257         ['dense_1[0][0]']                
                                                                                                  
==================================================================================================
Total params: 4,419,202
Trainable params: 33,281
Non-trainable params: 4,385,921
__________________________________________________________________________________________________

Universal Sentence Encoder Model (USE)

Finally, we shall use the Universal Sentence Encoder to obtain sentence level embedding, along with our regular Dense layers to create a binary text classification model.

sentence_encoder_layer = hub.KerasLayer("https://tfhub.dev/google/universal-sentence-encoder/4",
                                        input_shape=[],  # shape of inputs coming to our model
                                        dtype=tf.string,  # data type of inputs coming to the USE layer
                                        trainable=False,
                                        # keep the pretrained weights (we'll create a feature extractor)
                                        name="USE")

model_use = tf.keras.Sequential([
    sentence_encoder_layer,
    tf.keras.layers.Dropout(0.1),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid")
], name = 'transfer_mode')
model_use.summary()

Model: "transfer_mode"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 USE (KerasLayer)            (None, 512)               256797824 
                                                                 
 dropout_9 (Dropout)         (None, 512)               0         
                                                                 
 dense_13 (Dense)            (None, 16)                8208      
                                                                 
 dense_14 (Dense)            (None, 16)                272       
                                                                 
 dense_15 (Dense)            (None, 1)                 17        
                                                                 
=================================================================
Total params: 256,806,321
Trainable params: 8,497
Non-trainable params: 256,797,824
_________________________________________________________________
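
Both model_bert and model_use consume raw tweet strings, so they can be compiled and fit directly on the text splits from earlier. A minimal sketch for model_use (an assumption mirroring the setup used for the other models, relying on the Recall import from earlier, with the epoch count chosen arbitrarily):

model_use.compile(loss='binary_crossentropy', optimizer='adam',
                  metrics=['accuracy', Recall()])
history_use = model_use.fit(xtrain, np.asarray(ytrain),
                            validation_data=(xtest, np.asarray(ytest)),
                            epochs=10, batch_size=32)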

Results and Analysis

Let’s now fit the models on the training dataset and compare their performance (in terms of accuracy, recall and ROC AUC) on the held-out validation dataset. The metric recall is more important than precision / accuracy here because we would like our model to capture as many of the true disaster tweets as possible.
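
A minimal sketch of how these metrics can be computed with scikit-learn on the held-out split (assuming a fitted model, here model_lstm, and the padded validation data from earlier):

from sklearn.metrics import accuracy_score, recall_score, roc_auc_score

y_prob = model_lstm.predict(xtest_pad).ravel()   # predicted probabilities
y_pred = (y_prob > 0.5).astype(int)              # thresholded class labels
print('accuracy:', accuracy_score(ytest, y_pred))
print('recall  :', recall_score(ytest, y_pred))
print('ROC AUC :', roc_auc_score(ytest, y_prob))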

LSTM / GRU

The LSTM model was trained for up to 50 epochs (the first 10 epochs of one run are shown below) and the accuracy did not seem to improve over time (obtaining ~66% accuracy on validation).

Hyperparameter Tuning

  • The number of LSTM units and the batch size were varied to see the impact on performance, but the model performed almost the same.
  • The model was first trained with the pretrained GloVe Embedding layer and then with an Embedding layer learned from the data, but the accuracies did not improve much.

# model_lstm.add(Embedding(vocab_len, emb_dim, trainable = False, weights=[embedding_matrix]))
# with pretrained GloVe weights
%%time
batch_size = 32
epochs  = 50
history = model_lstm.fit(xtrain_pad, np.asarray(ytrain), validation_data=(xtest_pad, np.asarray(ytest)), batch_size = batch_size, epochs = epochs)

Epoch 1/10
24/24 [==============================] - 9s 31ms/step - loss: 0.6367 - accuracy: 0.6448 - val_loss: 0.6179 - val_accuracy: 0.6586
Epoch 2/10
24/24 [==============================] - 0s 9ms/step - loss: 0.6084 - accuracy: 0.6727 - val_loss: 0.6110 - val_accuracy: 0.6579
Epoch 3/10
24/24 [==============================] - 0s 8ms/step - loss: 0.5995 - accuracy: 0.6757 - val_loss: 0.6132 - val_accuracy: 0.6586
Epoch 4/10
24/24 [==============================] - 0s 8ms/step - loss: 0.5980 - accuracy: 0.6749 - val_loss: 0.6093 - val_accuracy: 0.6573
Epoch 5/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5944 - accuracy: 0.6780 - val_loss: 0.6093 - val_accuracy: 0.6573
Epoch 6/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5907 - accuracy: 0.6777 - val_loss: 0.6089 - val_accuracy: 0.6586
Epoch 7/10
24/24 [==============================] - 0s 12ms/step - loss: 0.5899 - accuracy: 0.6793 - val_loss: 0.6106 - val_accuracy: 0.6559
Epoch 8/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5907 - accuracy: 0.6778 - val_loss: 0.6111 - val_accuracy: 0.6632
Epoch 9/10
24/24 [==============================] - 0s 11ms/step - loss: 0.5876 - accuracy: 0.6841 - val_loss: 0.6121 - val_accuracy: 0.6619
Epoch 10/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5870 - accuracy: 0.6851 - val_loss: 0.6101 - val_accuracy: 0.6619
CPU times: user 5.78 s, sys: 719 ms, total: 6.5 s
Wall time: 12.4 s
# model_lstm.add(Embedding(vocab_len, emb_dim, trainable = True))
# learning the embedding layer weights
%%time
batch_size = 32
epochs  = 50
history = model_lstm.fit(xtrain_pad, np.asarray(ytrain), validation_data=(xtest_pad, np.asarray(ytest)), batch_size = batch_size, epochs = epochs)

Epoch 1/50
191/191 [==============================] - 8s 22ms/step - loss: 0.6320 - recall: 0.2882 - auc_11: 0.6650 - val_loss: 0.6123 - val_recall: 0.3474 - val_auc_11: 0.6984
Epoch 2/50
191/191 [==============================] - 2s 9ms/step - loss: 0.6052 - recall: 0.3603 - auc_11: 0.7016 - val_loss: 0.6170 - val_recall: 0.3444 - val_auc_11: 0.6962
Epoch 3/50
191/191 [==============================] - 2s 8ms/step - loss: 0.6030 - recall: 0.3665 - auc_11: 0.7073 - val_loss: 0.6135 - val_recall: 0.3068 - val_auc_11: 0.6978
Epoch 4/50
191/191 [==============================] - 2s 9ms/step - loss: 0.6002 - recall: 0.3496 - auc_11: 0.7048 - val_loss: 0.6307 - val_recall: 0.3053 - val_auc_11: 0.6973
Epoch 5/50
191/191 [==============================] - 2s 12ms/step - loss: 0.6022 - recall: 0.3546 - auc_11: 0.7090 - val_loss: 0.6123 - val_recall: 0.3323 - val_auc_11: 0.6946
Epoch 6/50
191/191 [==============================] - 2s 9ms/step - loss: 0.5945 - recall: 0.3538 - auc_11: 0.7112 - val_loss: 0.6161 - val_recall: 0.3083 - val_auc_11: 0.6945
Epoch 7/50
191/191 [==============================] - 2s 9ms/step - loss: 0.5941 - recall: 0.3431 - auc_11: 0.7093 - val_loss: 0.6156 - val_recall: 0.3098 - val_auc_11: 0.6967
Epoch 8/50
191/191 [==============================] - 2s 8ms/step - loss: 0.5909 - recall: 0.3538 - auc_11: 0.7182 - val_loss: 0.6181 - val_recall: 0.3053 - val_auc_11: 0.6907
Epoch 9/50
191/191 [==============================] - 2s 9ms/step - loss: 0.5889 - recall: 0.3488 - auc_11: 0.7188 - val_loss: 0.6218 - val_recall: 0.2707 - val_auc_11: 0.6935
Epoch 10/50
191/191 [==============================] - 2s 8ms/step - loss: 0.5882 - recall: 0.3480 - auc_11: 0.7221 - val_loss: 0.6279 - val_recall: 0.3519 - val_auc_11: 0.6812
Epoch 11/50
191/191 [==============================] - 2s 9ms/step - loss: 0.5859 - recall: 0.3780 - auc_11: 0.7240 - val_loss: 0.6179 - val_recall: 0.3459 - val_auc_11: 0.7008
Epoch 12/50
191/191 [==============================] - 2s 12ms/step - loss: 0.5832 - recall: 0.3768 - auc_11: 0.7267 - val_loss: 0.6176 - val_recall: 0.2917 - val_auc_11: 0.6871

The GRU model was trained for up to 50 epochs (10 epochs are shown below) and the accuracy did not seem to improve over time either (obtaining ~67% accuracy on validation).

batch_size = 32
epochs  = 50
history = model_gru.fit(xtrain_pad, np.asarray(ytrain), validation_data=(xtest_pad, np.asarray(ytest)), batch_size = batch_size, epochs = epochs)

Epoch 1/10
24/24 [==============================] - 4s 27ms/step - loss: 0.6316 - accuracy: 0.6466 - val_loss: 0.6128 - val_accuracy: 0.6586
Epoch 2/10
24/24 [==============================] - 0s 11ms/step - loss: 0.6050 - accuracy: 0.6708 - val_loss: 0.6150 - val_accuracy: 0.6592
Epoch 3/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5999 - accuracy: 0.6744 - val_loss: 0.6110 - val_accuracy: 0.6586
Epoch 4/10
24/24 [==============================] - 0s 8ms/step - loss: 0.5977 - accuracy: 0.6750 - val_loss: 0.6109 - val_accuracy: 0.6559
Epoch 5/10
24/24 [==============================] - 0s 8ms/step - loss: 0.5968 - accuracy: 0.6745 - val_loss: 0.6103 - val_accuracy: 0.6691
Epoch 6/10
24/24 [==============================] - 0s 8ms/step - loss: 0.5925 - accuracy: 0.6785 - val_loss: 0.6086 - val_accuracy: 0.6592
Epoch 7/10
24/24 [==============================] - 0s 7ms/step - loss: 0.5918 - accuracy: 0.6826 - val_loss: 0.6125 - val_accuracy: 0.6592
Epoch 8/10
24/24 [==============================] - 0s 7ms/step - loss: 0.5907 - accuracy: 0.6801 - val_loss: 0.6103 - val_accuracy: 0.6586
Epoch 9/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5884 - accuracy: 0.6824 - val_loss: 0.6111 - val_accuracy: 0.6566
Epoch 10/10
24/24 [==============================] - 0s 10ms/step - loss: 0.5838 - accuracy: 0.6880 - val_loss: 0.6120 - val_accuracy: 0.6625

BiLSTM

This model was trained with a TextVectorization preprocessing layer. This time recall was used as the evaluation metric. The model performed quite well and achieved over 98% validation recall, as shown in the next figure too. It is the second best performing model (in terms of public score) on the unseen test dataset.

hist= model_bilstm.fit(train, epochs=30, batch_size=32, validation_data=val)

Epoch 1/30
166/166 [==============================] - 29s 101ms/step - loss: 0.5638 - recall_1: 0.4101 - val_loss: 0.3570 - val_recall_1: 0.7625
Epoch 2/30
166/166 [==============================] - 6s 34ms/step - loss: 0.3524 - recall_1: 0.7582 - val_loss: 0.2676 - val_recall_1: 0.8138
Epoch 3/30
166/166 [==============================] - 5s 30ms/step - loss: 0.2598 - recall_1: 0.8562 - val_loss: 0.1658 - val_recall_1: 0.9122
Epoch 4/30
166/166 [==============================] - 4s 26ms/step - loss: 0.1861 - recall_1: 0.9017 - val_loss: 0.1183 - val_recall_1: 0.9609
Epoch 5/30
166/166 [==============================] - 4s 23ms/step - loss: 0.1278 - recall_1: 0.9400 - val_loss: 0.0879 - val_recall_1: 0.9745
Epoch 6/30
166/166 [==============================] - 3s 19ms/step - loss: 0.0929 - recall_1: 0.9624 - val_loss: 0.0485 - val_recall_1: 0.9685
Epoch 7/30
166/166 [==============================] - 4s 27ms/step - loss: 0.0659 - recall_1: 0.9642 - val_loss: 0.0504 - val_recall_1: 0.9788
Epoch 8/30
166/166 [==============================] - 4s 21ms/step - loss: 0.0637 - recall_1: 0.9782 - val_loss: 0.0270 - val_recall_1: 0.9825
Epoch 9/30
166/166 [==============================] - 6s 36ms/step - loss: 0.0412 - recall_1: 0.9783 - val_loss: 0.0281 - val_recall_1: 0.9876
Epoch 10/30
166/166 [==============================] - 5s 29ms/step - loss: 0.0373 - recall_1: 0.9792 - val_loss: 0.0285 - val_recall_1: 0.9729
Epoch 11/30
166/166 [==============================] - 3s 19ms/step - loss: 0.0321 - recall_1: 0.9840 - val_loss: 0.0322 - val_recall_1: 0.9985
Epoch 12/30
166/166 [==============================] - 4s 22ms/step - loss: 0.0345 - recall_1: 0.9813 - val_loss: 0.0258 - val_recall_1: 0.9865
Epoch 13/30
166/166 [==============================] - 3s 20ms/step - loss: 0.0346 - recall_1: 0.9792 - val_loss: 0.0230 - val_recall_1: 0.9817
Epoch 14/30
166/166 [==============================] - 3s 19ms/step - loss: 0.0343 - recall_1: 0.9835 - val_loss: 0.0236 - val_recall_1: 0.9827
Epoch 15/30
166/166 [==============================] - 4s 24ms/step - loss: 0.0270 - recall_1: 0.9804 - val_loss: 0.0182 - val_recall_1: 0.9893
Epoch 16/30
166/166 [==============================] - 4s 27ms/step - loss: 0.0217 - recall_1: 0.9885 - val_loss: 0.0206 - val_recall_1: 0.9952
Epoch 17/30
166/166 [==============================] - 3s 19ms/step - loss: 0.0228 - recall_1: 0.9788 - val_loss: 0.0125 - val_recall_1: 0.9877
Epoch 18/30
166/166 [==============================] - 4s 22ms/step - loss: 0.0228 - recall_1: 0.9802 - val_loss: 0.0326 - val_recall_1: 0.9806
Epoch 19/30
166/166 [==============================] - 3s 17ms/step - loss: 0.0270 - recall_1: 0.9793 - val_loss: 0.0310 - val_recall_1: 0.9760
Epoch 20/30
166/166 [==============================] - 3s 18ms/step - loss: 0.0265 - recall_1: 0.9832 - val_loss: 0.0243 - val_recall_1: 0.9749
Epoch 21/30
166/166 [==============================] - 4s 21ms/step - loss: 0.0228 - recall_1: 0.9818 - val_loss: 0.0262 - val_recall_1: 0.9686
Epoch 22/30
166/166 [==============================] - 4s 24ms/step - loss: 0.0298 - recall_1: 0.9803 - val_loss: 0.0123 - val_recall_1: 0.9923
Epoch 23/30
166/166 [==============================] - 5s 33ms/step - loss: 0.0179 - recall_1: 0.9845 - val_loss: 0.0185 - val_recall_1: 0.9788
Epoch 24/30
166/166 [==============================] - 5s 30ms/step - loss: 0.0181 - recall_1: 0.9830 - val_loss: 0.0138 - val_recall_1: 0.9875
Epoch 25/30
166/166 [==============================] - 6s 36ms/step - loss: 0.0186 - recall_1: 0.9844 - val_loss: 0.0158 - val_recall_1: 0.9859
Epoch 26/30
166/166 [==============================] - 4s 22ms/step - loss: 0.0168 - recall_1: 0.9833 - val_loss: 0.0180 - val_recall_1: 0.9866
Epoch 27/30
166/166 [==============================] - 3s 18ms/step - loss: 0.0261 - recall_1: 0.9830 - val_loss: 0.0251 - val_recall_1: 0.9794
Epoch 28/30
166/166 [==============================] - 3s 19ms/step - loss: 0.0125 - recall_1: 0.9872 - val_loss: 0.0165 - val_recall_1: 0.9761
Epoch 29/30
166/166 [==============================] - 3s 20ms/step - loss: 0.0130 - recall_1: 0.9859 - val_loss: 0.0098 - val_recall_1: 0.9844
Epoch 30/30
166/166 [==============================] - 3s 18ms/step - loss: 0.0164 - recall_1: 0.9849 - val_loss: 0.0130 - val_recall_1: 0.9865
plt.figure(figsize=(10, 6))
pd.DataFrame(hist.history).plot()
plt.show()

Predictions

Before computing the prediction, we need to preprocess the test tweets by applying TextVectorization.

vectorized_test_text = vectorizer(X_test)
preds = []
for input_text in vectorized_test_text:
    # predict one tweet at a time (expand to a batch of size 1)
    pred = model_bilstm.predict(np.expand_dims(input_text, 0))
    preds.append(pred)

preds = np.round(np.array(preds))
sub_sample = pd.read_csv('sample_submission.csv')
sub_sample['target'] = preds.flatten()
sub_sample['target'] = sub_sample['target'].astype('int')
sub_sample.to_csv('submission.csv', index=False)
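Note that predicting one tweet at a time in a loop is slow; since the vectorized test tweets already form a single batch tensor, an equivalent (and much faster) single batched call could be used instead, e.g.:

# equivalent batched prediction (a sketch)
preds = np.round(model_bilstm.predict(vectorized_test_text)).astype(int).flatten()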

BERT

Since the training data is a little imbalanced, we shall compute the class weights and use them in the loss function to compensate for the imbalance.

class_weights = compute_class_weight(class_weight = "balanced", 
                                     classes = np.unique(df_train["target"]),
                                     y= df_train["target"])
class_weights = {k:class_weights[k] for k in np.unique(df_train["target"])}
class_weights
# {0: 0.8766697374481806, 1: 1.1637114032405993}
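The returned weights follow the standard balanced formula w_c = n_samples / (n_classes * n_c). A quick sanity check is shown below (the per-class counts are inferred from the printed weights, so treat them as illustrative):

# sanity-check the balanced class-weight formula by hand
counts = df_train['target'].value_counts()  # roughly 0: 4342, 1: 3271 (7613 tweets)
n, k = len(df_train), df_train['target'].nunique()
print({c: n / (k * counts[c]) for c in sorted(counts.index)})
# {0: 0.8766697374481806, 1: 1.1637114032405993} -- matches compute_class_weight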

The model was trained for 20 epochs with the Adam optimizer and the weighted BCE loss function. We could swap in AdamW or SGD instead and observe the effect during hyperparameter tuning. This model turned out to be a close competitor of the BiLSTM model above, in terms of the public score obtained on the unseen test data.

epochs = 20
batch_size = 32 

optimizer = tf.keras.optimizers.Adam(learning_rate=0.001) 

loss = tf.keras.losses.BinaryCrossentropy(from_logits=False) # from_logits=False: the model output already goes through a sigmoid
metrics = [tf.keras.metrics.BinaryAccuracy(), tf.keras.metrics.AUC()]

model_bert.get_layer('bert_encoder').trainable = True # fine-tune the BERT encoder weights as well

model_bert.compile(optimizer=optimizer, loss=loss, metrics=metrics)

train_data = df_train.sample(frac=0.8, random_state=200)
valid_data = df_train.drop(train_data.index)

# NOTE: fitting on the full df_train means valid_data overlaps the training
# set, so the validation scores below are optimistic; passing
# x=train_data.text.values, y=train_data.target.values would keep the
# validation split properly held out
history = model_bert.fit(x=df_train.text.values, 
                         y=df_train.target.values,
                         class_weight=class_weights,
                         epochs=epochs,
                         batch_size=batch_size,
                         validation_data=(valid_data.text.values, valid_data.target.values))

Epoch 1/20
238/238 [==============================] - 58s 206ms/step - loss: 0.5457 - binary_accuracy: 0.7416 - auc_1: 0.7982 - val_loss: 0.3832 - val_binary_accuracy: 0.8549 - val_auc_1: 0.9162
Epoch 2/20
238/238 [==============================] - 31s 130ms/step - loss: 0.4084 - binary_accuracy: 0.8330 - auc_1: 0.8898 - val_loss: 0.2670 - val_binary_accuracy: 0.9009 - val_auc_1: 0.9514
Epoch 3/20
238/238 [==============================] - 28s 120ms/step - loss: 0.3271 - binary_accuracy: 0.8795 - auc_1: 0.9269 - val_loss: 0.2485 - val_binary_accuracy: 0.9284 - val_auc_1: 0.9711
Epoch 4/20
238/238 [==============================] - 27s 113ms/step - loss: 0.2649 - binary_accuracy: 0.9087 - auc_1: 0.9500 - val_loss: 0.1660 - val_binary_accuracy: 0.9462 - val_auc_1: 0.9828
Epoch 5/20
238/238 [==============================] - 27s 114ms/step - loss: 0.2208 - binary_accuracy: 0.9237 - auc_1: 0.9656 - val_loss: 0.1767 - val_binary_accuracy: 0.9409 - val_auc_1: 0.9879
Epoch 6/20
238/238 [==============================] - 28s 119ms/step - loss: 0.2083 - binary_accuracy: 0.9324 - auc_1: 0.9681 - val_loss: 0.2900 - val_binary_accuracy: 0.9022 - val_auc_1: 0.9539
Epoch 7/20
238/238 [==============================] - 28s 118ms/step - loss: 0.2453 - binary_accuracy: 0.9216 - auc_1: 0.9527 - val_loss: 0.1693 - val_binary_accuracy: 0.9468 - val_auc_1: 0.9787
Epoch 8/20
238/238 [==============================] - 28s 119ms/step - loss: 0.2195 - binary_accuracy: 0.9236 - auc_1: 0.9669 - val_loss: 0.1254 - val_binary_accuracy: 0.9560 - val_auc_1: 0.9886
Epoch 9/20
238/238 [==============================] - 27s 114ms/step - loss: 0.1598 - binary_accuracy: 0.9430 - auc_1: 0.9825 - val_loss: 0.1068 - val_binary_accuracy: 0.9639 - val_auc_1: 0.9916
Epoch 10/20
238/238 [==============================] - 26s 108ms/step - loss: 0.1517 - binary_accuracy: 0.9486 - auc_1: 0.9837 - val_loss: 0.1094 - val_binary_accuracy: 0.9586 - val_auc_1: 0.9956
Epoch 11/20
238/238 [==============================] - 26s 107ms/step - loss: 0.1286 - binary_accuracy: 0.9546 - auc_1: 0.9877 - val_loss: 0.0837 - val_binary_accuracy: 0.9645 - val_auc_1: 0.9955
Epoch 12/20
238/238 [==============================] - 27s 114ms/step - loss: 0.1186 - binary_accuracy: 0.9546 - auc_1: 0.9902 - val_loss: 0.1023 - val_binary_accuracy: 0.9645 - val_auc_1: 0.9945
Epoch 13/20
238/238 [==============================] - 28s 116ms/step - loss: 0.1343 - binary_accuracy: 0.9526 - auc_1: 0.9874 - val_loss: 0.0924 - val_binary_accuracy: 0.9652 - val_auc_1: 0.9944
Epoch 14/20
238/238 [==============================] - 28s 117ms/step - loss: 0.1149 - binary_accuracy: 0.9559 - auc_1: 0.9906 - val_loss: 0.1000 - val_binary_accuracy: 0.9599 - val_auc_1: 0.9934
Epoch 15/20
238/238 [==============================] - 26s 111ms/step - loss: 0.1268 - binary_accuracy: 0.9532 - auc_1: 0.9885 - val_loss: 0.0943 - val_binary_accuracy: 0.9639 - val_auc_1: 0.9933
Epoch 16/20
238/238 [==============================] - 26s 108ms/step - loss: 0.1181 - binary_accuracy: 0.9611 - auc_1: 0.9890 - val_loss: 0.0958 - val_binary_accuracy: 0.9659 - val_auc_1: 0.9955
Epoch 17/20
238/238 [==============================] - 27s 112ms/step - loss: 0.1756 - binary_accuracy: 0.9408 - auc_1: 0.9779 - val_loss: 0.1498 - val_binary_accuracy: 0.9488 - val_auc_1: 0.9859
Epoch 18/20
238/238 [==============================] - 27s 114ms/step - loss: 0.1439 - binary_accuracy: 0.9502 - auc_1: 0.9839 - val_loss: 0.0994 - val_binary_accuracy: 0.9613 - val_auc_1: 0.9932
Epoch 19/20
238/238 [==============================] - 25s 105ms/step - loss: 0.1324 - binary_accuracy: 0.9505 - auc_1: 0.9880 - val_loss: 0.0963 - val_binary_accuracy: 0.9639 - val_auc_1: 0.9934
Epoch 20/20
238/238 [==============================] - 26s 110ms/step - loss: 0.1267 - binary_accuracy: 0.9546 - auc_1: 0.9875 - val_loss: 0.1058 - val_binary_accuracy: 0.9606 - val_auc_1: 0.9926
def plot_hist(hist):
    '''
    Plots the training / validation loss and accuracy given the training history
    '''
    plt.plot(hist.history["binary_accuracy"])
    plt.plot(hist.history['val_binary_accuracy'])
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.title("Model Evaluation")
    plt.ylabel("Accuracy")
    plt.xlabel("Epoch")
    plt.legend(["Accuracy","Validation Accuracy","Loss","Validation Loss"])
    plt.grid()
    plt.show()

plot_hist(history)

Prediction on the test dataset

X_test = df_test["text"].values
predictions_prob = model_bert.predict(X_test)
predictions = tf.round(predictions_prob)
submission = pd.read_csv('nlp-getting-started/sample_submission.csv')
submission['target'] = predictions
submission['target'] =submission['target'].astype(int)
submission['id'] = df_test.index
submission.to_csv('submission2.csv', index=False)
submission.head()

102/102 [==============================] - 7s 60ms/step
   id  target
0   0       0
1   2       1
2   3       1
3   9       1
4  11       1

Model USE

Finally, the Universal Sentence Encoder (USE) model was trained; it outperformed all the other models and obtained more than 80% public score on Kaggle on the test dataset.

X, y = df_train['text'].values, df_train['target'].values
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=42)
X.shape, y.shape
# ((7613,), (7613,))

model_use.compile(loss = tf.keras.losses.BinaryCrossentropy(),
             optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
             metrics=['accuracy',tf.keras.metrics.AUC()])
%%time
history = model_use.fit(X_train, y_train, epochs = 10, validation_data=(X_val, y_val))

Epoch 1/10
179/179 [==============================] - 8s 30ms/step - loss: 0.5605 - accuracy: 0.7578 - auc_10: 0.8121 - val_loss: 0.4440 - val_accuracy: 0.8078 - val_auc_10: 0.8804
Epoch 2/10
179/179 [==============================] - 3s 15ms/step - loss: 0.4336 - accuracy: 0.8073 - auc_10: 0.8754 - val_loss: 0.4184 - val_accuracy: 0.8157 - val_auc_10: 0.8838
Epoch 3/10
179/179 [==============================] - 3s 15ms/step - loss: 0.4150 - accuracy: 0.8164 - auc_10: 0.8840 - val_loss: 0.4131 - val_accuracy: 0.8214 - val_auc_10: 0.8848
Epoch 4/10
179/179 [==============================] - 3s 15ms/step - loss: 0.4053 - accuracy: 0.8199 - auc_10: 0.8889 - val_loss: 0.4117 - val_accuracy: 0.8193 - val_auc_10: 0.8852
Epoch 5/10
179/179 [==============================] - 4s 22ms/step - loss: 0.3997 - accuracy: 0.8247 - auc_10: 0.8912 - val_loss: 0.4109 - val_accuracy: 0.8193 - val_auc_10: 0.8856
Epoch 6/10
179/179 [==============================] - 4s 24ms/step - loss: 0.3900 - accuracy: 0.8280 - auc_10: 0.8959 - val_loss: 0.4137 - val_accuracy: 0.8199 - val_auc_10: 0.8837
Epoch 7/10
179/179 [==============================] - 3s 15ms/step - loss: 0.3848 - accuracy: 0.8339 - auc_10: 0.8983 - val_loss: 0.4108 - val_accuracy: 0.8246 - val_auc_10: 0.8858
Epoch 8/10
179/179 [==============================] - 3s 15ms/step - loss: 0.3800 - accuracy: 0.8353 - auc_10: 0.9013 - val_loss: 0.4092 - val_accuracy: 0.8214 - val_auc_10: 0.8846
Epoch 9/10
179/179 [==============================] - 3s 15ms/step - loss: 0.3751 - accuracy: 0.8396 - auc_10: 0.9036 - val_loss: 0.4129 - val_accuracy: 0.8220 - val_auc_10: 0.8835
Epoch 10/10
179/179 [==============================] - 4s 21ms/step - loss: 0.3704 - accuracy: 0.8399 - auc_10: 0.9063 - val_loss: 0.4135 - val_accuracy: 0.8204 - val_auc_10: 0.8838
CPU times: user 41.2 s, sys: 3.48 s, total: 44.7 s
Wall time: 36.5 s
def plot_hist(hist):
    '''
    Plots the training / validation loss and accuracy given the training history
    '''
    plt.plot(hist.history["accuracy"])
    plt.plot(hist.history['val_accuracy'])
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.title("Model Evaluation")
    plt.ylabel("Accuracy")
    plt.xlabel("Epoch")
    plt.legend(["Accuracy","Validation Accuracy","Loss","Validation Loss"])
    plt.grid()
    plt.show()
    
plot_hist(history)

Prediction and Submission to Kaggle

X_test = df_test['text'].values
predictions_prob = model_use.predict(X_test)
predictions = tf.round(predictions_prob).numpy().flatten() # threshold the probabilities at 0.5

102/102 [==============================] - 1s 10ms/step
submission = pd.read_csv('nlp-getting-started/sample_submission.csv')
submission['target'] = predictions.astype(int)
submission['id'] = df_test.index
submission.to_csv('submission.csv', index=False)
submission.head()

Conclusion

The sentence-level embedding (USE) model performed the best on the test data (Kaggle public score ~81.1%), whereas the BiLSTM and BERT models did decent jobs. Surprisingly, the USE model performed pretty well without any preprocessing. Training BERT for longer may improve the accuracy of the transformer on the test dataset. The next screenshots show the Kaggle public scores obtained for the different submissions; the leaderboard position for the best submission is 265, as of now.

Histopathologic Cancer Detection with CNN – VGG16/ VGG19/ ResNet50

This problem appeared in a project in the coursera course Introduction to Deep Learning (by the University of Colorado Boulder) and is taken from a past Kaggle competition.

Brief description of the problem and data

In this mini-project, we shall use binary classification to classify an image into cancerous (class label 1) or benign (class label 0), i.e., to identify metastatic cancer in small image patches taken from larger digital pathology scans (this problem is taken from a past Kaggle competition).

PCam packs the clinically-relevant task of metastasis detection into a straightforward binary image classification task, which the CNN model performs. The CNN model is trained on a Colab GPU on a given set of images and then predicts / detects tumors in a set of unseen test images.

The dataset provides a large number of small pathology images to classify. Files are named with an image id. The train_labels.csv file provides the ground truth for the images in the train folder. The goal is to predict the labels for the images in the test folder. A positive label indicates that the center 32×32 px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label.

Exploratory Data Analysis (EDA)

First we need to import all python packages / functions that are required for building the CNN model. We shall use tensorflow / keras to train the deep learning model.

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
import seaborn as sns
import os

import tensorflow as tf
import tensorflow.keras as keras
from keras.layers import Conv2D, MaxPool2D, BatchNormalization, Flatten, Dropout, Dense, Activation
from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from tensorflow.keras.optimizers import Adam

import visualkeras
from tifffile import imread
from skimage import draw

As can be seen, the numbers of images in the training folder (the ones with ground-truth labels) and the test folder (without ground-truth labels) are around 220k and 57k, respectively.

# count number of training and test images
base_dir = 'histopathologic-cancer-detection'
train_dir, test_dir = f'{base_dir}/train/', f'{base_dir}/test/'
ntrain, ntest = len(os.listdir(train_dir)), len(os.listdir(test_dir))
print(f'#training images = {ntrain}, #test images={ntest}')
#training images = 220025, #test images=57458


The train_labels.csv file is loaded as a pandas DataFrame (the first few rows are shown below). It contains the ids of the tif image files from the training folder, along with their ground-truth labels. Let's store the file names in the id column instead, by appending the .tif extension to the id values; this will turn out to be useful later when reading the files from the folder automatically.

# read the training images and ground truth labels into a dataframe
train_df = pd.read_csv(f'{base_dir}/train_labels.csv')
train_df['label'] = train_df['label'].astype(str)
train_df['id'] = train_df['id'] + '.tif'
train_df.head()

                                             id label
0  f38a6374c348f90b587e046aac6079959adf3835.tif     0
1  c18f2d887b7ae4f6742ee445113fa1aef383ed77.tif     1
2  755db6279dae599ebb4d39a9123cce439965282d.tif     0
3  bc3f0c64fb968ff4a8bd33af6971ecae77c75e08.tif     0
4  068aba587a4950175d04c680d38943fd488d6a9d.tif     0

The images are RGB color images of shape 96×96, as seen from the next code snippet; the images are loaded using the imread() function from tifffile.

imread(f'{train_dir}/{train_df.id[0]}').shape
# (96, 96, 3)
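Since only the center 32×32 region of a patch determines its label, a tiny helper like the following (a sketch; the function name center_crop is ours) can be used to extract that region for closer inspection:

def center_crop(im, crop=32):
    # crop out the label-determining center crop x crop region of a patch
    h, w = im.shape[:2]
    y0, x0 = (h - crop) // 2, (w - crop) // 2
    return im[y0:y0 + crop, x0:x0 + crop]

center_crop(imread(f'{train_dir}/{train_df.id[0]}')).shape
# (32, 32, 3)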

Also, as can be seen from the plot below, there are around 130k benign and 89k cancerous images in the training dataset, so the dataset is not highly imbalanced; the data size is also large. Hence we are not using augmentation-like preprocessing steps.

 ## histogram of class labels
sns.displot(data=train_df, x='label', hue='label')
train_df['label'] = train_df['label'].astype(int)
train_df['label'].value_counts()

#0    130908
#1     89117
#Name: label, dtype: int64

Next let's plot 100 randomly selected benign and cancerous images (with labels 0 and 1, respectively, highlighting the center 32×32 region) from the training dataset to visually inspect the differences, if any.

def plot_images(df, label, n=100):
    row, col = draw.rectangle_perimeter(start=(96//2-32//2,96//2-32//2), end=(96//2+32//2,96//2+32//2))
    df_sub = df.loc[df.label == label].sample(n)
    imfiles = df_sub['id'].values
    plt.figure(figsize=(15,15))
    for i in range(n):
        im = imread(f'{train_dir}/{imfiles[i]}')
        for j in range(-1,2):
            im[row+j, col+j, :] = [0, 255, 0]
        plt.subplot(10,10,i+1)
        plt.imshow(im)
        plt.axis('off')
    plt.suptitle(f'sample train images with label = {label} (highlighting the center 32x32 region)', size=20)
    plt.tight_layout()
    plt.show()

# sample and plot images from different classes
for label in train_df.label.unique():
  plot_images(train_df, label)

Preprocessing

  • The image pixels will be scaled (normalized) to the range [0-1]; this will be done during the batch reading of the images.
  • The 96×96 images will be resized to 64×64, to reduce memory usage (see the sketch after this list).
  • The dataset is not very imbalanced and the data size is large, hence we are not using augmentation-like preprocessing steps.
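
For reference, the per-image effect of the first two steps can be sketched as follows (illustrative only; the actual scaling and resizing will be performed by ImageDataGenerator during batch reading, as shown later):

# rescale a single image to [0-1] and resize it to 64x64 (a sketch)
im = imread(f'{train_dir}/{train_df.id[0]}').astype('float32') / 255.
im_small = tf.image.resize(im, (64, 64)).numpy()  # (96, 96, 3) -> (64, 64, 3)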

Model Architecture

Quite a few models were trained to classify the images, starting from simpler baseline models to more complex models.

Baseline Models

We shall implement the baseline models using keras model subclassing with configurable reusable blocks (so that we can reuse them and change the hyperparameters) and functional APIs.

  • The models will be traditional CNN models, comprising a few Convolution/Pooling blocks (to obtain translation invariance, increase the field of view, reduce the number of parameters etc.), followed by a Flatten layer, a Dense layer and a binary classification layer.
  • There will be a Convolution Block, implemented by the class ConvBlock, which will contain 2 Conv2D layers of the same size followed by a MaxPool2D layer. There can be an optional BatchNormalization layer (to reduce internal covariate shift between the batches) immediately following each of the Conv2D layers. The filter size, kernel size, activation, pooling size and whether batch norm is present or not will be configurable (with default values) and can be specified during instantiation of the block.
  • The next reusable block will be the TopBlock, which will contain a Flatten layer followed by two dense layers, the last of which will be the classification layer with a single output neuron. The size of the dense layer will be configurable.
  • By default the ReLU non-linear activation will be used in the layers, except in the last classifier layer, which will use the sigmoid activation (to keep the output in the range [0-1], so that it can be interpreted as the probability that an image contains tumor cells).

The architecture of the baseline models is shown below:

The batch_size and im_size will also be hyperparameters that can be tuned / changed. We shall use a batch size of 256, and all the images will be resized to 64×64, in order to save memory.

batch_size, im_size = 256, (64, 64) # im_size is a (height, width) tuple, as required by target_size below

class ConvBlock(tf.keras.layers.Layer):
    '''
    implements ConvBlock as
    Conv2D -> Conv2D -> MaxPool OR
    Conv2D -> BatchNorm -> Conv2D -> BatchNorm -> MaxPool 
    as a reusable block in the CNN to be created
    '''
    def __init__(self, n_filter, kernel_sz=(3,3), activation='relu', pool_sz=(2,2), batch_norm=False):
        # initialize with different hyperparameter values, such as number of filters in Conv2D, kernel size, activation,
        # BatchNorm present or not
        super(ConvBlock, self).__init__()
        self.batch_norm = batch_norm
        self.conv_1 = Conv2D(n_filter, kernel_sz, activation=activation)
        self.bn_1 = BatchNormalization()
        self.conv_2 = Conv2D(n_filter, kernel_sz, activation=activation)
        self.bn_2 = BatchNormalization()
        self.pool = MaxPool2D(pool_size=pool_sz)

    def call(self, x):
        # forward pass
        x = self.conv_1(x)
        if self.batch_norm:
          x = self.bn_1(x)
          x = tf.keras.layers.ReLU()(x)
        x = self.conv_2(x)
        if self.batch_norm:
          x = self.bn_2(x)
          x = tf.keras.layers.ReLU()(x)
        return self.pool(x)

class TopBlock(tf.keras.layers.Layer):
    '''
    implements the top layer of the CNN
    Flatten -> Dense -> Classification
    as a reusable block
    '''
    def __init__(self, n_units=256, activation='relu', drop_out=False, drop_rate=0.5):
        # initialize the block with configurable hyperparameter values, e.g., number of neurons in the Dense block, activation,
        # Dropout present or not
        super(TopBlock, self).__init__()
        self.drop_out = drop_out
        self.flat  = tf.keras.layers.Flatten()
        self.dropout = Dropout(drop_rate)
        self.dense = tf.keras.layers.Dense(n_units, activation=activation)
        self.classifier = tf.keras.layers.Dense(1, activation='sigmoid')

    def call(self, x, training=False):
        # forward pass
        x = self.flat(x)
        # dropout is currently disabled; it could be enabled during training via:
        # if training and self.drop_out:
        #     x = self.dropout(x)
        x = self.dense(x)
        return self.classifier(x)
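
For example, a convolution block with 32 filters and batch normalization, and a top block with a 512-unit dense layer, could be instantiated as follows (illustrative; this is exactly what CNNModel2 below does internally):

# instantiate the reusable blocks (a sketch)
block = ConvBlock(32, batch_norm=True) # two Conv2D(32) layers with BN + ReLU, then MaxPool
top = TopBlock(n_units=512)            # Flatten -> Dense(512, relu) -> Dense(1, sigmoid)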

We shall create a couple of models by tuning the hyperparameters (the Conv2D filter sizes and the Dense layer size), with / without batch normalization:

CNNModel1

  • CNNModel1: the first one, without BatchNormalization, with ConvBlock sizes 16 and 32, respectively, and dense layer size 256, as shown in the next code snippet.
class CNNModel1(tf.keras.Model):
    def __init__(self, input_shape=(im_size[0], im_size[1], 3), n_class=1):
        super(CNNModel1, self).__init__()
        # the first conv module
        self.conv_block_1 = ConvBlock(16)
        # the second conv module
        self.conv_block_2 = ConvBlock(32)
        # model top
        self.top_block = TopBlock(n_units=256)

    def call(self, inputs, training=False, **kwargs):
        # forward pass 
        x = self.conv_block_1(inputs)
        x = self.conv_block_2(x)
        return self.top_block(x)       

# instantiate and build the model without BatchNorm
model1 = CNNModel1()
model1.build(input_shape=(batch_size,im_size[0],im_size[1],3))
model1.summary()

Model: "base_model1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv_block (ConvBlock)      multiple                  2768      
                                                                 
 conv_block_1 (ConvBlock)    multiple                  14144     
                                                                 
 top_block (TopBlock)        multiple                  1384961   
                                                                 
=================================================================
Total params: 1,401,873
Trainable params: 1,401,745
Non-trainable params: 128
_________________________________________________________________

The model architecture looks like the following (the model can be defined using keras Sequential too):

CNNModel2

  • CNNModel2: the second one, with BatchNormalization, ConvBlock sizes 32 and 32, respectively, and dense layer size 512, defined / instantiated using the next code snippet.
class CNNModel2(tf.keras.Model):
    def __init__(self, input_shape=(im_size[0], im_size[1], 3), n_class=1):
        super(CNNModel2, self).__init__()
        # the first conv module
        self.conv_block_1 = ConvBlock(32, batch_norm=True)
        # the second conv module
        self.conv_block_2 = ConvBlock(32, batch_norm=True)
        # model top
        self.top_block = TopBlock(n_units=512)

    def call(self, inputs, training=False, **kwargs):
        # forward pass 
        x = self.conv_block_1(inputs)
        x = self.conv_block_2(x)
        return self.top_block(x)       

# instantiate and build the model with BatchNorm
model2 = CNNModel2()
model2.build(input_shape=(batch_size,im_size[0],im_size[1],3))
model2.summary()

Model: "base_model2"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 conv_block_2 (ConvBlock)    multiple                  10400     
                                                                 
 conv_block_3 (ConvBlock)    multiple                  18752     
                                                                 
 top_block_1 (TopBlock)      multiple                  2769921   
                                                                 
=================================================================
Total params: 2,799,073
Trainable params: 2,798,817
Non-trainable params: 256
_________________________________________________________________

We shall also use a few popular architectures, namely VGG16, VGG19 and ResNet50, as backbones: we remove their top layers and add a few dense layers on top for the classification.

Model with VGG16 Backbone

base_model = tf.keras.applications.VGG16(
    input_shape=(im_size[0],im_size[1],3), 
    include_top=False, 
    weights='imagenet'
)

#color_map = get_color_map()
#visualkeras.layered_view(base_model, color_map=color_map, legend=True)
np.random.seed(1)
tf.random.set_seed(1)
model_vgg16 = Sequential([
    base_model,
    Flatten(),    
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dropout(0.3),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], name='vgg16_backbone')
model_vgg16.summary()

Model: "vgg16_backbone"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 vgg16 (Functional)          (None, 3, 3, 512)         14714688  
                                                                 
 flatten_3 (Flatten)         (None, 4608)              0         
                                                                 
 batch_normalization_6 (Batc  (None, 4608)             18432     
 hNormalization)                                                 
                                                                 
 dense_9 (Dense)             (None, 16)                73744     
                                                                 
 dropout_6 (Dropout)         (None, 16)                0         
                                                                 
 dense_10 (Dense)            (None, 8)                 136       
                                                                 
 dropout_7 (Dropout)         (None, 8)                 0         
                                                                 
 batch_normalization_7 (Batc  (None, 8)                32        
 hNormalization)                                                 
                                                                 
 dense_11 (Dense)            (None, 1)                 9         
                                                                 
=================================================================
Total params: 14,807,041
Trainable params: 14,797,809
Non-trainable params: 9,232
________________________________________________________________

Model with VGG19 Backbone

base_model = tf.keras.applications.VGG19(
    input_shape=(im_size[0],im_size[1],3), 
    include_top=False, 
    weights='imagenet'
)
#color_map = get_color_map()
#visualkeras.layered_view(base_model, color_map=color_map, legend=True)
np.random.seed(1)
tf.random.set_seed(1)

model_vgg19 = Sequential([
    base_model,
    Flatten(),    
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.3),
    Dense(8, activation='relu'),
    Dropout(0.3),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], name='vgg19_backbone')

model_vgg19.summary()

Model: "vgg19_backbone"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 vgg19 (Functional)          (None, 3, 3, 512)         20024384  
                                                                 
 flatten_1 (Flatten)         (None, 4608)              0         
                                                                 
 batch_normalization_2 (Batc  (None, 4608)             18432     
 hNormalization)                                                 
                                                                 
 dense_3 (Dense)             (None, 16)                73744     
                                                                 
 dropout_2 (Dropout)         (None, 16)                0         
                                                                 
 dense_4 (Dense)             (None, 8)                 136       
                                                                 
 dropout_3 (Dropout)         (None, 8)                 0         
                                                                 
 batch_normalization_3 (Batc  (None, 8)                32        
 hNormalization)                                                 
                                                                 
 dense_5 (Dense)             (None, 1)                 9         
                                                                 
=================================================================
Total params: 20,116,737
Trainable params: 20,107,505
Non-trainable params: 9,232
_________________________________________________________________

Model with ResNet50 Backbone

base_model = tf.keras.applications.ResNet50(
    input_shape=(im_size[0],im_size[1],3), 
    include_top=False, 
    weights='imagenet'
)
#color_map = get_color_map()
#visualkeras.layered_view(base_model, color_map=color_map, legend=True)

#Downloading data from https://storage.googleapis.com/tensorflow/keras-applications/resnet/resnet50_weights_tf_dim_ordering_tf_kernels_notop.h5
#94765736/94765736 [==============================] - 1s 0us/step
np.random.seed(1)
tf.random.set_seed(1)

model_resnet50 = Sequential([
    base_model,
    Flatten(),    
    BatchNormalization(),
    Dense(16, activation='relu'),
    Dropout(0.5),
    Dense(8, activation='relu'),
    Dropout(0.5),
    BatchNormalization(),
    Dense(1, activation='sigmoid')
], 'resnet50_backbone')

model_resnet50.summary()

Model: "resnet50_backbone"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 resnet50 (Functional)       (None, 3, 3, 2048)        23587712  
                                                                 
 flatten_1 (Flatten)         (None, 18432)             0         
                                                                 
 batch_normalization_4 (Batc  (None, 18432)            73728     
 hNormalization)                                                 
                                                                 
 dense_2 (Dense)             (None, 16)                294928    
                                                                 
 dropout_1 (Dropout)         (None, 16)                0         
                                                                 
 dense_3 (Dense)             (None, 8)                 136       
                                                                 
 dropout_2 (Dropout)         (None, 8)                 0         
                                                                 
 batch_normalization_5 (Batc  (None, 8)                32        
 hNormalization)                                                 
                                                                 
 dense_4 (Dense)             (None, 1)                 9         
                                                                 
=================================================================
Total params: 23,956,545
Trainable params: 23,866,545
Non-trainable params: 90,000
_________________________________________________________________

Now we need to read the images. To do this automatically during the training phase, we shall use ImageDataGenerator with the flow_from_dataframe() method, holding out 25% of the training images for validation performance evaluation.

# scale the images to have pixel values in between [0-1]
# create 75-25 train-validation split of the training dataset for model evaluation
generator = ImageDataGenerator(rescale=1./255, validation_split=0.25)

train_data = generator.flow_from_dataframe(
    dataframe = train_df,
    x_col='id', # filenames
    y_col='label', # labels
    directory=train_dir,
    subset='training',
    class_mode='binary',
    batch_size=batch_size,
    target_size=im_size)

val_data = generator.flow_from_dataframe(
    dataframe=train_df,
    x_col='id', # filenames
    y_col='label', # labels
    directory=train_dir,
    subset="validation",
    class_mode='binary',
    batch_size=batch_size,
    target_size=im_size)

# Found 165019 validated image filenames belonging to 2 classes.
# Found 55006 validated image filenames belonging to 2 classes.
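
A quick way to verify the generators (a sketch): pulling a single batch should yield 256 images of shape 64×64×3 with pixel values scaled to [0, 1], along with a vector of 256 binary labels.

# peek at one batch from the training generator
xb, yb = next(train_data)
print(xb.shape, yb.shape, xb.min(), xb.max())
# e.g. (256, 64, 64, 3) (256,) 0.0 1.0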

Results and Analysis

The model without the BatchNormalization layers is trained first. The Adam optimizer is used with learning_rate=0.0001 (higher learning rates seemed to diverge), with BCE (binary_crossentropy) as the loss function, and the model is trained for 10 epochs. We shall use the BCE loss and the accuracy metric on the held-out validation dataset for model evaluation.

Results with Baseline Models

With CNNModel1

# compile and fit the first model with Adam optimizer, BCE loss and to be evaluated with accuracy metric
opt = Adam(learning_rate=0.0001)
model1.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
hist = model1.fit(train_data, validation_data=val_data, epochs=10)

Epoch 1/10
645/645 [==============================] - 465s 702ms/step - loss: 0.4813 - accuracy: 0.7729 - val_loss: 0.4514 - val_accuracy: 0.7940
Epoch 2/10
645/645 [==============================] - 313s 486ms/step - loss: 0.4457 - accuracy: 0.7974 - val_loss: 0.4334 - val_accuracy: 0.8033
Epoch 3/10
645/645 [==============================] - 295s 458ms/step - loss: 0.4254 - accuracy: 0.8090 - val_loss: 0.4130 - val_accuracy: 0.8156
Epoch 4/10
645/645 [==============================] - 304s 472ms/step - loss: 0.4085 - accuracy: 0.8181 - val_loss: 0.4145 - val_accuracy: 0.8127
Epoch 5/10
645/645 [==============================] - 293s 454ms/step - loss: 0.3970 - accuracy: 0.8232 - val_loss: 0.4027 - val_accuracy: 0.8197
Epoch 6/10
645/645 [==============================] - 307s 476ms/step - loss: 0.3859 - accuracy: 0.8291 - val_loss: 0.3780 - val_accuracy: 0.8340
Epoch 7/10
645/645 [==============================] - 298s 462ms/step - loss: 0.3756 - accuracy: 0.8343 - val_loss: 0.3716 - val_accuracy: 0.8381
Epoch 8/10
645/645 [==============================] - 299s 464ms/step - loss: 0.3689 - accuracy: 0.8372 - val_loss: 0.3652 - val_accuracy: 0.8383
Epoch 9/10
645/645 [==============================] - 303s 470ms/step - loss: 0.3609 - accuracy: 0.8416 - val_loss: 0.3569 - val_accuracy: 0.8441
Epoch 10/10
645/645 [==============================] - 297s 461ms/step - loss: 0.3544 - accuracy: 0.8449 - val_loss: 0.3611 - val_accuracy: 0.8432

As can be seen, the validation loss went as low as 0.3611 and the validation accuracy went up to ~84%. The next figure shows how both the training and validation loss decreased over the epochs, whereas both the training and validation accuracy increased steadily during training.

def plot_hist(hist):
    plt.figure(figsize=(10,5))
    plt.subplot(121)
    plt.plot(hist.history["accuracy"])
    plt.plot(hist.history['val_accuracy'])
    plt.legend(["Accuracy","Validation Accuracy"])
    plt.ylabel("Accuracy")
    plt.xlabel("Epoch")
    plt.grid()
    plt.subplot(122)
    plt.plot(hist.history['loss'])
    plt.plot(hist.history['val_loss'])
    plt.title("Model Evaluation")
    plt.ylabel("Loss")
    plt.xlabel("Epoch")
    plt.grid()
    plt.legend(["Loss","Validation Loss"])
    plt.tight_layout()
    plt.show()

plot_hist(hist)

With CNNModel2

Next we trained the model with batch normalization enabled and the other hyperparameters tuned as described above. The optimizer and the number of epochs used were the same as above.

# compile and fit the second model with Adam optimizer, BCE loss and to be evaluated with accuracy metric
opt = Adam(learning_rate=0.0001)
model2.compile(optimizer=opt, loss='binary_crossentropy', metrics=['accuracy'])
hist = model2.fit(train_data, validation_data=val_data, epochs=10)

Epoch 1/10
645/645 [==============================] - 339s 524ms/step - loss: 0.4674 - accuracy: 0.7898 - val_loss: 0.5076 - val_accuracy: 0.7524
Epoch 2/10
645/645 [==============================] - 295s 457ms/step - loss: 0.3545 - accuracy: 0.8459 - val_loss: 0.4115 - val_accuracy: 0.8156
Epoch 3/10
645/645 [==============================] - 290s 450ms/step - loss: 0.3094 - accuracy: 0.8687 - val_loss: 0.4750 - val_accuracy: 0.7899
Epoch 4/10
645/645 [==============================] - 296s 458ms/step - loss: 0.2780 - accuracy: 0.8837 - val_loss: 0.2989 - val_accuracy: 0.8750
Epoch 5/10
645/645 [==============================] - 295s 457ms/step - loss: 0.2524 - accuracy: 0.8958 - val_loss: 0.3228 - val_accuracy: 0.8658
Epoch 6/10
645/645 [==============================] - 295s 457ms/step - loss: 0.2236 - accuracy: 0.9088 - val_loss: 0.3009 - val_accuracy: 0.8763
Epoch 7/10
645/645 [==============================] - 292s 453ms/step - loss: 0.1917 - accuracy: 0.9246 - val_loss: 0.3610 - val_accuracy: 0.8610
Epoch 8/10
645/645 [==============================] - 293s 454ms/step - loss: 0.1535 - accuracy: 0.9428 - val_loss: 0.5574 - val_accuracy: 0.8076
Epoch 9/10
645/645 [==============================] - 289s 448ms/step - loss: 0.1121 - accuracy: 0.9628 - val_loss: 0.3404 - val_accuracy: 0.8670
Epoch 10/10
645/645 [==============================] - 296s 459ms/step - loss: 0.0758 - accuracy: 0.9787 - val_loss: 0.4302 - val_accuracy: 0.8520

As can be seen, the validation loss only went down to 0.4302 while the validation accuracy went up to ~85%. The next figure shows how the training loss / accuracy steadily decreased / increased over the epochs, respectively, whereas the validation loss / accuracy became unstable.

plot_hist(hist)

Results with the model with VGG16 backbone

We shall attempt transfer learning / fine-tuning (by starting with weights pretrained on ImageNet and keeping the VGG16 backbone weights frozen) and also train VGG16 from scratch, and compare the results obtained in terms of accuracy and ROC AUC (area under the ROC curve).

1. Transfer Learning / Fine Tuning

opt = tf.keras.optimizers.Adam(0.001)
base_model.trainable = False
model_vgg16.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])
hist = model_vgg16.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)

plot_hist(hist)

2. Training from Scratch

base_model.trainable = True
model_vgg16.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])

%%time 
hist = model_vgg16.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)


Epoch 1/20
645/645 [==============================] - 867s 1s/step - loss: 0.5446 - accuracy: 0.7338 - auc_6: 0.7940 - val_loss: 0.5420 - val_accuracy: 0.7202 - val_auc_6: 0.7990
Epoch 2/20
645/645 [==============================] - 524s 812ms/step - loss: 0.5008 - accuracy: 0.7689 - auc_6: 0.8329 - val_loss: 0.9227 - val_accuracy: 0.6275 - val_auc_6: 0.7825
Epoch 3/20
645/645 [==============================] - 543s 842ms/step - loss: 0.4721 - accuracy: 0.7896 - auc_6: 0.8551 - val_loss: 0.5896 - val_accuracy: 0.7274 - val_auc_6: 0.8574
Epoch 4/20
645/645 [==============================] - 544s 842ms/step - loss: 0.4151 - accuracy: 0.8228 - auc_6: 0.8908 - val_loss: 156.7007 - val_accuracy: 0.6567 - val_auc_6: 0.8239
Epoch 5/20
645/645 [==============================] - 534s 828ms/step - loss: 0.3698 - accuracy: 0.8484 - auc_6: 0.9156 - val_loss: 0.3955 - val_accuracy: 0.8148 - val_auc_6: 0.9197
Epoch 6/20
645/645 [==============================] - 517s 802ms/step - loss: 0.3137 - accuracy: 0.8809 - auc_6: 0.9401 - val_loss: 124019.8984 - val_accuracy: 0.7569 - val_auc_6: 0.8264
Epoch 7/20
645/645 [==============================] - 486s 752ms/step - loss: 0.2689 - accuracy: 0.9010 - auc_6: 0.9564 - val_loss: 0.2424 - val_accuracy: 0.9119 - val_auc_6: 0.9729
Epoch 8/20
645/645 [==============================] - 474s 735ms/step - loss: 0.2460 - accuracy: 0.9096 - auc_6: 0.9634 - val_loss: 0.2357 - val_accuracy: 0.8903 - val_auc_6: 0.9777
Epoch 9/20
645/645 [==============================] - 474s 734ms/step - loss: 0.2245 - accuracy: 0.9210 - auc_6: 0.9694 - val_loss: 0.3355 - val_accuracy: 0.8652 - val_auc_6: 0.9742
Epoch 10/20
645/645 [==============================] - 476s 738ms/step - loss: 0.2222 - accuracy: 0.9214 - auc_6: 0.9699 - val_loss: 3.3083 - val_accuracy: 0.8834 - val_auc_6: 0.9826
Epoch 11/20
645/645 [==============================] - 515s 798ms/step - loss: 0.1935 - accuracy: 0.9338 - auc_6: 0.9767 - val_loss: 0.1701 - val_accuracy: 0.9417 - val_auc_6: 0.9842
Epoch 12/20
645/645 [==============================] - 532s 825ms/step - loss: 0.1894 - accuracy: 0.9345 - auc_6: 0.9776 - val_loss: 0.1957 - val_accuracy: 0.9417 - val_auc_6: 0.9817
Epoch 13/20
645/645 [==============================] - 474s 734ms/step - loss: 0.1702 - accuracy: 0.9420 - auc_6: 0.9814 - val_loss: 0.3708 - val_accuracy: 0.9157 - val_auc_6: 0.9851
Epoch 14/20
645/645 [==============================] - 470s 728ms/step - loss: 0.1556 - accuracy: 0.9480 - auc_6: 0.9840 - val_loss: 1.0746 - val_accuracy: 0.9535 - val_auc_6: 0.9858
Epoch 15/20
645/645 [==============================] - 471s 729ms/step - loss: 0.1523 - accuracy: 0.9495 - auc_6: 0.9845 - val_loss: 2.1778 - val_accuracy: 0.9524 - val_auc_6: 0.9873
Epoch 16/20
645/645 [==============================] - 471s 729ms/step - loss: 0.1333 - accuracy: 0.9557 - auc_6: 0.9878 - val_loss: 0.9888 - val_accuracy: 0.7773 - val_auc_6: 0.9137
Epoch 17/20
645/645 [==============================] - 474s 735ms/step - loss: 0.1226 - accuracy: 0.9599 - auc_6: 0.9896 - val_loss: 0.1557 - val_accuracy: 0.9563 - val_auc_6: 0.9855
Epoch 18/20
645/645 [==============================] - 476s 737ms/step - loss: 0.1123 - accuracy: 0.9634 - auc_6: 0.9909 - val_loss: 0.5402 - val_accuracy: 0.9573 - val_auc_6: 0.9860
Epoch 19/20
645/645 [==============================] - 508s 787ms/step - loss: 0.1068 - accuracy: 0.9654 - auc_6: 0.9915 - val_loss: 0.1696 - val_accuracy: 0.9542 - val_auc_6: 0.9841
Epoch 20/20
645/645 [==============================] - 471s 730ms/step - loss: 0.0949 - accuracy: 0.9701 - auc_6: 0.9929 - val_loss: 116.0880 - val_accuracy: 0.8824 - val_auc_6: 0.9839
CPU times: user 2h 47min 57s, sys: 11min 9s, total: 2h 59min 7s
Wall time: 2h 52min 26s

plot_hist(hist)




Results with the model with VGG19 Backbone

This is the model that performed the best, obtaining ~88% public ROC AUC score on the unseen test dataset.

Training from Scratch

opt = tf.keras.optimizers.Adam(0.001)
base_model.trainable = True
model_vgg19.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])

# start training
%%time 
hist = model_vgg19.fit(train_data, epochs = 20, validation_data = val_data, verbose=1)

Epoch 1/20
645/645 [==============================] - 1362s 2s/step - loss: 0.5519 - accuracy: 0.7240 - auc_1: 0.7850 - val_loss: 1294.4092 - val_accuracy: 0.7835 - val_auc_1: 0.8600
Epoch 2/20
645/645 [==============================] - 585s 907ms/step - loss: 0.4525 - accuracy: 0.7966 - auc_1: 0.8673 - val_loss: 1.1149 - val_accuracy: 0.6579 - val_auc_1: 0.7463
Epoch 3/20
645/645 [==============================] - 557s 863ms/step - loss: 0.3838 - accuracy: 0.8365 - auc_1: 0.9082 - val_loss: 1.0965 - val_accuracy: 0.7396 - val_auc_1: 0.7908
Epoch 4/20
645/645 [==============================] - 557s 864ms/step - loss: 0.3291 - accuracy: 0.8682 - auc_1: 0.9331 - val_loss: 0.5429 - val_accuracy: 0.7161 - val_auc_1: 0.9106
Epoch 5/20
645/645 [==============================] - 555s 859ms/step - loss: 0.2725 - accuracy: 0.8989 - auc_1: 0.9540 - val_loss: 0.2246 - val_accuracy: 0.9134 - val_auc_1: 0.9708
Epoch 6/20
645/645 [==============================] - 555s 860ms/step - loss: 0.2316 - accuracy: 0.9171 - auc_1: 0.9663 - val_loss: 0.2160 - val_accuracy: 0.9162 - val_auc_1: 0.9709
Epoch 7/20
645/645 [==============================] - 554s 858ms/step - loss: 0.2149 - accuracy: 0.9233 - auc_1: 0.9707 - val_loss: 0.2129 - val_accuracy: 0.9202 - val_auc_1: 0.9773
Epoch 8/20
645/645 [==============================] - 558s 865ms/step - loss: 0.1930 - accuracy: 0.9333 - auc_1: 0.9760 - val_loss: 0.1660 - val_accuracy: 0.9395 - val_auc_1: 0.9820
Epoch 9/20
645/645 [==============================] - 554s 858ms/step - loss: 0.1804 - accuracy: 0.9386 - auc_1: 0.9788 - val_loss: 19353.1387 - val_accuracy: 0.9063 - val_auc_1: 0.9763
Epoch 10/20
645/645 [==============================] - 555s 859ms/step - loss: 0.1704 - accuracy: 0.9424 - auc_1: 0.9810 - val_loss: 35.3024 - val_accuracy: 0.8789 - val_auc_1: 0.9433
Epoch 11/20
645/645 [==============================] - 551s 854ms/step - loss: 0.1642 - accuracy: 0.9435 - auc_1: 0.9823 - val_loss: 1148.5889 - val_accuracy: 0.8950 - val_auc_1: 0.9636
Epoch 12/20
645/645 [==============================] - 552s 856ms/step - loss: 0.1558 - accuracy: 0.9476 - auc_1: 0.9840 - val_loss: 0.1572 - val_accuracy: 0.9445 - val_auc_1: 0.9827
Epoch 13/20
645/645 [==============================] - 549s 851ms/step - loss: 0.1430 - accuracy: 0.9524 - auc_1: 0.9862 - val_loss: 0.1498 - val_accuracy: 0.9457 - val_auc_1: 0.9855
Epoch 14/20
645/645 [==============================] - 552s 856ms/step - loss: 0.1342 - accuracy: 0.9556 - auc_1: 0.9877 - val_loss: 0.1678 - val_accuracy: 0.9412 - val_auc_1: 0.9827
Epoch 15/20
645/645 [==============================] - 550s 852ms/step - loss: 0.1294 - accuracy: 0.9565 - auc_1: 0.9884 - val_loss: 205.0925 - val_accuracy: 0.9417 - val_auc_1: 0.9852
Epoch 16/20
645/645 [==============================] - 550s 853ms/step - loss: 0.1144 - accuracy: 0.9623 - auc_1: 0.9907 - val_loss: 183.5535 - val_accuracy: 0.9473 - val_auc_1: 0.9856
Epoch 17/20
645/645 [==============================] - 553s 857ms/step - loss: 0.1083 - accuracy: 0.9647 - auc_1: 0.9917 - val_loss: 132.3142 - val_accuracy: 0.9428 - val_auc_1: 0.9800
Epoch 18/20
645/645 [==============================] - 559s 867ms/step - loss: 0.1038 - accuracy: 0.9657 - auc_1: 0.9922 - val_loss: 1007.8879 - val_accuracy: 0.9205 - val_auc_1: 0.9828
Epoch 19/20
645/645 [==============================] - 557s 864ms/step - loss: 0.1012 - accuracy: 0.9666 - auc_1: 0.9926 - val_loss: 121.1083 - val_accuracy: 0.9210 - val_auc_1: 0.9599
Epoch 20/20
645/645 [==============================] - 578s 896ms/step - loss: 0.0885 - accuracy: 0.9709 - auc_1: 0.9941 - val_loss: 2187.4048 - val_accuracy: 0.9541 - val_auc_1: 0.9863
CPU times: user 3h 8min 54s, sys: 13min 30s, total: 3h 22min 24s
Wall time: 3h 19min 3s

# save model
model_vgg19.save('model_vgg19_20.h5')

plot_hist(hist)

Results with the model with ResNet50 Backbone

base_model.trainable = True
opt = tf.keras.optimizers.Adam(0.001)
model_resnet50.compile(loss='binary_crossentropy', optimizer=opt, metrics=['accuracy', tf.keras.metrics.AUC()])

%%time 
hist = model_resnet50.fit(train_data, validation_data=val_data, epochs=20, verbose=1)

Epoch 1/20
645/645 [==============================] - 1317s 2s/step - loss: 0.3414 - accuracy: 0.8588 - auc: 0.9298 - val_loss: 1.1011 - val_accuracy: 0.5951 - val_auc: 0.3934
Epoch 2/20
645/645 [==============================] - 593s 919ms/step - loss: 0.2412 - accuracy: 0.9122 - auc: 0.9653 - val_loss: 0.5169 - val_accuracy: 0.8323 - val_auc: 0.9167
Epoch 3/20
645/645 [==============================] - 450s 697ms/step - loss: 0.2124 - accuracy: 0.9235 - auc: 0.9722 - val_loss: 0.2642 - val_accuracy: 0.8980 - val_auc: 0.9597
Epoch 4/20
645/645 [==============================] - 448s 695ms/step - loss: 0.1941 - accuracy: 0.9305 - auc: 0.9766 - val_loss: 0.3500 - val_accuracy: 0.8499 - val_auc: 0.9337
Epoch 5/20
645/645 [==============================] - 446s 692ms/step - loss: 0.1791 - accuracy: 0.9353 - auc: 0.9795 - val_loss: 0.6030 - val_accuracy: 0.7776 - val_auc: 0.8575
Epoch 6/20
645/645 [==============================] - 449s 696ms/step - loss: 0.1673 - accuracy: 0.9416 - auc: 0.9823 - val_loss: 1.3558 - val_accuracy: 0.7927 - val_auc: 0.8046
Epoch 7/20
645/645 [==============================] - 453s 701ms/step - loss: 0.1497 - accuracy: 0.9474 - auc: 0.9852 - val_loss: 0.4392 - val_accuracy: 0.8672 - val_auc: 0.9208
Epoch 8/20
645/645 [==============================] - 463s 718ms/step - loss: 0.1431 - accuracy: 0.9497 - auc: 0.9863 - val_loss: 0.3357 - val_accuracy: 0.9010 - val_auc: 0.9642
Epoch 9/20
645/645 [==============================] - 454s 703ms/step - loss: 0.1331 - accuracy: 0.9531 - auc: 0.9879 - val_loss: 1.3040 - val_accuracy: 0.7925 - val_auc: 0.8910
Epoch 10/20
645/645 [==============================] - 490s 760ms/step - loss: 0.1234 - accuracy: 0.9571 - auc: 0.9893 - val_loss: 0.3096 - val_accuracy: 0.9203 - val_auc: 0.9672
Epoch 11/20
645/645 [==============================] - 451s 698ms/step - loss: 0.1124 - accuracy: 0.9605 - auc: 0.9909 - val_loss: 0.9762 - val_accuracy: 0.7699 - val_auc: 0.8315
Epoch 12/20
645/645 [==============================] - 458s 710ms/step - loss: 0.1117 - accuracy: 0.9607 - auc: 0.9909 - val_loss: 1.2930 - val_accuracy: 0.7756 - val_auc: 0.7858
Epoch 13/20
645/645 [==============================] - 503s 779ms/step - loss: 0.1023 - accuracy: 0.9643 - auc: 0.9922 - val_loss: 0.3675 - val_accuracy: 0.9167 - val_auc: 0.9695
Epoch 14/20
645/645 [==============================] - 507s 786ms/step - loss: 0.0978 - accuracy: 0.9653 - auc: 0.9927 - val_loss: 0.8825 - val_accuracy: 0.7811 - val_auc: 0.9036
Epoch 15/20
645/645 [==============================] - 477s 740ms/step - loss: 0.0956 - accuracy: 0.9667 - auc: 0.9931 - val_loss: 0.6381 - val_accuracy: 0.8942 - val_auc: 0.9421
Epoch 16/20
645/645 [==============================] - 451s 699ms/step - loss: 0.0888 - accuracy: 0.9688 - auc: 0.9938 - val_loss: 0.2691 - val_accuracy: 0.9328 - val_auc: 0.9725
Epoch 17/20
645/645 [==============================] - 461s 714ms/step - loss: 0.0876 - accuracy: 0.9695 - auc: 0.9940 - val_loss: 0.5430 - val_accuracy: 0.8916 - val_auc: 0.9415
Epoch 18/20
645/645 [==============================] - 451s 699ms/step - loss: 0.0805 - accuracy: 0.9721 - auc: 0.9948 - val_loss: 0.7418 - val_accuracy: 0.8526 - val_auc: 0.9025
Epoch 19/20
645/645 [==============================] - 445s 690ms/step - loss: 0.0823 - accuracy: 0.9714 - auc: 0.9946 - val_loss: 0.3851 - val_accuracy: 0.9273 - val_auc: 0.9635
Epoch 20/20
645/645 [==============================] - 442s 684ms/step - loss: 0.0798 - accuracy: 0.9723 - auc: 0.9948 - val_loss: 0.3040 - val_accuracy: 0.9145 - val_auc: 0.9710
CPU times: user 2h 45min 23s, sys: 10min 53s, total: 2h 56min 16s
Wall time: 2h 51min 7s

plot_hist(hist)

As seen from the above figure, the ResNet50 model's validation accuracy and loss both fluctuate quite a bit and are not stable; the model needs more training.

Predictions on test images

Again, the test image ids were stored in a dataframe and the test images were read automatically during the prediction phase with the flow_from_dataframe() function, as before. The predictions were submitted to Kaggle to obtain the score (although leaderboard selection was disabled).

import os
images_test = pd.DataFrame({'id':os.listdir(test_dir)})
generator_test = ImageDataGenerator(rescale=1./255) # scale the test images to have pixel values in [0-1]

test_data = generator_test.flow_from_dataframe(
    dataframe = images_test,
    x_col='id', # filenames
    directory=test_dir,
    class_mode=None,
    batch_size=1,
    target_size=im_size,
    shuffle=False)

# predict with the model
predictions = model1.predict(test_data, verbose=1)

Found 57458 validated image filenames.
57458/57458 [==============================] - 209s 4ms/step

predictions = predictions.squeeze()
predictions.shape
# (57458,)

# create submission dataframe for kaggle submission
submission_df = pd.DataFrame()
submission_df['id'] = images_test['id'].apply(lambda x: x.split('.')[0])
submission_df['label'] = list(map(lambda x: 0 if x < 0.5 else 1, predictions))

submission_df['label'].value_counts()
submission_df.to_csv('submission2.csv', index=False)

print(submission_df.head())

                                         id  label
0  86cbac8eef45d436a8b1c7469ada0894f0b684cc      0
1  cb452f428031d335eadb8dc8eb4c7744b0cab276      0
2  6ff0a28ac41715a0646473c78b4130c64929a21d      1
3  b56127fff86222a42457749884daca4df8fef050      0
4  787a40cb598ad5f2afe00937af2488e68df2a4fc      1

  • With the first baseline model (CNNModel1) a ~82.7% ROC score was obtained on the unseen test dataset in kaggle, as shown below, which was higher than the one obtained with the second baseline model (CNNModel2).
  • The ROC scores obtained with the models with VGG16 / VGG19 backbones were much better, and the results obtained by training them from scratch were better than those obtained with transfer learning / fine tuning (since we have a huge dataset) – the best ROC score, ~88.1%, was obtained with the model with the VGG19 backbone trained from scratch.

Git Repository

https://github.com/sandipan/Coursera-Deep-Learning-Histopathologic-Cancer-Detection-Project

Kaggle Notebook

https://www.kaggle.com/code/sandipanumbc/vgg16-19-resnet50

Conclusion

As we could see, the CNN model without BatchNormalization (CNNModel1) outperformed the model with it (CNNModel2), given that a small number of epochs (namely 10) was used to train both the models. It's likely that CNNModel2 would improve its generalizability on the unseen test images if it were trained for more epochs (e.g., 30 epochs). We also trained popular CNN architectures such as VGG16/19, both using pre-trained imagenet weights (with transfer learning / fine tuning) and training them from scratch, and we obtained much better accuracy, particularly with the models trained from scratch. We could try more recent and complex models such as ResNet50/101, InceptionV3 or EfficientNet too.

Custom Object Detection with transfer learning with pre-trained YOLO-V4 model

In this blog we shall demonstrate how to start with a pre-trained Yolo (You only look once) V4 end-to-end one-stage object detection model (trained on MS COCO dataset) and train it to detect a custom object (Raccoon).

Dataset Description / Exploration

  • Roboflow allows us to download the annotated images (with bounding boxes for the object Raccoon to be detected) in different formats; here we shall use the darknet text format for the bounding box annotations, which can be used for both YOLO V3 and V4, as shown in the next figure.
  • The following figure shows an image and the corresponding annotation text, denoting the position of the bounding box for the Raccoon object in the image. It will be used to further train the YOLO-V4 model, to make it able to detect the custom object Raccoon.
  • In the above annotation, the first two coordinates represent the (normalized) center of the bounding box and the next two represent the (normalized) width and height of the bounding box, respectively.
  • From the above representation, the bounding box left-top and right-bottom coordinates can be computed as follows:

(x1, y1) = (416 × (0.3790 − 0.4904/2), 416 × (0.4796 − 0.7115/2)) ≈ (56, 52)
(x2, y2) = (416 × (0.3790 + 0.4904/2), 416 × (0.4796 + 0.7115/2)) ≈ (260, 348)

the corresponding bounding box can be drawn as shown in the next figure:
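To make the conversion concrete, here is a small helper (a sketch; the 416×416 image size and the annotation values are the ones from the example above) that converts a darknet-style annotation (normalized center-x, center-y, width, height) into pixel corner coordinates:

def darknet_to_corners(cx, cy, w, h, img_w, img_h):
    # darknet stores a box as normalized (center-x, center-y, width, height);
    # convert it to pixel (left, top) and (right, bottom) corners
    x1 = int(round(img_w * (cx - w / 2)))
    y1 = int(round(img_h * (cy - h / 2)))
    x2 = int(round(img_w * (cx + w / 2)))
    y2 = int(round(img_h * (cy + h / 2)))
    return (x1, y1), (x2, y2)

print(darknet_to_corners(0.3790, 0.4796, 0.4904, 0.7115, 416, 416))
# ((56, 52), (260, 348))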

Objective & Outline

  • The original YOLO-V4 deep learning model was trained on the MS COCO dataset, so it can detect objects belonging to 80 different classes. Unfortunately, those 80 classes don't include Raccoon; hence, without explicit training the pre-trained model will not be able to identify the Raccoons in the image dataset.
  • Also, we have only 196 images of Raccoon, which is a pretty small number, so it's not feasible to train the YOLO-V4 model from scratch.
  • However, this is an ideal scenario to apply transfer learning. Since the task is the same, i.e., object detection, we can always start with the weights pretrained on the COCO dataset and then train the model on our images, starting from those initial weights.
  • Instead of training a model from scratch, let's use pre-trained YOLO-V4 weights which have been trained up to 137 convolutional layers. Since the original model was trained on the COCO dataset with 80 classes and we are interested in detection of an object of a single class (namely Raccoon), we need to modify the corresponding layers (in the config file).

Data cleaning / feature engineering

  • Images are normalized to have values in between 0-1. Histogram equalization / contrast stretching can be used for image enhancement. Since this task involves object localization, data augmentation was not used, as it would then require the re-computation of the bounding boxes.
  • No other feature engineering technique was used, since the deep neural net contains many convolution layers that automatically generate many different features, the earlier layers learning simpler features and the later layers more complicated ones.

Training with transfer learning – Configuration and hyperparameter settings

  • Google colab is to be used to train the model on GPU.
  • To start with we need to first clone the darknet source from the following git repository using the following command:
!git clone https://github.com/AlexeyAB/darknet/
  • We need to change the Makefile to enable GPU and opencv and run make to create the darknet
    executable.
  • Next we need to download the pre-trained model yolov4.conv.137 and copy it to the right folder, using the following code.
!wget -P build/darknet/x64/ https://github.com/AlexeyAB/darknet/releases/download/darknet_yolo_v3_optimal/yolov4.conv.137
  • We need to copy the input image / annotation files to the right folders and provide the information for training data to the model, e.g., create a file build/darknet/x64/data/obj.data, that looks like the following:
  • Here the training and validation text files list the names of the training and validation set images, whereas backup represents the location for saving the model checkpoints while training.
  • We need to create a configuration file (e.g., yolov4_train.cfg) for training the model on our images. A relevant portion of the config file (with a few of the hyperparameters) to be used for training the YOLO-V4 model is shown below:
  • The total number of images we have is 196, out of which 153 are used for training and the remaining for validation.
  • Since the number of training images is small, we keep the batch size hyperparameter for training small too (we tried 3 different values, namely 16, 8 and 4).
  • Notice that we need to change the number of classes to 1 (since we are interested in detecting a single object here), as opposed to 80 in the original config file, and the number of filters in the preceding convolutional layers to (1+5) x 3 = 18, as shown in the next figure, a part of the config file again (see also the config excerpt after this list).
  • The number of batches for which the model is trained is 2000 (since it is recommended to be at least 2000 x num_classes), with the model checkpoints stored at batches 500, 1000 and 2000, respectively.
    Now we can start training the model on our images, initializing it with the pretrained weights, using the following line of code.
!./darknet detector train build/darknet/x64/data/obj.data cfg/yolov4_train.cfg build/darknet/x64/yolov4.conv.137 -dont_show
  • A few iterations of training are shown in the below figure:
  • It takes around ~2 hrs to finish 2000 batches and the final model weights are stored in a file
    (yolov4_train_final.weights) on the backup folder provide.
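For reference, the kind of config edits described above would look like the following (a minimal, assumed excerpt of the darknet cfg; only the [yolo] blocks and the convolutional layers immediately preceding them need these changes):

[convolutional]
size=1
stride=1
pad=1
# filters = (num_classes + 5) x 3 = (1 + 5) x 3 = 18
filters=18
activation=linear

[yolo]
# classes changed from 80 (COCO) to 1 (our single custom class: Raccoon)
classes=1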

Model Selection and Testing / Prediction

  • Since the batch size 8 and subdivision size 2 resulted in higher accuracy (in terms of IOU), the
    corresponding model is selected as the best fit model.
  • The final model checkpoint saved can be used for prediction (with an unseen image test.jpg) with the following line of code:
!./darknet detector test build/darknet/x64/data/obj.data cfg/yolov4_train.cfg build/darknet/x64/backup/yolov4_train_latest.weights -dont_show test.jpg
  • Around 500 test images with raccoons were used for custom object detection with the trained model. The following figures show the custom objects (Raccoons) detected by the model in a few unseen images.

The next animation shows how the raccoons are detected with the model:

Summary

  • With a relatively small number of iterations and a small number of training images we could do a decent job of detecting custom objects using transfer learning.
  • The YOLO model's advantage is its speed (since it is a one-stage object detection model); starting with weights pretrained on MS-COCO for object detection, followed by transfer learning, one can detect custom objects with a few hours of training.

Next steps

We obtained a few false positives and false negatives with the model trained. To improve the performance of the model,

  • We can train the model for more batches (~10k)
  • Increase the input images with data augmentation + re-annotation
  • Tune many of the hyperparameters (momentum, decay etc.) of the model

References

  1. https://github.com/AlexeyAB/darknet/
  2. https://public.roboflow.com/object-detection/raccoon
  3. https://stackoverflow.com/questions/65204524/training-custom-object-detection-model-bin-bash-darknetno-such-file-or-di/70562641#70562641

Implementing a few algorithms with python from scratch

In this blog, we shall focus on implementing a few famous algorithms with python – the algorithms will be from various topics from computer science, such as graph theory, compiler construction, theory of computation, numerical analysis, data structures, digital logic, networking, operating systems, DBMS, cryptography, optimization, quantum computation, game theory etc. and all of the implementations will be from scratch.

Let’s start with a problem called Segmented Least Squares, which will be solved using Dynamic Programming.

Segmented Least Squares

  • Least squares is a foundational problem in statistics and numerical analysis. Given n points in the plane: (x1, y1), (x2, y2), . . . , (xn, yn), the objective is to fit a line y = ax + b that minimizes the sum of the squared errors.
  • Segmented least squares is a more general problem, where
    • The data points lie roughly on a sequence of several line segments.
    • Given n points in the plane (x1, y1), (x2, y2), . . . , (xn, yn), with x1 < x2 < … < xn, the objective is to find a sequence of lines that minimizes a cost function f, with a reasonable choice for f balancing accuracy (goodness of fit) against parsimony (number of lines).
    • Concretely, the goal is to minimize the tradeoff function E + c L, where
      • E is the sum of the sums of squared errors in each segment,
      • L is the number of lines, and
      • c > 0 is a constant (the cost of a line).

The following figure shows how the dynamic programming for the segmented least squares problem is formulated:

Here, M[j] represents the minimum total cost of fitting line segments to the first j points; for the last segment (covering points i through j) we can track its starting point i by keeping a back-pointer array alongside the dynamic programming array. Also, c denotes the cost to draw a line (it acts as a penalty on the number of lines fit). The optimal substructure property gives the following Bellman equation:
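M[j] = min over 1 ≤ i ≤ j of ( e(i, j) + c + M[i−1] ),  with M[0] = 0,

where e(i, j) is the minimum sum of squared errors of a single least-squares line fitted on the points i, …, j. This is exactly the recurrence computed by build_DP_table() below.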

Here is my python implementation for the above DP algorithm, on the following 2D dataset with points (xs, ys) (scatter plotted below):

def ls_fit(xs, ys, m):
    a = (m*sum(xs*ys)-sum(xs)*sum(ys)) / (m*sum(xs**2)-sum(xs)**2)
    b = (sum(ys)-a*sum(xs)) / m
    return a, b

def compute_errors(xs, ys):
    n = len(xs)
    e = np.zeros((n,n))
    for j in range(n):
        for i in range(j+1):
            m = j-i+1
            if m > 1:
                a, b = ls_fit(xs[i:i+m], ys[i:i+m], m)
                e[i,j] = sum((ys[i:i+m] - a*xs[i:i+m] - b)**2)
    return e

def build_DP_table(e, n):
    M = np.zeros(n) # M[j]: minimum total cost for the first j+1 points
    p = np.zeros(n, dtype=int) # backpointers: start index of the last segment
    for j in range(1, n):
        # note: the line cost c is a global, defined before this function is invoked
        cost = [e[i,j] + c + M[i-1] for i in range(j)]
        M[j] = np.min(cost)
        p[j] = np.argmin(cost)
    return M, p

Now build the DP table and plot the least-square line segments obtained with the dynamic programming formulation:

c = 10 # cost of a line
e = compute_errors(xs, ys)
M, p = build_DP_table(e, len(xs))
tol = 2
starts = np.unique(p)
drawn = set([])
plt.plot(xs, ys, 'g.')
for start in starts:
    indices = np.where(abs(p-start) < tol)[0]
    a, b = ls_fit(xs[indices], ys[indices], len(indices))
    if not (a, b) in drawn:
        plt.plot([xs[min(indices)],xs[max(indices)]], [a*xs[min(indices)]+b, a*xs[max(indices)]+b], linewidth=3, 
                 label='line: ({:.2f}, {:.2f})'.format(a,b))
        drawn.add((a,b))
plt.legend()

As expected, the DP found the 3 optimal least-square lines fitted on the data. The following figure shows how the cost of a line c can impact the number of lines created with the dynamic programming method:

Logistic Regression with Gradient Descent

In this problem we shall focus on the implementation of a popular supervised machine learning model (a binary classifier): logistic regression. Given m data points, each with n dimensions (represented by an m x n matrix X), along with a binary label corresponding to each of the data points (an m x 1 vector y), let's try to build a model that learns a set of weights (an (n+1) x 1 parameter vector θ, including the bias) using maximum likelihood estimation (MLE), by minimizing the binary cross entropy (BCE) loss function (which is equivalent to maximizing the log likelihood) on the training dataset (X, y), as shown in the next figure:

The below figure again summarizes the theory / math we are using here to implement Logistic Regression with Gradient Descent optimizer:
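In short, the update rule the code below implements is the following (a plain-text summary; σ denotes the sigmoid function, X is assumed to be augmented with a column of ones for the bias term, and α is the learning rate):

ŷ = σ(Xθ) = 1 / (1 + exp(−Xθ))                      # predicted probabilities
J(θ) = −(1/m) Σ [ y·log(ŷ) + (1−y)·log(1−ŷ) ]       # binary cross entropy loss
∇J(θ) = (1/m) Xᵀ (ŷ − y)                            # gradient of the BCE loss
θ ← θ − α·∇J(θ)                                     # gradient descent update

(the code computes the loss with log base 2, which only rescales J by a constant and does not change the minimizer).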

def VanillaLogisticRegression(x, y): # LR without regularization
    m, n = x.shape
    w = np.zeros((n+1, 1))
    X = np.hstack((np.ones(m)[:,None],x)) # include the feature corresponding to the bias term
    num_epochs = 1000 # number of epochs to run gradient descent, tune this hyperparametrer
    lr = 0.5 # learning rate, tune this hyperparameter
    losses = []
    for _ in range(num_epochs):
        y_hat = 1. / (1. + np.exp(-np.dot(X, w))) # predicted y by the LR model
        J = np.mean(-y*np.log2(y_hat) - (1-y)*np.log2(1-y_hat)) # the binary cross entropy loss function
        grad_J = np.mean((y_hat - y)*X, axis=0) # the gradient of the loss function
        w -= lr * grad_J[:, None] # the gradient descent step, update the parameter vector w
        losses.append(J)
        # test correctness of the implementation:
        # loss J should monotonically decrease & y_hat should get closer to y with increasing iterations
        # print(J)            
    return w

m, n = 1000, 5 # 1000 rows, 5 columns
# randomly generate dataset, note that y can have values as 0 and 1 only
x, y = np.random.random(m*n).reshape(m,n), np.random.randint(0,2,m).reshape(-1,1)
w = VanillaLogisticRegression(x, y)
w # learnt parameters
# array([[-0.0749518 ],
#   [ 0.28592107],
#   [ 0.15202566],
#   [-0.15020757],
#   [ 0.08147078],
#   [-0.18823631]])

Finally, let's compare the above implementation with sklearn's implementation, which uses a more advanced optimization algorithm (lbfgs) by default and is hence likely to converge much faster. If our implementation is correct, both of them should converge to the same global minimum, since the loss function is convex (note that sklearn uses regularization by default; in order to have almost no regularization, we need to set the value of the input hyper-parameter C very high):

from sklearn.linear_model import LogisticRegression
clf = LogisticRegression(random_state=0, C=10**12).fit(x, y)
print(clf.coef_, clf.intercept_)
# [[ 0.28633262  0.15256914 -0.14975667  0.08192404 -0.18780851]] [-0.07612282]

Compare the parameter values obtained from the above implementation and the one obtained with sklearn‘s implementation: they are almost equal.

Also, let’s compare the predicted probabilities obtained using these two different implementations of LR (one from scratch, another one from sklearn‘s library function), as can be seen from the following scatterplot, they are almost identical:

X = np.hstack((np.ones(len(x))[:,None], x)) # augment x with the bias feature, as done inside the function
pred_probs = 1 / (1 + np.exp(-X@w))
plt.scatter(pred_probs, clf.predict_proba(x)[:,1])
plt.grid()
plt.xlabel('pred prob', size=20)
plt.ylabel('pred prob (sklearn)', size=20)
plt.show()

Finally, let’s compute the accuracies obtained, they are identical too:

print(sum((pred_probs > 0.5) == y) / len(y)) 
# [0.527]
clf.score(x, y)   
# 0.527

The Havel-Hakimi Algorithm

Now, let's concentrate on a problem from graph theory. Given the degree sequence of an undirected graph, we have to determine whether the sequence is graphic or not. If it is graphic, we have to draw a graph with the same degree sequence as the input. We shall use the Havel-Hakimi algorithm to accomplish this.

By the Havel-Hakimi theorem from https://d3gt.com/unit.html?havel-hakimi, we have the following:

from which we have the following algorithm:

Iteratively execute the following steps

  • Sort the degree sequence list in non-increasing order.
  • Extract (remove) the max-degree vertex of degree d from the sorted degree-sequence list.
  • Decrement the degrees of the next d vertices in the list.
  • If there are not enough vertices in the list, or the degree of some vertex becomes negative, return False.
  • Stop and return True if a list of all zeros remains.

Here is how we can implement the above Havel-Hakimi algorithm to draw a simple graph, given its degree sequence, provided such a graph exists:

import networkx as nx
import numpy as np
import matplotlib.pyplot as plt # for drawing the graphs below

def get_adj_list(deg_seq):        
    deg_seq = np.array(deg_seq)
    labels = np.arange(len(deg_seq))
    adj_list = {}
    while True:
        indices = np.argsort(-deg_seq)
        deg_seq = deg_seq[indices]
        labels = labels[indices]
        if all(deg_seq == 0):
            return adj_list     
        v = deg_seq[0]            
        if v > len(deg_seq[1:]):
            return None             
        for i in range(1,v+1):
            deg_seq[i] -= 1
            if deg_seq[i] < 0:
                return None
            adj_list[labels[0]] = adj_list.get(labels[0], []) + [labels[i]]            
        deg_seq[0] = 0            
    return None     
 
deg_seqs = [[5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3], 
            [6, 3, 3, 3, 3, 2, 2, 2, 2, 2,1,1],
            [4,3,2,1],
            [3, 3, 3, 3]]

for deg_seq in deg_seqs:
    adj_list = get_adj_list(deg_seq)
    if adj_list:
        print('The graph with {} can be drawn'.format(deg_seq))
        #print(adj_list)
        G = nx.from_dict_of_lists(adj_list)#.to_undirected()
        nx.draw(G, pos=nx.spring_layout(G), with_labels=True)
        plt.show()
    else:
        print('The graph with {} can\'t be drawn'.format(deg_seq))    

The following animations show how the graphs are drawn with the algorithm, given the corresponding degree-sequence lists.

Graph with degree sequence [5, 4, 4, 4, 4, 4, 3, 3, 3, 3, 3]:
Graph with degree sequence [6, 3, 3, 3, 3, 2, 2, 2, 2, 2,1,1]:

Graph with degree sequence [3, 3, 3, 3]:

Encoding / Decoding with Prüfer sequences

Again we shall concentrate on implementing another graph theory algorithm, one that uniquely encodes a labelled tree into a sequence and decodes it back. Let's say we want to construct all possible spanning trees of the complete graph on n vertices, i.e., Kn. By Cayley's theorem, we can count the number of different possible spanning trees of Kn – there are exactly n^(n-2) of them. Using Prüfer sequences, we can show that there are exactly as many Prüfer sequences of length n-2 as there are spanning trees, encode a spanning tree into a unique sequence and decode it back.

Prüfer Encoding

The following algorithm can be used to generate the Prüfer code (of length n-2) given a tree on n vertices – there is a one-to-one mapping between the Prüfer sequences and the spanning trees of Kn (the complete graph with n vertices): while more than two vertices remain, repeatedly remove the leaf with the smallest label and append the label of its neighbour (its parent) to the sequence.

The following python code implements the above algorithm:

def get_degree_sequence(tree):
    ds, nbrs = {}, {}
    for node in tree:
        for nbr in tree[node]:
            ds[node] = ds.get(node, 0) + 1
            ds[nbr] = ds.get(nbr, 0) + 1
            nbrs[node] = nbrs.get(node, []) + [nbr]
            nbrs[nbr] = nbrs.get(nbr, []) + [node]
    return ds, nbrs

def get_prufer_seq(tree):
    ds, nbrs = get_degree_sequence(tree)
    seq = []
    while len(ds) > 2:
        min_leaf = min(list(filter(lambda x: ds[x] == 1, ds)))
        parent = nbrs[min_leaf][0]
        seq.append(parent)
        ds[parent] -= 1
        del ds[min_leaf]
        del nbrs[min_leaf]
        nbrs[parent].remove(min_leaf)
    return seq

Invoke the function with the input tree to get the Prüfer code corresponding to the tree, as shown below:

T = {1:[2, 4, 5], 2:[7], 7:[3], 3:[6]}
print(get_prufer_seq(T))
# [1, 1, 2, 7, 3]

The following animation shows the steps in Prüfer code generation:

Prüfer Decoding

We can use the following algorithm (from here) to generate a spanning tree (on n vertices) of Kn, given the corresponding Prüfer sequence (of length n−2): repeatedly take the first element u of the sequence and the smallest-labelled vertex v that does not appear in the remaining sequence, add the edge (u, v), then remove u from the sequence and v from the vertex set; finally connect the last two remaining vertices. There always exists a 1-1 mapping between the set of all spanning trees of the labeled graph Kn (there are n^(n−2) of them by Cayley's theorem) and the set of all Prüfer sequences of length n−2.

The following python code implements the above algorithm:

def get_tree(S):
    n = len(S)
    L = set(range(1, n+2+1))
    tree_edges = []
    for i in range(n):
        u, v = S[0], min(L - set(S))
        S.pop(0)
        L.remove(v)
        tree_edges.append((u,v))
    tree_edges.append((L.pop(), L.pop()))
    return tree_edges

Invoking the above function on a Prüfer sequence of length 9, we can obtain the corresponding spanning tree of K11, as shown below:

S = [6,2,2,6,2,5,10,9,9]
T_E = get_tree(S)

The next figure shows the final tree:

The following animation shows the tree-building steps:

Generate all Spanning Trees of Kn

We can use Prüfer sequences (of length n−2) to generate all the labeled spanning trees of Kn, using the above decoding algorithm; by Cayley's theorem the number of spanning trees is n^(n−2).

For n=5, there are 5^3=125 such spanning trees on 5 labeled vertices, as can be computed using the above algorithm and seen from the following animation:
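For instance, we can enumerate all of them by decoding every possible length-3 sequence over {1, …, 5} with the get_tree() function defined above:

from itertools import product

n = 5
all_trees = [get_tree(list(s)) for s in product(range(1, n+1), repeat=n-2)]
print(len(all_trees))
# 125 (= 5^3, as per Cayley's theorem)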

Generate Random Trees

Using Prüfer sequences again, here is how we can generate a 20-node labeled random tree (in python):

  1. start from a randomly generated length-18 sequence, with each element chosen from the set {1,2,…,20}.
  2. use the generated sequence as the Prüfer sequence of a spanning tree of the complete graph K20 on 20 vertices and generate the corresponding labeled tree (since there is always a 1-1 correspondence) using the above decoding algorithm (from here).

Now, we can always generate a Prüfer sequence (of length n-2) randomly and subsequently generate the corresponding spanning tree (on n vertices) with the function get_tree(), which can serve as our random tree (it can be thought of as randomly sampled from the set of n^(n-2) spanning trees of Kn).

n = 20 # Kn with n vertices
N = 25 # generate 25 random trees with 20 vertices (as spanning trees of K20)
for i in range(N):
    S = np.random.choice(range(1,n+1), n-2, replace=True).tolist()
    T_E = get_tree(S) # the spanning tree corresponding to S
    # plot the tree generated (with `networkx`, e.g.,)

The next animation shows a few such randomly generated labeled trees with 20 nodes.

Construction of an LR(0) Parser

In this problem we shall try to implement an algorithm from compiler construction in python. LR parsing is a non-backtracking shift-reduce bottom-up parsing technique: it scans the input string from left to right and constructs a rightmost derivation in reverse.

  • The general idea is to shift some symbols of input to the stack until a reduction can be applied
  • At each reduction step, if a specific substring is matched, then the body of a production is replaced by the Non Terminal at the head of the production
  • The LR parser consists of 1) Input, 2) Output, 3) Stack, 4) Driver Program and 5) Parsing Table, as shown in the next figure
  • The Driver Program is same for all LR Parsers.
  • Only the Parsing Table changes from one parser to the other.
  • The difference between the individual parsers in the class of bottom-up LR parsers is in how many shift/reduce or reduce/reduce conflicts they produce when generating the parsing tables. The fewer the conflicts, the more powerful the parser (LR(0) < SLR(1) < LALR(1) < CLR(1)).
  • The LR Shift-Reduce Parsers can be efficiently implemented by computing an LR(0) automaton and a parsing table to guide the processing.
  • The Parsing Table consists of two parts:
    • A Parsing Action Function and
    • A GOTO function.
  • We shall now focus on the implementation of the basic LR(0) parser for a few simple grammars, which does not use any lookahead and then see how it can be improved to a more powerful SLR(1) parser, with a single lookahead symbol.

For example, consider the following expression grammar:

E → E + T
E → T
T → F
T → T * F
F → ( E )
F → id

It's not LR(0), but it is SLR(1). Using the following code, we can construct the LR(0) automaton and build the parsing table (we need to augment the grammar, compute the DFA with closure, and compute the action and goto sets):

from copy import deepcopy
import pandas as pd

def update_items(I, C):
    if len(I) == 0:
        return C
    for nt in C:
        Int = I.get(nt, [])
        for r in C.get(nt, []):
            if not r in Int:
                Int.append(r)
        I[nt] = Int
    return I

def compute_action_goto(I, I0, sym, NTs): 
    #I0 = deepcopy(I0)
    I1 = {}
    for NT in I:
        C = {}
        for r in I[NT]:
            r = r.copy()
            ix = r.index('.')
            #if ix == len(r)-1: # reduce step
            if ix >= len(r)-1 or r[ix+1] != sym:
                continue
            r[ix:ix+2] = r[ix:ix+2][::-1]    # read the next symbol sym
            C = compute_closure(r, I0, NTs)
            cnt = C.get(NT, [])
            if not r in cnt:
                cnt.append(r)
            C[NT] = cnt
        I1 = update_items(I1, C)
    return I1

def construct_LR0_automaton(G, NTs, Ts):
    I0 = get_start_state(G, NTs, Ts)
    I = deepcopy(I0)
    queue = [0]
    states2items = {0: I}
    items2states = {str(to_str(I)):0}
    parse_table = {}
    cur = 0
    while len(queue) > 0:
        id = queue.pop(0)
        I = states2items[id]
        # compute goto set for non-terminals
        for NT in NTs:
            I1 = compute_action_goto(I, I0, NT, NTs) 
            if len(I1) > 0:
                state = str(to_str(I1))
                if not state in items2states:
                    cur += 1
                    queue.append(cur)
                    states2items[cur] = I1
                    items2states[state] = cur
                    parse_table[id, NT] = cur
                else:
                    parse_table[id, NT] = items2states[state]
        # compute actions for terminals similarly
        # ... ... ...
                    
    return states2items, items2states, parse_table
        
states2items, items2states, parse_table = construct_LR0_automaton(G, NTs, Ts)

where the grammar G, non-terminal and terminal symbols are defined as below

G = {}
NTs = ['E', 'T', 'F']
Ts = {'+', '*', '(', ')', 'id'}
G['E'] = [['E', '+', 'T'], ['T']]
G['T'] = [['T', '*', 'F'], ['F']]
G['F'] = [['(', 'E', ')'], ['id']]

Here are a few more useful functions I implemented, along with the above ones, for LR(0) parsing table generation:

def augment(G, S): # start symbol S
    G[S + '1'] = [[S, '$']]
    NTs.append(S + '1')
    return G, NTs

def compute_closure(r, G, NTs):
    S = {}
    queue = [r]
    seen = []
    while len(queue) > 0:
        r = queue.pop(0)
        seen.append(r)
        ix = r.index('.') + 1
        if ix < len(r) and r[ix] in NTs:
            S[r[ix]] = G[r[ix]]
            for rr in G[r[ix]]:
                if not rr in seen:
                    queue.append(rr)
    return S

The following figure shows the LR(0) DFA constructed for the grammar using the above code:

The following table shows the LR(0) parsing table generated as a pandas dataframe; notice that there are a couple of shift/reduce conflicts, indicating that the grammar is not LR(0).

An SLR(1) parser avoids the above shift/reduce conflicts by reducing only if the next input token is a member of the Follow Set of the nonterminal being reduced. The following parse table is generated by SLR:
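Though not shown above, the SLR(1) construction needs those FOLLOW sets. Here is a minimal sketch of how they could be computed for the grammar representation used above (an illustration, assuming no ε-productions, which holds for our grammars; compute_first / compute_follow are hypothetical helper names, not part of the code above):

def compute_first(G, Ts):
    # FIRST sets, computed iteratively to a fixed point
    # (assumes no ε-productions, true for the grammars used here)
    first = {t: {t} for t in Ts}
    for nt in G:
        first.setdefault(nt, set())
    changed = True
    while changed:
        changed = False
        for nt in G:
            for rhs in G[nt]:
                f = first.get(rhs[0], set())
                if not f <= first[nt]:
                    first[nt] |= f
                    changed = True
    return first

def compute_follow(G, Ts, S):
    first = compute_first(G, Ts)
    follow = {nt: set() for nt in G}
    follow[S].add('$')  # the end-marker follows the start symbol
    changed = True
    while changed:
        changed = False
        for nt in G:
            for rhs in G[nt]:
                for i, sym in enumerate(rhs):
                    if sym not in follow:  # only non-terminals have FOLLOW sets
                        continue
                    # FIRST of what comes right after sym, else FOLLOW of the LHS
                    f = first.get(rhs[i+1], {rhs[i+1]}) if i+1 < len(rhs) else follow[nt]
                    if not f <= follow[sym]:
                        follow[sym] |= f
                        changed = True
    return follow

print(compute_follow(G, Ts, 'E'))
# e.g. {'E': {'$', '+', ')'}, 'T': {'$', '+', '*', ')'}, 'F': {'$', '+', '*', ')'}}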

  • The LR driver program determines Sm, the state on top of the stack, and ai, the current input symbol.
  • It then consults Action[ Sm, ai] which can take one of four values:
    • Shift
    • Reduce
    • Accept
    • Error
  • The next code snippet implements a driver program.
def parse(input, parse_table, rules):
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    df = df.append({'stack':0, 'input':''.join(input), 'action':''}, ignore_index = True)
    i, k = 0, 0
    accepted = False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == ''.join(r):
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto)) #[-1]))
        elif action == 'acc':
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': ''.join(input[i:]), 
               'action': 'shift' if action[0] == 's' else 'accept' if action == 'acc' else 
                         'reduce by rule {}'.format(rules[int(action[-1])]['l'] + '-->' + ''.join(rules[int(action[-1])]['r']))}
        df = df.append(df2, ignore_index = True)
        k += 1
        if accepted:
            break
        
    return df

input = ['id', '*', 'id', '+', 'id', '*', 'id', '+', 'id', '$']
parse(input, parse_table, rules)
  • The following animation shows how an input expression is parsed by the driver of the above SLR(1) grammar:

But the following grammar, which accepts the strings of the form a^n c b^n, n >= 0, is LR(0):

S → A
A → a A b
A → c

# S --> A 
# A --> a A b | c
G = {}
NTs = ['S', 'A']
Ts = {'a', 'b', 'c'}
G['S'] = [['A']]
G['A'] = [['a', 'A', 'b'], ['c']]

As can be seen from the following figure, there is no conflict in the parsing table generated.

Here is how the input string a^2cb^2 (i.e., aacbb) can be parsed using the above LR(0) parse table, with the following code:

def parse(input, parse_table, rules):
    stack = [0]
    df = pd.DataFrame(columns=['stack', 'input', 'action'])
    i, accepted = 0, False
    while i < len(input):
        state = stack[-1]
        char = input[i]
        action = parse_table.loc[parse_table.states == state, char].values[0]
        if action[0] == 's':   # shift
            stack.append(char)
            stack.append(int(action[-1]))
            i += 1
        elif action[0] == 'r': # reduce
            r = rules[int(action[-1])]
            l, r = r['l'], r['r']
            char = ''
            for j in range(2*len(r)):
                s = stack.pop()
                if type(s) != int:
                    char = s + char
            if char == r:
                goto = parse_table.loc[parse_table.states == stack[-1], l].values[0]
                stack.append(l)
                stack.append(int(goto[-1]))
        elif action == 'acc':  # accept
            accepted = True
        df2 = {'stack': ''.join(map(str, stack)), 'input': input[i:], 'action': action}
        df = df.append(df2, ignore_index = True)
        if accepted:
            break
        
    return df

input = 'aacbb$'
parse(input, parse_table, rules)

The next animation shows how the input string aacbb is parsed with LR(0) parser using the above code:

Deterministic Pushdown Automata

Let's now focus on simulating an automaton from the theory of computation. The PDA (pushdown automaton) is an automaton equivalent to the CFG in language-defining power. Only the nondeterministic PDA defines all the CFLs, but the deterministic version models parsers: most programming languages have deterministic PDAs.

A PDA is described by:

  1. A finite set of states (Q).
  2. An input alphabet (Σ).
  3. A stack alphabet (Γ).
  4. A transition function (δ).
  5. A start state (q0, in Q).
  6. A start symbol (Z0, in Γ).
  7. A set of final states (F ⊆ Q).

If δ(q, a, Z) contains (p, α) among its actions, then one thing the PDA can do in state q, with a at the front of the input and Z on top of the stack, is:

  1. Change the state to p.
  2. Remove a from the front of the input (but a may be ε).
  3. Replace Z on the top of the stack by α.

More specifically, let's implement a Deterministic Pushdown Automaton (DPDA) for the DCFL L = {a^n b^n | n ≥ 1} with python.

Here is how we can implement the class DPDA for the CFL a^nb^n, using the following states, stack symbols and transition function from here:

The states shown in the above figure are:
* q = start state. We are in state q if we have seen only a's so far.
* p = we've seen at least one b and may now proceed only if the remaining inputs are b's.
* acc = final state; accept.

class DPDA:
    
    def __init__(self, trf, input, state):
        
        self.head = 0
        self.trf = {}
        self.state = str(state)
        self.input = input
        self.trf = trf
        self.stack = ['Z']
        
    def step(self):
        
        a = self.input[self.head]
        s = self.stack.pop()
        state, ss = self.trf.get((self.state, a, s))
        if ss != 'ε':
            for s in ss[::-1]:
                self.stack.append(s)
        self.state = state
        print('{:20s} [{:10s}] {:5s}'.format(self.input[self.head:], 
                       ''.join(self.stack), self.state))        
        self.head += 1
    
    def run(self):
        
        print('{:20s} [{:10s}] {:5s}'.format(self.input[self.head:], 
                              ''.join(self.stack), self.state))
        
        while self.head  < len(self.input):
            self.step()

        s = self.stack.pop()        
        if self.trf.get((self.state, 'ε', s)):
            state, ss = self.trf.get((self.state, 'ε', s))
            self.state = state        
            print('{:20s} [{:10s}] {:5s}'.format('ε', 
                 ''.join(self.stack), self.state))
        
# run DPDA to accept the input string a^9b^9
DPDA({('q', 'a', 'Z'): ('q', 'XZ'),
     ('q', 'a', 'X'): ('q', 'XX'),
     ('q', 'b', 'X'): ('p', 'ε'),
     ('p', 'b', 'X'): ('p', 'ε'),
     ('p', 'ε', 'Z'): ('acc', 'Z'),
    }, 
    'aaaaaaaaabbbbbbbbb', 'q').run()

#input                #stack       #state
#aaaaaaaaabbbbbbbbb   [Z         ] q    
#aaaaaaaaabbbbbbbbb   [ZX        ] q    
#aaaaaaaabbbbbbbbb    [ZXX       ] q    
#aaaaaaabbbbbbbbb     [ZXXX      ] q    
#aaaaaabbbbbbbbb      [ZXXXX     ] q    
#aaaaabbbbbbbbb       [ZXXXXX    ] q    
#aaaabbbbbbbbb        [ZXXXXXX   ] q    
#aaabbbbbbbbb         [ZXXXXXXX  ] q    
#aabbbbbbbbb          [ZXXXXXXXX ] q    
#abbbbbbbbb           [ZXXXXXXXXX] q    
#bbbbbbbbb            [ZXXXXXXXX ] p    
#bbbbbbbb             [ZXXXXXXX  ] p    
#bbbbbbb              [ZXXXXXX   ] p    
#bbbbbb               [ZXXXXX    ] p    
#bbbbb                [ZXXXX     ] p    
#bbbb                 [ZXXX      ] p    
#bbb                  [ZXX       ] p    
#bb                   [ZX        ] p    
#b                    [Z         ] p    
#ε                    [Z         ] acc  

The next animation shows how the string is accepted by the above DPDA:

Simplification of a Boolean function with K-map

In this problem, we shall aim at implementing a greedy version of the K-map algorithm to represent a Boolean function of 4 variables in SOP (sum of products) form with a minimum number of terms. The function accepts the Boolean function in SOP form along with the names of the variables, and returns a simplified, reduced representation. Basically, we need to create rectangular groups containing a number of terms that is a power of two (8, 4 or 2) and try to cover as many elements as possible with each group (we need to cover all the ones).

For example, the function

f(A,B,C,D)=A’B’C’D+A’B’CD+A’BC’D’+A’BC’D+A’BCD’+AB’C’D’+
AB’C’D+ABC’D’+ABCD’+ABCD

can be represented as

f(A,B,C,D)=∑(1,3,4,5,6,8,9,12,14,15).

As can be seen from the output of the next code snippet, the program outputs the simplified form BD’ + A’BC’ + AB’C’ + ABC + A’B’D, where the negation of a boolean variable A is represented as A’ (equivalently, as ¬A in the code).

from collections import defaultdict
from itertools import permutations, product
    
def kv_map(sop, vars):
    
    sop = set(sop)
    not_covered = sop.copy()
    sop_covered = set([])
    
    mts = [] # minterms
    
    # check for minterms with 1 variable
    all_3 = [''.join(x) for x in product('01', repeat=3)]
    for i in range(4):
        for v_i in [0,1]:
            if len(not_covered) == 0: continue
            mt = ('' if v_i else '¬') + vars[i]
            s = [x[:i]+str(v_i)+x[i:] for x in all_3]
            sop1 = set(map(lambda x: int(x,2), s))
            if len(sop1 & sop) == 8 and len(sop_covered & sop1) < 8: # if not already covered
                mts.append(mt)
                sop_covered |= sop1
                not_covered = not_covered - sop1
    if len(not_covered) == 0:
        return mts
    
    # check for minterms with 2 variables
    all_2 = [''.join(x) for x in product('01', repeat=2)]
    for i in range(4):
        for j in range(i+1, 4):
            for v_i in [0,1]:
                for v_j in [0,1]:
                    if len(not_covered) == 0: continue
                    mt = ('' if v_i else '¬') + vars[i] + ('' if v_j else '¬') + vars[j]
                    s = [x[:i]+str(v_i)+x[i:] for x in all_2]
                    s = [x[:j]+str(v_j)+x[j:] for x in s]
                    sop1 = set(map(lambda x: int(x,2), s))
                    if len(sop1 & sop) == 4 and len(sop_covered & sop1) < 4: # if not already covered
                        mts.append(mt)
                        sop_covered |= sop1
                        not_covered = not_covered - sop1
    if len(not_covered) == 0:
        return mts

    # check for minterms with 3 variables similarly (code omitted)
    # ... ... ...
    
    return mts
    
mts = kv_map([1,3,4,5,6,8,9,12,14,15], ['A', 'B', 'C', 'D'])
print(mts)
# ['B¬D', '¬AB¬C', 'A¬B¬C', 'ABC', '¬A¬BD']

The following animation shows how the above code (greedily) simplifies the Boolean function given in SOP form (the basic goal is to cover all the 1s with a minimum number of power-of-2 blocks).

Since the algorithm is greedy, it may get stuck in a local minimum, something we need to be careful about (the implementation can be improved!).

Newton-Raphson

Now, let’s implement a very popular algorithm for numerical analysis, namely the Newton-Raphson algorithm, with which roots of functions can be found. The following figure shows the iterative update steps for the algorithm to compute a root of a function f(x). We start with an initial guess and then go by the tangent at the current location to obtain the next guess and finally converge to the root.

Now let's say we want to approximate 1/a for an integer a. In order to compute 1/a for a given a, we can try to find the root of f(x) = a − 1/x = 0, for which f′(x) = 1/x², and we have the following iterative update equation, applied till convergence:
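x ← x − f(x)/f′(x) = x − (a − 1/x)·x² = x·(2 − a·x)

Notice that the update involves no division at all (only multiplications), and its fixed point x = 1/a indeed satisfies x·(2 − a·x) = x.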

Use the following python code to implement the above algorithm:

def f(x):
    return a - 1/x

def df(x):
    return 1/x**2

def newton_raphson(f, df, x, ϵ): 
    x1 = x
    while True:
        x1 -= f(x) / df(x)
        if abs(x1 - x) < ϵ: # converged?
            break
        x = x1
    return x

a = 3
ϵ= 1e-6 # accuracy
x = 0.6 # initial guess
newton_raphson(f, df, x, ϵ)
# 0.3333331240966088

The following animations show how the algorithm converges with the output 0.33333:

Double Hashing

Now, let's try to implement a hashing algorithm from data structures. In general, here is how we resolve a collision with double hashing: if there is a collision, use the second hash function to compute the offsets for probing the subsequent locations in the hash table T, as shown below:

If there is collision even using the composite hash function, the probing continues for |T| times and if still the collision is not resolved, the element can’t be inserted into the hash table using double hashing (use separate-chaining etc. in this case).
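Concretely, the probe sequence implemented in the code below is the standard double-hashing scheme

index_i = ( h1(k) + i · h2(k) ) mod m,   for i = 0, 1, 2, …, m−1,

where h1 and h2 are the two hash functions and m = |T| is the size of the hash table.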

Now, let’s implement double hashing with the following code:

def h1(k, m):
    return (2*k+3)%m

def h2(k, m):
    return (3*k+1)%m

def resolve_collision_with_double_hashing(hashtable, keys, m, h1, h2):
    for k in keys:
        index = h1(k, m)
        if not hashtable[index]: # no collision
            hashtable[index] = k
        else:         # resolve the collision with double-hashing
            v = h2(k, m)
            inserted = False
            i = 1 # no need to check for i = 0, since collision already occurred
            while i < m: 
                index1 =  (index +  v * i) % m
                i += 1
                print('trying to insert {}, number of probings: {}'.format(k, i))
                if not hashtable[index1]:
                    hashtable[index1], inserted = k, True
                    print('inserted {}'.format(k))
                    break
            if not inserted:
                print('could not insert {}'.format(k))
    print('hash table: ' + ' '.join(map(lambda x: str(x) if x else '_', hashtable)))

m = 11
hashtable = [None]*m
keys = [3,2,9,6,11,13,7,1,12,22]
resolve_collision_with_double_hashing(hashtable, keys, m, h1, h2)

# trying to insert 13, number of probings: 2
# trying to insert 13, number of probings: 3
# trying to insert 13, number of probings: 4
# inserted 13
# trying to insert 7, number of probings: 2
# trying to insert 7, number of probings: 3
# trying to insert 7, number of probings: 4
# trying to insert 7, number of probings: 5
# trying to insert 7, number of probings: 6
# trying to insert 7, number of probings: 7
# trying to insert 7, number of probings: 8
# trying to insert 7, number of probings: 9
# trying to insert 7, number of probings: 10
# trying to insert 7, number of probings: 11
# could not insert 7
# trying to insert 12, number of probings: 2
# trying to insert 12, number of probings: 3
# inserted 12
# trying to insert 22, number of probings: 2
# trying to insert 22, number of probings: 3
# trying to insert 22, number of probings: 4
# trying to insert 22, number of probings: 5
# trying to insert 22, number of probings: 6
# inserted 22
# hash table: _ _ 12 11 6 1 13 2 22 3 9

The following animation shows how the keys are inserted in the hash table and collision resolution was attempted using double hashing:

The CYK Algorithm

Let’s focus again on another famous problem from compilers, which is known as the membership problem:

  • Given a context-free grammar G and a string w, where
    • G = (V, ∑, P, S), with
      • V: a finite set of variables (non-terminals)
      • ∑ (the alphabet): a finite set of terminal symbols
      • P: a finite set of rules (productions)
      • S: the start symbol (a distinguished element of V)
      • V and ∑ assumed to be disjoint
    • G is used to generate the strings of a language
  • The question we try to answer is the following: is w in L(G)?

The CYK algorithm solves the membership problem for a CFG, with the following assumption:

  • The Structure of the rules in a Chomsky Normal Form (CNF) grammar
    • Context-free grammar is in CNF if each rule has one of the following forms:
      • A –> BC at most 2 symbols (variables / non-terminals) on right side
      • A –> a, or terminal symbol
      • S –> λ, null string
      • here B, C Є V – {S}
  • It uses a “dynamic programming” or “table-filling algorithm”

The following section describes the algorithm steps:

1. Given an input string w of length n, construct a table DP for size n × n.
2. If w = ϵ (the empty string) and S -> ϵ is a rule in G, then we accept the string, else we reject.
3. For i = 1 to n:
     For each variable A:
       We check if A -> b is a rule and b = wi for some i:
        If so, we place A in cell (i, i) of our table. 
4. For l = 2 to n:
     For i = 1 to n-l+1:
       j = i+l-1
       For k = i to j-1:
          For each rule A -> BC: 
       We check if (i, k) cell contains B and (k + 1, j) cell contains C:
           If so, we put A in cell (i, j) of our table. 
5. We check if S is in (1, n):
   If so, we accept the string
   Else, we reject.

The following python code shows how to implement the algorithm for a given CFG, and how it solves the membership problem given an input string:

Let’s use the above implementation of algorithm for the following simple CFG G (already in CNF):

S -> AB | BC

A -> BA | a

B -> CC | b

C -> AB | a

and the input string w = baaba to test membership of w in L(G).

# first use a data structure to store the given CFG

def get_grammar(rules):
    G = {}
    for rule in rules:
        rule = rule.replace(' ', '')
        lhs, rhs = rule.split('->')
        for r in rhs.split('|'):
            G[lhs] = G.get(lhs, []) + [r]
    return G

NTs = ['S', 'A', 'B', 'C', 'D']
Ts = ['a', 'b']
rules = ['S -> AB | BC', 'A -> BA | a', 'B -> CC | b', 'C -> AB | a'] #, 'D -> ϵ']
G = get_grammar(rules)
print(G)
# {'S': ['AB', 'BC'], 'A': ['BA', 'a'], 'B': ['CC', 'b'], 'C': ['AB', 'a']}

# now check if the grammar is in the Chomsky normal form (CNF):
def is_in_CNF(G, NTs, Ts):
    for lhs in G.keys():
        if lhs in NTs:
            for rhs in G[lhs]:
                if len(rhs) == 2:   # of the form A -> BC
                    if not rhs[0] in NTs or not rhs[1] in NTs:
                        return False
                elif len(rhs) == 1: # of the form S -> a
                    if not rhs in Ts + ['ϵ']:
                        return False
                else:
                    return False
    return True

is_in_CNF(G, NTs, Ts)
# True
import numpy as np
import pandas as pd

def is_in_cartesian_prod(x, y, r):
    return r in [i+j for i in x.split(',') for j in y.split(',')]

def accept_CYK(w, G, S):
    if w == 'ϵ':
        return 'ϵ' in G[S]
    n = len(w)
    DP_table = [['']*n for _ in range(n)]
    for i in range(n):
        for lhs in G.keys():
            for rhs in G[lhs]:
                 if w[i] == rhs: # rules of the form A -> a
                    DP_table[i][i] = lhs if not DP_table[i][i] else DP_table[i][i] + ',' + lhs
                    
    for l in range(2, n+1):       # span
        for i in range(n-l+1):    # start
            j = i+l-1                    # right
            for k in range(i, j):     # partition
                for lhs in G.keys():
                    for rhs in G[lhs]:
                        if len(rhs) == 2: #rules of form A -> BC
                            if is_in_cartesian_prod(DP_table[i][k], DP_table[k+1][j], rhs):
                                if not lhs in DP_table[i][j]:
                                    DP_table[i][j] = lhs if not DP_table[i][j] else DP_table[i][j] + ',' + lhs

    return S in DP_table[0][n-1]  

accept_CYK('baaba', G, 'S')
# True

The following animation shows how the algorithm constructs the dynamic programming table for the following simple grammar:

CRC Polynomial Division

Let's now concentrate on how to encode data with an error-checking code, namely the cyclic redundancy code (CRC), in order to detect any error introduced at the time of communication. We need to follow the algorithm given by the steps listed below: given the CRC generating polynomial and the data polynomial, compute the bits to be appended to the data to encode it at the transmitter side; then, at the receiver end, use the appended bits to check whether any data bit got corrupted in the communication channel:

  • Convert the CRC / data polynomials to their corresponding binary representations.
  • If the CRC key (the binary representation obtained from the polynomial) has k bits, we need to pad an additional k-1 bits to the data to check for errors. In the example given, the bits 011 should be appended to the data, not 0011, since k=4.
  • At the transmitter end
    • The binary data is first augmented by adding k-1 zeros at the end.
    • Use modulo-2 binary division to divide the binary data by the CRC key and store the remainder of the division.
    • Append the remainder at the end of the data to form the encoded data and send the same.
  • At the receiver end
    • Check if errors were introduced in transmission.
    • Perform modulo-2 division again on the received data with the CRC key; if the remainder is 0, then there is no error.

Now let’s implement the above:

def CRC_polynomial_to_bin_code(pol):
    return bin(eval(pol.replace('^', '**').replace('x','2')))[2:]

def get_remainder(data_bin, gen_bin):
    ng = len(gen_bin)
    data_bin += '0'*(ng-1)
    nd = len(data_bin)
    divisor = gen_bin
    i = 0
    remainder = ''
    print('\nmod 2 division steps:')
    print('divisor dividend remainder')
    while i < nd:
        j = i + ng - len(remainder)
        if j > nd: 
            remainder += data_bin[i:]
            break
        dividend = remainder + data_bin[i:j]
        remainder = ''.join(['1' if dividend[k] != gen_bin[k] else '0' for k in range(ng)])
        print('{:8s} {:8s} {:8s}'.format(divisor, dividend, remainder[1:]))
        remainder = remainder.lstrip('0')
        i = j
    return remainder.zfill(ng-1)

gen_bin = CRC_polynomial_to_bin_code('x^3+x')
data_bin = CRC_polynomial_to_bin_code('x^11 + x^8 + x^7 + x^2 + x + 1') 
print('transmitter end:\n\nCRC key: {}, data: {}'.format(gen_bin, data_bin))
r = get_remainder(data_bin, gen_bin)
data_crc = data_bin + r
print('\nencoded data: {}'.format(data_crc))
print('\nreceiver end:')
r = get_remainder(data_crc, gen_bin)
print('\nremainder {}'.format(r))

if int(r, 2) == 0: # use int(r, 2) rather than eval(r): eval fails on binary strings with leading zeros, e.g. '010'
    print('data received at the receiver end has no errors')

# ---------------------------------
# transmitter end:
# 
# CRC key: 1010, data: 100110000111
# 
# mod 2 division steps:
# divisor dividend remainder
# 1010     1001     011     
# 1010     1110     100     
# 1010     1000     010     
# 1010     1000     010     
# 1010     1011     001     
# 1010     1100     110     
# 1010     1100     110     
# 
# encoded data: 100110000111110
# ---------------------------------
# receiver end:
# 
# mod 2 division steps:
# divisor dividend remainder
# 1010     1001     011     
# 1010     1110     100     
# 1010     1000     010     
# 1010     1000     010     
# 1010     1011     001     
# 1010     1111     101     
# 1010     1010     000     
# 
# remainder 000
# data received at the receiver end has no errors
# ---------------------------------

Shortest Remaining Time Process Scheduling Algorithm

Let's now try to implement a preemptive process scheduling algorithm, namely the shortest remaining time next (SRTN) scheduling algorithm, for assigning a process to the CPU (this one is from operating systems). It is the preemptive version of the Shortest Job First algorithm: at any point in time, the process with the smallest amount of time remaining until completion is selected to execute. A process running on the CPU is preempted by a new process iff the latter has a smaller remaining execution time than the current one.

When a new process arrives whose execution (burst) time is less than the remaining completion time of the currently executing process, a context switch happens: the current process is removed from the CPU and the newly arrived process starts its execution on the CPU.

The Gantt chart is used to show the scheduling of the processes to the CPU.

We can implement the algorithm for preemptive shortest remaining time next scheduling using the following python function and simulate the execution of the processes on CPU, given the process arrival and burst times as a data frame.

import pandas as pd

def SRTN(df): # df is the data frame with arrival / burst time of processes

    queue = []
    cpu, cur_pdf = None, None
    alloc, dalloc = {}, {}

    time = 0

    while True: # simulate the CPU scheduling algorithm

        # check if all processes finished execution
        if df['RemainingTime'].max() == 0:
            break

        # get current process assigned to cpu, if any
        if cpu:
            cur_pdf =  df[df.Process == cpu]    

        # check if a process arrived at this time instance and put it into wait queue
        pdf = df[df.ArrivalTime == time]

        if len(pdf) > 0:
            for p in pdf['Process'].values:
                queue.append(p)

        if len(queue) > 0:
            pdf = df[df['Process'].isin(queue)]

            # find the process with shortest remaining time
            if len(pdf) > 0:
                pdf = pdf[pdf['RemainingTime']==pdf['RemainingTime'].min()]

            # allocate a process to CPU, pre-empt the running one if required
            if (cpu is None) or (len(pdf) > 0 and pdf['RemainingTime'].values[0] < cur_pdf['RemainingTime'].values[0]):
                if cpu:
                    # prempt the current process
                    dalloc[cpu] = dalloc.get(cpu, []) + [time]
                    queue.append(cpu)
                    print('Process {} deallocated from CPU at time {}'.format(cpu, time))
                cur_pdf = pdf
                cpu = cur_pdf['Process'].values[0]
                queue.remove(cpu)
                print('Process {} allocated to CPU at time {}'.format(cpu, time))
                alloc[cpu] = alloc.get(cpu, []) + [time]

        df.loc[df['Process']==cpu,'RemainingTime'] -= 1

        time += 1 # increment timer

        # deallocate the running process once it has finished execution
        if cpu and df[df['Process']==cpu]['RemainingTime'].values[0] == 0:
            print('Process {} deallocated from CPU at time {}'.format(cpu, time))
            dalloc[cpu] = dalloc.get(cpu, []) + [time]
            cpu = cur_pdf = None
            
    return alloc, dalloc

Now, run SRTN on the following data (process arrival / burst times):

df = pd.DataFrame({'Process':['A','B','C','D'], 'BurstTime':[3,5,3,2], 'ArrivalTime':[0,2,5,6]})
df.sort_values('ArrivalTime', inplace=True)
df['RemainingTime'] = df.BurstTime

df
alloc, dalloc = SRTN(df)
# Process A allocated to CPU at time 0
# Process A deallocated from CPU at time 3
# Process B allocated to CPU at time 3
# Process B deallocated from CPU at time 8
# Process D allocated to CPU at time 8
# Process D deallocated from CPU at time 10
# Process C allocated to CPU at time 10
# Process C deallocated from CPU at time 13
 
# alloc
# {'A': [0], 'B': [3], 'D': [8], 'C': [10]}
# dalloc
# {'A': [3], 'B': [8], 'D': [10], 'C': [13]}

The following animation shows the Gantt chart for the preemptive SRTN scheduling algorithm:

Let's consider the following input table with the arrival / burst times of 3 processes:

alloc, dalloc = SRTN(df)
# Process A allocated to CPU at time 0
# Process A deallocated from CPU at time 1
# Process B allocated to CPU at time 1
# Process B deallocated from CPU at time 5
# Process A allocated to CPU at time 5
# Process A deallocated from CPU at time 11
# Process C allocated to CPU at time 11
# Process C deallocated from CPU at time 19

The Gantt chart corresponding to the above table is shown in the following animation:

Determine if a Relation is in BCNF

Let's now focus on the following problem from DBMS: given a relation and a set of functional dependencies, determine whether the relation is in Boyce-Codd Normal Form (BCNF) or not. Using the algorithm to compute the closure of a given set of attributes and the definition of BCNF, as shown in the following figure, we can determine whether or not the given relation (with its associated set of FDs) is in BCNF.

We can implement the above algorithm in python

  • to compute the closure of a given set of attributes, and
  • then determine whether they form a superkey (i.e., whether the closure yields all the attributes of the relation) or not, as shown in the following code snippet:
def closure(s, fds):
    c = s
    for f in fds:
        l, r = f[0], f[1]
        if l.issubset(c):
            c = c.union(r)
    if s != c:
        c = closure(c, fds)
    return c

def is_superkey(s, rel, fds):
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return c == rel

Now, to determine whether R is in BCNF, check for each given functional dependency A -> B of the relation R whether A is a superkey:

def is_in_BCNF(rel, fds):
        for fd in fds:
            l, r = fd[0], fd[1]
            isk = is_superkey(l, rel, fds)
            print(f'For the Functional Dependency {"".join(sorted(l))} -> {"".join(sorted(r))}, ' +\
                  f'{"".join(sorted(l))} {"is" if isk else "is not"} a superkey')
            if not isk:
                print('=> R not in BCNF!')
                return False
        print('=> R in BCNF!')   
        return True

To parse the FDs given in standard form and convert them to a suitable data structure, we can use the following function:

import re

def process_fds(fds):
    pfds = []
    for fd in fds:
        fd = re.sub(r'\s+', '', fd)
        l, r = fd.split('->')
        pfds.append([set(list(l)), set(list(r))])
    return pfds

Now, let’s test with a few relations (and associated FDs):

relation = {'U','V','W','X','Y','Z'}
fds = process_fds(['UVW->X', 'VW->YU', 'VWY->Z'])
is_in_BCNF(relation, fds)

# (UVW)+ = UVWXYZ
# For the Functional Dependency UVW -> X, UVW is a superkey
# (VW)+ = UVWXYZ
# For the Functional Dependency VW -> UY, VW is a superkey
# (VWY)+ = UVWXYZ
# For the Functional Dependency VWY -> Z, VWY is a superkey
# => R in BCNF!

relation = {'A','B','C'}
fds = process_fds(['A -> BC', 'B -> A'])
is_in_BCNF(relation, fds)

# (A)+ = ABC
# For the Functional Dependency A -> BC, A is a superkey
# (B)+ = ABC
# For the Functional Dependency B -> A, B is a superkey
# => R in BCNF!

relation = {'A','B','C', 'D'}
fds = process_fds(['AC -> D', 'D -> A', 'D -> C', 'D -> B'])
is_in_BCNF(relation, fds)

# (AC)+ = ABCD
# For the Functional Dependency AC -> D, AC is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> A, D is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> C, D is a superkey
# (D)+ = ABCD
# For the Functional Dependency D -> B, D is a superkey
# => R in BCNF!

relation = {'A','B','C', 'D', 'E'}
fds = process_fds(['BCD -> E', 'BDE -> C', 'BE -> D', 'BE -> A'])
is_in_BCNF(relation, fds)

# (BCD)+ = ABCDE
# For the Functional Dependency BCD -> E, BCD is a superkey
# (BDE)+ = ABCDE
# For the Functional Dependency BDE -> C, BDE is a superkey
# (BE)+ = ABCDE
# For the Functional Dependency BE -> D, BE is a superkey
# (BE)+ = ABCDE
# For the Functional Dependency BE -> A, BE is a superkey
# => R in BCNF!

relation = {'A','B','C','D','E'}
fds = process_fds(['BC->D', 'AC->BE', 'B->E'])
is_in_BCNF(relation, fds)

# (BC)+ = BCDE
# For the Functional Dependency BC -> D, BC is not a superkey
# => R not in BCNF!

Check if a Relation decomposition is lossless

Let’s now focus on another problem from DBMS: given a relation (along with FDs) and the decompositions, check if the decomposition is lossless or not.

As described here, decomposition of R into R1 and R2 is lossless if

  1. Attributes(R1) U Attributes(R2) = Attributes(R)
  2. Attributes(R1) ∩ Attributes(R2) ≠ Φ
  3. Common attribute must be a key for at least one relation (R1 or R2)

with the assumption that we are not considering the trivial cases where all tuples of R1 / R2 are unique (in which case any decomposition is lossless, since spurious tuples can’t be created upon joining), so that (2) holds under this non-trivial assumption.

We can check the above condition with the following python code snippet:

def is_superkey(s, rel, fds):
    # overrides the earlier version: for a decomposed relation rel, the closure
    # (computed over all FDs) may contain attributes outside rel, so test containment
    c = closure(s, fds)
    print(f'({"".join(sorted(s))})+ = {"".join(sorted(c))}')
    return rel.issubset(c)

def is_lossless_decomp(r1, r2, r, fds):
    c = r1.intersection(r2)
    if r1.union(r2) != r:
        print('not lossless: R1 U R2 ≠ R!')
        return False
    if len(c) == 0:
        print('not lossless: no common attribute in between R1 and R2!')
        return False
    if not is_superkey(c, r1, fds) and not is_superkey(c, r2, fds):
        print(f'not lossless: common attribute {"".join(c)} not a key in R1 or R2!')
        return False
    print('lossless decomposition!')
    return True

Now let’s test with the above decompositions given:

r = {'A','B','C','D','E'}
fds = process_fds(['AB->E', 'C->AD', 'D->B', 'E->C'])

r1, r2 = {'A', 'C', 'D'}, {'B', 'C', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (C)+ = ACD
# lossless decomposition!

r1, r2 = {'A', 'C', 'D'}, {'A', 'B', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (A)+ = A
# not lossless: common attribute A not a key in R1 or R2!

r = {'A','B','C','D','E', 'G'}
fds = process_fds(['AB->C', 'AC->B', 'AD->E', 'B->D', 'BC->A', 'E->G'])
r1, r2 = {'A', 'D', 'G'}, {'A', 'B', 'C', 'D', 'E'} 
is_lossless_decomp(r1, r2, r, fds)
# (AD)+ = ADEG
# lossless decomposition!

The RSA Encryption Algorithm

Let’s now focus on a conceptual implementation of the RSA algorithm for (asymmetric) encryption of a message. Consider sending a message over a communication channel from a sender A to a receiver B. The RSA algorithm can be used to encrypt the message (plain-text) into a cipher-text in such a way that it is computationally hard for an eavesdropper to extract the plain-text from the cipher-text. For each of the users, the algorithm generates two keys, one being public and shared across all the users, the other being private, which must be kept secret. The algorithm relies on the assumption that prime factorization is hard: in order to break the encryption, one needs to factorize a number into two large primes, and no polynomial-time algorithm is known for this problem, hence providing the security.

The RSA algorithm can be described as follows:

  1. Choose two different large primes (here, for the purpose of demonstration, let’s choose smaller primes p=89, q=97)
  2. Compute n = p*q
  3. Compute Euler Totient φ(n) ≡ (p-1)*(q-1)
  4. Choose the public key e as coprime with φ(n), for simplicity, let’s choose e=257, which is a prime
  5. Compute the private key d, s.t. d*e ≡ 1 (mod φ(n)), using the multiplicative inverse algorithm (extended Euclidean) from here:

# Compute multiplicative inverse of a modulo n
# solution t to a*t ≡ 1 (mod n) 

def multiplicative_inverse(a, n):

    t, newt = 0, 1
    r, newr = n, a

    while newr != 0:
        #print(t, newt, r, newr)
        q = r // newr
        t, newt = newt, t - q * newt
        r, newr = newr, r - q * newr

    if t < 0:
        t = t + n

    return t

The Python code for steps 1-5 is shown below:

p, q = 89, 97 # choose large primes here
n = p*q
φ = (p-1)*(q-1)
e = 257 # choose public key e as a Fermat's prime, s.t., gcd(φ, e) = 1
d = multiplicative_inverse(e, φ) # private key
print(d)
# 4865
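As a quick sanity check (not part of the algorithm steps themselves), we can verify that d is indeed the modular multiplicative inverse of e:

print((d * e) % φ)
# 1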
  6. Encrypt the message with the receiver’s public key (e) at the sender’s end
  7. Decrypt the ciphertext received at the receiver’s end with his private key (d)

The following code shows how encryption / decryption can be done:

def rsa_encrypt(plain_text, e, n):
    # ideally we should convert the plain text to a byte array and 
    # then to a big integer which should be encrypted, but here for the sake of 
    # simplicity character-by-character encryption is done, using Python's
    # built-in modular exponentiation pow(base, exp, mod)
    cipher_text = [pow(ord(x), e, n) for x in plain_text]
    return cipher_text

def rsa_decrypt(cipher_text, d, n):
    decoded_text = ''.join([chr(pow(x, d, n)) for x in cipher_text])
    return decoded_text 

Now, let’s use the above functions for encryption / decryption:

plain_text = 'Hello world'
cipher_text = rsa_encrypt(plain_text, e, n)
print(cipher_text)
# [5527, 7221, 8242, 8242, 5243, 2381, 3611, 5243, 886, 8242, 1735]
decoded_text = rsa_decrypt(cipher_text, d, n)
decoded_text 
# Hello world

Bernstein-Vazirani Quantum Algorithm to find a hidden bit string

Now let’s focus on the following problem: given oracle access to f : {0, 1}^n → {0, 1} and a promise that f(x) ≡ s·x (mod 2), where s is a secret bit string, the algorithm should learn s with just a single query to the oracle.

Using a quantum computer, we can solve this problem with 100% confidence after only one call to the function f(x). The quantum Bernstein-Vazirani algorithm to find the hidden bit string is described as follows:

  1. Initialize the input qubits to the |0⟩⊗n state, and an auxiliary qubit to |−⟩.
  2. Apply Hadamard gates to the input register
  3. Query the oracle
  4. Apply Hadamard gates to the input register
  5. Measure

Using qiskit’s qasm simulator and the inner-product quantum oracle (parameterized by the secret bits and leveraging phase-kickback via the auxiliary qubit in state |−⟩), the BV algorithm can be implemented as follows:

import numpy as np
from qiskit.circuit import QuantumCircuit
from qiskit import Aer, execute 
from qiskit.visualization import plot_histogram

def oracle(qc, s):
    n = len(s)
    for i in range(n):
        if s[i] == '1': # phase-kickback
            qc.cx(i, n)
    
def BV_circuit(s): 
    
    s = s[::-1]
    n = len(s)

    # n-qubit quantum register + 1 auxiliary qubit with n classical registers
    qc = QuantumCircuit(n+1, n) 
    for i in range(n):
        qc.h(i)
    qc.x(n)
    qc.h(n)
    
    qc.barrier(range(n))
   
    oracle(qc, s)
    
    qc.barrier(range(n))

    for i in range(n):
        qc.h(i)
    
    qc.measure(list(range(n)), list(range(n)))
    
    return qc, execute(qc,
        Aer.get_backend('qasm_simulator'), shots=1024
    ).result().get_counts()

Running with the secret bit string 1010, the following figure shows the Hadamard sandwich and the oracle circuit.


Finally, after measurement, the algorithm always outputs the secret bit string: just 1 run is required, as opposed to the classical algorithm, which requires n runs (one query per bit) to determine the secret bits.

qc, res = BV_circuit('1010')
plot_histogram(res)

Compute Dominant Strategy, Pure and Mixed Nash Equilibriums for a 2-player Game

Let’s now focus on a classic problem in Game Theory and try to understand how to compute / check for the existence of a dominant strategy, a pure Nash equilibrium and a mixed strategy NE for a 2-player game.

To find a dominant strategy for a given player we need to check if there exists a strategy that always leads to better payoff, irrespective of the other player’s strategy.

To check whether a given pair of moves corresponds to a (pure) Nash equilibrium, we need to check whether any of the players can do better by changing his strategy or not.

The below functions provide a simple implementation for checking dominating strategy and pure Nash equilibrium for a 2-player game.

import numpy as np
import pandas as pd

def get_dominating_strategy(player, payoffs_df, other_strategies):
    other_player = 'p2' if player == 'p1' else 'p1'
    all_best_strategies = set([])
    for other_strategy in other_strategies:
        sdf = payoffs_df[payoffs_df[other_player + '_strategy'] == other_strategy]
        cur_strategy = sdf.loc[sdf[player + '_payoff'].idxmax(), player + '_strategy']
        all_best_strategies.add(cur_strategy)
    return list(all_best_strategies)[0] if len(all_best_strategies) == 1 else None

def exists_better_strategy(player, cur_strategy, cur_payoff, other_strategy, payoffs_df, strategies):
    other_player = 'p2' if player == 'p1' else 'p1'
    sdf = payoffs_df[payoffs_df[other_player + '_strategy'] == other_strategy]
    for i in sdf.index:
        if sdf.loc[i, player + '_payoff'] > cur_payoff:
            return True
    return False
            
def get_pure_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies):
    nash_equilibriums = []
    for i in range(len(payoffs_df)):
        cur_state = payoffs_df.loc[i]
        p1_payoff, p2_payoff = cur_state['p1_payoff'], cur_state['p2_payoff']
        p1_strategy, p2_strategy = cur_state['p1_strategy'], cur_state['p2_strategy']
        # (p1_strategy, p2_strategy) is a pure NE iff neither player can do
        # better by unilaterally deviating
        if exists_better_strategy('p1', p1_strategy, p1_payoff, p2_strategy, payoffs_df, p1_strategies):
            continue
        if exists_better_strategy('p2', p2_strategy, p2_payoff, p1_strategy, payoffs_df, p2_strategies):
            continue
        nash_equilibriums.append((p1_strategy, p2_strategy))
    return None if len(nash_equilibriums) == 0 else nash_equilibriums

Now, let’s test with the famous Prisoner’s Dilemma game; here we use a pandas DataFrame to store the utility table.

p1_strategies = ['confess', 'silent']
p2_strategies = ['confess', 'silent']
payoffs_PD = pd.DataFrame({'p1_strategy': ['confess', 'confess', 'silent', 'silent'],
                           'p2_strategy': ['confess', 'silent', 'confess', 'silent'],
                           'p1_payoff': [0, 3, -1, 1],
                           'p2_payoff': [0, -1, 3, 1]})

print(payoffs_PD)  # utilities
#       p1_strategy p2_strategy  p1_payoff  p2_payoff
# 0     confess     confess          0          0
# 1     confess      silent          3         -1
# 2      silent     confess         -1          3
# 3      silent      silent          1          1

ds_p1 = get_dominating_strategy('p1', payoffs_PD, p2_strategies)
ds_p2 = get_dominating_strategy('p2', payoffs_PD, p1_strategies)
print(ds_p1, ds_p2) # dominating strategy for the players
# confess confess

nes = get_pure_nash_equilibriums(payoffs_PD, p1_strategies, p2_strategies)
# pure Nash equlibriums, only 1 here
print(nes)
# [('confess', 'confess')]

Let’s test with another game with the following utilities:

p1_strategies = ['T', 'B']
p2_strategies = ['L', 'R']
payoffs_df = pd.DataFrame({'p1_strategy': ['T', 'T', 'B', 'B'],
                           'p2_strategy': ['L', 'R', 'L', 'R'],
                           'p1_payoff': [7, 0, 2, 4],
                           'p2_payoff': [6, 5, 0, 3]})

print(payoffs_df)
#   p1_strategy p2_strategy  p1_payoff  p2_payoff
# 0           T           L          7          6
# 1           T           R          0          5
# 2           B           L          2          0
# 3           B           R          4          3

ds_p1 = get_dominating_strategy('p1', payoffs_df, p2_strategies)
ds_p2 = get_dominating_strategy('p2', payoffs_df, p1_strategies)
print(ds_p1, ds_p2) # no dominating strategy exists for any of the players
# None None

nes = get_pure_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies)
print(nes) # 2 NE solutions exist
# [('T', 'L'), ('B', 'R')]

In the case of a mixed strategy equilibrium, we need to solve a linear system of equations. For example, consider the following payoff table (in each cell, player 1’s payoff appears to the left of the backslash and player 2’s payoff to the right):

#              player 2 
#              S1      S2         probability
# player 1 S1  a \ e   b \ f      p_1 
#          S2  c \ g   d \ h      p_2
# probability    q_1     q_2

Player 1 plays strategies S1 and S2 with probabilities p_1 and p_2, respectively.

Player 2 plays strategies S1 and S2 with probabilities q_1 and q_2, respectively.

Then we must have p_1 + p_2 = 1 and q_1 + q_2 = 1.

At a Nash equilibrium, player 2 becomes indifferent between his strategies S1 and S2, so his expected payoffs (computed with player 2’s payoffs e, f, g, h from the table above) for the two strategies must be equal:

p_1*e + p_2*g = p_1*f + p_2*h => p_1*(e-f) + p_2*(g-h) = 0

Hence we have the following linear system of equations to solve for (p_1, p_2) at NE:

1*p_1 + 1*p_2 = 1
(e-f)*p_1 + (g-h)*p_2 = 0

In matrix form, in order to find (p_1, p_2) at NE, we need to solve np.linalg.solve(A, b), where 
A = [[1, 1], [e-f, g-h]] and b = [1, 0].

Similarly for (q_1, q_2), we need to solve another linear system.
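For instance, here is a quick numerical sketch of solving the (p_1, p_2) system with numpy, using illustrative (hypothetical) player-2 payoffs e, f, g, h from the table above:

import numpy as np

e, f, g, h = -1, 1, 1, -1  # hypothetical player-2 payoffs
A = np.array([[1., 1.], [e - f, g - h]])
b = np.array([1., 0.])
p1, p2 = np.linalg.solve(A, b)
print(p1, p2)
# 0.5 0.5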

Generalizing, we can implement the mixed strategy NE computation as follows.

def get_mixed_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies):
    m, n = len(p1_strategies), len(p2_strategies)
    assert(m == n) # assuming player1 and player2 have same number of strategies, for simplicity
    # solve for player 1's probabilities (p_1, ..., p_m) using player 2's payoffs
    A = np.zeros((m, m))
    b = [1] + [0]*(m-1)
    A[0] = [1]*m  # p_1 + p_2 + ... + p_m = 1
    for j in range(n-1):
        for i in range(m):
            s, s1, s2 = p1_strategies[i], p2_strategies[j], p2_strategies[j+1]
            A[j+1, i] = payoffs_df.loc[(payoffs_df['p1_strategy'] == s) & (payoffs_df['p2_strategy'] == s1), 'p2_payoff'].values[0]
            A[j+1, i] -= payoffs_df.loc[(payoffs_df['p1_strategy'] == s) & (payoffs_df['p2_strategy'] == s2), 'p2_payoff'].values[0]
    ps = np.linalg.solve(A, b)
    # solve for player 2's probabilities (q_1, ..., q_n) using player 1's payoffs
    A = np.zeros((n, n))
    b = [1] + [0]*(n-1)
    A[0] = [1]*n  # q_1 + q_2 + ... + q_n = 1
    for j in range(m-1):
        for i in range(n):
            s, s1, s2 = p2_strategies[i], p1_strategies[j], p1_strategies[j+1]
            A[j+1, i] = payoffs_df.loc[(payoffs_df['p2_strategy'] == s) & (payoffs_df['p1_strategy'] == s1), 'p1_payoff'].values[0]
            A[j+1, i] -= payoffs_df.loc[(payoffs_df['p2_strategy'] == s) & (payoffs_df['p1_strategy'] == s2), 'p1_payoff'].values[0]
    qs = np.linalg.solve(A, b)
    return (ps, qs)

Now, let’s test the following game for the existence of DS, pure NE and mixed NE:

p1_strategies = ['H', 'T']
p2_strategies = ['H', 'T']
payoffs_df = pd.DataFrame({'p1_strategy': ['H', 'H', 'T', 'T'],
                           'p2_strategy': ['H', 'T', 'H', 'T'],
                           'p1_payoff': [1, -1, -1, 1],
                           'p2_payoff': [-1, 1, 1, -1]})
print(payoffs_df)
#  p1_strategy p2_strategy  p1_payoff  p2_payoff
#0           H           H          1         -1
#1           H           T         -1          1
#2           T           H         -1          1
#3           T           T          1         -1
ds_p1 = get_dominating_strategy('p1', payoffs_df, p2_strategies)
ds_p2 = get_dominating_strategy('p2', payoffs_df, p1_strategies)
print(ds_p1, ds_p2)
# None None
nes = get_pure_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies)
print(nes)
# None
ps, qs = get_mixed_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies)
print(ps, qs)
# [0.5 0.5] [0.5 0.5]

Note that neither a DS nor a pure NE exists in the above case, whereas a mixed NE does.

Let’s try another game, where the same thing happens, although with different probabilities:

p1_strategies = ['F', 'B']
p2_strategies = ['F', 'B']
payoffs_df = pd.DataFrame({'p1_strategy': ['F', 'F', 'B', 'B'],
                           'p2_strategy': ['F', 'B', 'F', 'B'],
                           'p1_payoff': [90, 20, 30, 60],
                           'p2_payoff': [10, 80, 70, 40]})
print(payoffs_df)
#  p1_strategy p2_strategy  p1_payoff  p2_payoff
#0           F           F         90         10
#1           F           B         20         80
#2           B           F         30         70
#3           B           B         60         40
ds_p1 = get_dominating_strategy('p1', payoffs_df, p2_strategies)
ds_p2 = get_dominating_strategy('p2', payoffs_df, p1_strategies)
print(ds_p1, ds_p2)
# None None
nes = get_pure_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies)
print(nes)
# None
ps, qs = get_mixed_nash_equilibriums(payoffs_df, p1_strategies, p2_strategies)
print(ps, qs)
# [0.3 0.7] [0.4 0.6]

Probabilistic Deep Learning with Tensorflow

In this blog, we shall discuss how to implement probabilistic deep learning models using Tensorflow. The problems to be discussed in this blog appeared in the exercises / projects of the coursera course “Probabilistic Deep Learning”, by Imperial College London, as a part of the TensorFlow 2 for Deep Learning Specialization. The problem statements / descriptions are taken from the course itself.

Naive Bayes and logistic regression with Tensorflow Probability

In this problem, we shall develop a Naive Bayes classifier model to the Iris dataset using Distribution objects from TensorFlow Probability. We shall also explore the connection between the Naive Bayes classifier and logistic regression.

The Iris dataset

In this problem, we shall use the Iris dataset. It consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. For a reference, see the following paper:

  • R. A. Fisher. “The use of multiple measurements in taxonomic problems”. Annals of Eugenics. 7 (2): 179–188, 1936.

Our goal will be to construct a Naive Bayes classifier model that predicts the correct class from the sepal length and sepal width features. Under certain assumptions about this classifier model, we shall explore the relation to logistic regression.

The following figures show the 3 categories of flowers from the Iris dataset, namely, setosa, versicolor and virginica, respectively.

Let’s start by importing the required libraries.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score
from sklearn import datasets, model_selection
%matplotlib inline

Load and prepare the data

We will first read in the Iris dataset, and split the dataset into training and test sets, using the following code snippet.

iris = datasets.load_iris()
# Use only the first two features: sepal length and width

data = iris.data[:, :2]
targets = iris.target

x_train, x_test, y_train, y_test = model_selection.train_test_split(data, targets, test_size=0.2)

labels = {0: 'Iris-Setosa', 1: 'Iris-Versicolour', 2: 'Iris-Virginica'}
label_colours = ['blue', 'orange', 'green']

def plot_data(x, y, labels, colours):
    for c in np.unique(y):
        inx = np.where(y == c)
        plt.scatter(x[inx, 0], x[inx, 1], label=labels[c], c=colours[c])
    plt.title("Training set")
    plt.xlabel("Sepal length (cm)")
    plt.ylabel("Sepal width (cm)")
    plt.legend()
    
plt.figure(figsize=(8, 5))
plot_data(x_train, y_train, labels, label_colours)
plt.show()

Naive Bayes classifier

We will briefly review the Naive Bayes classifier model. The fundamental equation for this classifier is Bayes’ rule:
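
P(Y=yk|X1,…,Xd) = P(X1,…,Xd|Y=yk) * P(Y=yk) / Σ_k′ P(X1,…,Xd|Y=yk′) * P(Y=yk′)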

In the above, d is the number of features or dimensions in the inputs X (in our case d=2), and K is the number of classes (in our case K=3). The distribution P(Y) is the class prior distribution, which is a discrete distribution over K classes. The distribution P(X|Y) is the class-conditional distribution over inputs.

The Naive Bayes classifier makes the assumption that the data features Xi are conditionally independent given the class Y (the ‘naive’ assumption). In this case, the class-conditional distribution decomposes as follows:
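
P(X1,…,Xd|Y=yk) = P(X1|Y=yk) * P(X2|Y=yk) * … * P(Xd|Y=yk)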

This simplifying assumption means that we typically need to estimate far fewer parameters for each of the distributions P(Xi|Y=yk) instead of the full joint distribution P(X|Y=yk).

Once the class prior distribution and class-conditional densities are estimated, the Naive Bayes classifier model can then make a class prediction for a new data input according to
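
y_pred = argmax_k P(Y=yk) * Π_{i=1..d} P(Xi|Y=yk)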

Define the class prior distribution

We will begin by defining the class prior distribution. To do this we will simply take the maximum likelihood estimate, given by
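
P(Y=yk) = (1/N) * Σ_n 1[y^(n) = k]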

where the superscript (n) indicates the n-th dataset example, and N is the total number of examples in the dataset. The above is simply the proportion of data examples belonging to class k.

Let’s now define a function that builds the prior distribution from the training data, and returns it as a Categorical Distribution object.

  • The input to your function y will be a numpy array of shape (num_samples,)
  • The entries in y will be integer labels k=0,1,…,K−1
  • The function should build and return the prior distribution as a Categorical distribution object
    • The probabilities for this distribution will be a length-K vector, with entries corresponding to P(Y=yk) for k=0,1,…,K−1
    • Your function should work for any value of K≥1
    • This Distribution will have an empty batch shape and empty event shape
def get_prior(y):
    # works for any number of classes K >= 1 (not just K = 3)
    return tfd.Categorical(probs=[np.mean(y == c) for c in np.unique(y)])

prior = get_prior(y_train)

# Plot the prior distribution

labels = ['Iris-Setosa', 'Iris-Versicolour', 'Iris-Virginica']
plt.bar([0, 1, 2], prior.probs.numpy(), color=label_colours)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1, 2], labels)
plt.show()


Define the class-conditional densities

Let’s now turn to the definition of the class-conditional distributions P(Xi|Y=yk) for i=0,1 and k=0,1,2. In our model, we will assume these distributions to be univariate Gaussian:
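
P(Xi|Y=yk) = N(Xi; μik, σik)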

with mean parameters μik and standard deviation parameters σik, twelve parameters in all. We will again estimate these parameters using maximum likelihood. In this case, the estimates are given by
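
μik = (1/Nk) * Σ_{n: y^(n)=k} Xi^(n)
σik² = (1/Nk) * Σ_{n: y^(n)=k} (Xi^(n) − μik)²

where Nk denotes the number of examples in class k.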

Note that the above are just the means and variances of the sample data points for each class.

Let’s now implement a function that computes the class-conditional Gaussian densities, using the maximum likelihood parameter estimates given above, and returns them in a single, batched MultivariateNormalDiag Distribution object.

  • The inputs to the function are
    • a numpy array x of shape (num_samples, num_features) for the data inputs
    • a numpy array y of shape (num_samples,) for the target labels
  • The function should work for any number of classes K≥1 and any number of features d≥1
def get_class_conditionals(x, y):
    # MLE estimates of the per-class means and standard deviations,
    # for any number of classes K >= 1 and features d >= 1
    classes = np.unique(y)
    mu = [np.mean(x[y == k], axis=0) for k in classes]
    sigma = [np.sqrt(np.mean((x[y == k] - mu[i])**2, axis=0)) for i, k in enumerate(classes)]
    return tfd.MultivariateNormalDiag(loc=mu, scale_diag=sigma)

class_conditionals = get_class_conditionals(x_train, y_train)

We can visualise the class-conditional densities with contour plots by running the cell below. Notice how the contours of each distribution correspond to a Gaussian distribution with diagonal covariance matrix, since the model assumes that each feature is independent given the class.

def get_meshgrid(x0_range, x1_range, num_points=100):
    x0 = np.linspace(x0_range[0], x0_range[1], num_points)
    x1 = np.linspace(x1_range[0], x1_range[1], num_points)
    return np.meshgrid(x0, x1)

def contour_plot(x0_range, x1_range, prob_fn, batch_shape, colours, levels=None, num_points=100):
    X0, X1 = get_meshgrid(x0_range, x1_range, num_points=num_points)
    Z = prob_fn(np.expand_dims(np.array([X0.ravel(), X1.ravel()]).T, 1))
    #print(Z.shape, batch_shape, 'x', *X0.shape)
    Z = np.array(Z).T.reshape(batch_shape, *X0.shape)
    for batch in np.arange(batch_shape):
        if levels:
            plt.contourf(X0, X1, Z[batch], alpha=0.2, colors=colours, levels=levels)
        else:
            plt.contour(X0, X1, Z[batch], colors=colours[batch], alpha=0.3)

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), class_conditionals.prob, 3, label_colours)
plt.title("Training set with class-conditional density contours")
plt.show()

Make predictions from the model

Now that the prior and class-conditional distributions are defined, we can use them to compute the model’s class probability predictions for an unknown test input, using Bayes’ rule as above.

The class prediction can then be taken as the class with the maximum posterior probability.

Let’s now implement a function to return the model’s class probabilities for a given batch of test inputs of shape (batch_shape, 2), where the batch_shape has rank at least one.

  • The inputs to the function are the prior and class_conditionals distributions, and the inputs x
  • The function should use these distributions to compute the probabilities for each class k as above
    • As before, the function should work for any number of classes K≥1
  • It should then compute the prediction by taking the class with the highest probability
  • The predictions should be returned in a numpy array of shape (batch_shape)
def predict_class(prior, class_conditionals, x):
    x = x[:, np.newaxis, :]
    return tf.argmax(tf.cast(class_conditionals.prob(x),tf.float32)*tf.cast(prior.probs,tf.float32),axis=1).numpy()

predictions = predict_class(prior, class_conditionals, x_test)

# Evaluate the model accuracy on the test set
accuracy = accuracy_score(y_test, predictions)
print("Test accuracy: {:.4f}".format(accuracy))
# Test accuracy: 0.8000

# Plot the model's decision regions

plt.figure(figsize=(10, 6))
plot_data(x_train, y_train, labels, label_colours)
x0_min, x0_max = x_train[:, 0].min(), x_train[:, 0].max()
x1_min, x1_max = x_train[:, 1].min(), x_train[:, 1].max()
contour_plot((x0_min, x0_max), (x1_min, x1_max), 
             lambda x: predict_class(prior, class_conditionals, x), 
             3, label_colours, levels=[-0.5, 0.5, 1.5, 2.5],
             num_points=500)
plt.title("Training set with decision regions")
plt.show()

Binary classifier

We will now draw a connection between the Naive Bayes classifier and logistic regression.

First, we will update our model to be a binary classifier. In particular, the model will output the probability that a given input data sample belongs to the ‘Iris-Setosa’ class: P(Y=y0|X1,…,Xd). The remaining two classes will be pooled together with the label y1.

# Redefine the dataset to have binary labels

y_train_binary = np.array(y_train)
y_train_binary[np.where(y_train_binary == 2)] = 1

y_test_binary = np.array(y_test)
y_test_binary[np.where(y_test_binary == 2)] = 1

# Plot the training data

labels_binary = {0: 'Iris-Setosa', 1: 'Iris-Versicolour / Iris-Virginica'}
label_colours_binary = ['blue', 'red']

plt.figure(figsize=(8, 5))
plot_data(x_train, y_train_binary, labels_binary, label_colours_binary)
plt.show()

We will also make an extra modelling assumption that the class-conditional distribution P(Xi|Y=yk) for each feature i=0,1 has a standard deviation σi that is the same for every class k.

This means there are now six parameters in total: four for the means μik (i,k=0,1) and two for the standard deviations σi (i=0,1).

We will again use maximum likelihood to estimate these parameters. The prior distribution will be as before, with the class prior probabilities given by

We will use our previous function get_prior to redefine the prior distribution.

prior_binary = get_prior(y_train_binary)
# Plot the prior distribution
plt.bar([0, 1], prior_binary.probs.numpy(), color=label_colours_binary)
plt.xlabel("Class")
plt.ylabel("Prior probability")
plt.title("Class prior distribution")
plt.xticks([0, 1], labels_binary)
plt.show()


For the class-conditional densities, the maximum likelihood estimates for the means are again given by the per-class sample means, as before.

However, the estimates for the standard deviations σi are updated. There is also a closed-form solution for the shared standard deviations, but we will instead learn these from the data.

Let’s now implement a function that takes the training inputs and target labels as input, as well as an optimizer object, number of epochs and a TensorFlow Variable. This function should be written according to the following spec:

  • The inputs to the function are:
    • a numpy array x of shape (num_samples, num_features) for the data inputs
    • a numpy array y of shape (num_samples,) for the target labels
    • a tf.Variable object scales of length 2 for the standard deviations σi
    • optimiser: an optimiser object
    • epochs: the number of epochs to run the training for
  • The function should first compute the means μik of the class-conditional Gaussians according to the above equation
  • Then create a batched multivariate Gaussian distribution object using MultivariateNormalDiag with the means set to μik and the scales set to scales
  • Run a custom training loop for epochs number of epochs, in which:
    • the average per-example negative log likelihood for the whole dataset is computed as the loss
    • the gradient of the loss with respect to the scales variables is computed
    • the scales variables are updated by the optimiser object
  • At each iteration, save the values of the scales variable and the loss
  • The function should return a tuple of three objects:
    • a numpy array of shape (epochs,) of loss values
    • a numpy array of shape (epochs, 2) of values for the scales variable at each iteration
    • the final learned batched MultivariateNormalDiag distribution object
mu = [np.mean(x_train[y_train_binary == k], axis=0) for k in range(2)]

def learn_stdevs(x, y, scales, optimiser, epochs):
    
    def nll(x, y, distribution):
        predictions = - distribution.log_prob(x)
        return tf.reduce_sum(predictions[y==0][:,0]) + tf.reduce_sum(predictions[y==1][:,1])

    @tf.function
    def get_loss_and_grads(x, y, distribution):
        with tf.GradientTape() as tape:
            tape.watch(distribution.trainable_variables)
            loss = nll(x, y, distribution)
            grads = tape.gradient(loss, distribution.trainable_variables)
        return loss, grads

    shape = (len(set(y)), x.shape[-1])  # (num_classes, num_features)
    loc = np.zeros(shape, dtype=np.float32)

    # MLE estimates of the per-class, per-feature means
    for category in range(shape[0]):
        for feature in range(shape[1]):
            loc[category, feature] = np.mean(x[y == category][:, feature])

    distribution = tfd.MultivariateNormalDiag(loc=loc, scale_diag=scales) # batch shape (2,), event shape (2,)
    x = np.expand_dims(x , 1).astype('float32')
    

    train_loss_results = []
    train_scale_results = []

    for epoch in range(epochs):
        loss, grads = get_loss_and_grads(x, y, distribution)
        optimiser.apply_gradients(zip(grads, distribution.trainable_variables))
        scales = distribution.parameters['scale_diag'].numpy()
        train_loss_results.append(loss)
        train_scale_results.append(scales)
        if epoch % 10 == 0:
            print(f'epoch: {epoch}, loss: {loss}')
        
    return np.array(train_loss_results), np.array(train_scale_results), distribution

scales = tf.Variable([1., 1.])
opt = tf.keras.optimizers.Adam(learning_rate=0.01)
epochs = 500
# run the training loop
nlls, scales_arr, class_conditionals_binary = learn_stdevs(x_train, y_train_binary, scales, opt, epochs)

epoch: 0, loss: 246.33450317382812
epoch: 10, loss: 227.07168579101562
epoch: 20, loss: 207.1158905029297
epoch: 30, loss: 187.12120056152344
epoch: 40, loss: 168.60015869140625
epoch: 50, loss: 153.5633087158203
epoch: 60, loss: 143.8475341796875
epoch: 70, loss: 142.80393981933594
epoch: 80, loss: 142.56259155273438
epoch: 90, loss: 142.23074340820312
epoch: 100, loss: 142.25711059570312
epoch: 110, loss: 142.18955993652344
epoch: 120, loss: 142.1979217529297
epoch: 130, loss: 142.18882751464844
epoch: 140, loss: 142.18991088867188
epoch: 150, loss: 142.1887664794922
epoch: 160, loss: 142.1888885498047
epoch: 170, loss: 142.18875122070312
epoch: 180, loss: 142.1887664794922
epoch: 190, loss: 142.1887664794922
epoch: 200, loss: 142.1887664794922
epoch: 210, loss: 142.18875122070312
epoch: 220, loss: 142.1887664794922
epoch: 230, loss: 142.18873596191406
epoch: 240, loss: 142.18878173828125
epoch: 250, loss: 142.18875122070312
epoch: 260, loss: 142.18875122070312
epoch: 270, loss: 142.18875122070312
epoch: 280, loss: 142.18875122070312
epoch: 290, loss: 142.18875122070312
epoch: 300, loss: 142.18878173828125
epoch: 310, loss: 142.18875122070312
epoch: 320, loss: 142.18875122070312
epoch: 330, loss: 142.18875122070312
epoch: 340, loss: 142.18875122070312
epoch: 350, loss: 142.1887664794922
epoch: 360, loss: 142.1887664794922
epoch: 370, loss: 142.1887664794922
epoch: 380, loss: 142.1887664794922
epoch: 390, loss: 142.1887664794922
epoch: 400, loss: 142.1887664794922
epoch: 410, loss: 142.1887664794922
epoch: 420, loss: 142.1887664794922
epoch: 430, loss: 142.1887664794922
epoch: 440, loss: 142.1887664794922
epoch: 450, loss: 142.1887664794922
epoch: 460, loss: 142.1887664794922
epoch: 470, loss: 142.1887664794922
epoch: 480, loss: 142.1887664794922
epoch: 490, loss: 142.1887664794922
print("Class conditional means:")
print(class_conditionals_binary.loc.numpy())
print("\nClass conditional standard deviations:")
print(class_conditionals_binary.stddev().numpy())
Class conditional means:
[[4.9692307 3.3820512]
 [6.2172837 2.8814814]]

Class conditional standard deviations:
[[0.5590086  0.34253535]
 [0.5590086  0.34253535]]
# Plot the loss and convergence of the standard deviation parameters

fig, ax = plt.subplots(1, 2, figsize=(14, 5))
ax[0].plot(nlls)
ax[0].set_title("Loss vs epoch")
ax[0].set_xlabel("Epoch")
ax[0].set_ylabel("Average negative log-likelihood")
for k in [0, 1]:
    ax[1].plot(scales_arr[:, k], color=label_colours_binary[k], label=labels_binary[k])
ax[1].set_title("Standard deviation ML estimates vs epoch")
ax[1].set_xlabel("Epoch")
ax[1].set_ylabel("Standard deviation")
plt.legend()
plt.show()

We can also plot the contours of the class-conditional Gaussian distributions as before, this time with just binary labelled data. Notice the contours are the same for each class, just with a different centre location.

We can also plot the decision regions for this binary classifier model; notice that the decision boundary is now linear.

The following animation shows how we can learn the standard deviation parameters for the class-conditional distributions for Naive Bayes using tensorflow, with the above code snippet.

In fact, we can see that our predictive distribution P(Y=y0|X) can be written as follows:
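
P(Y=y0|X) = σ(a),  where  a = log[ P(X|Y=y0) * P(Y=y0) ] − log[ P(X|Y=y1) * P(Y=y1) ]

and σ denotes the logistic sigmoid function.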

With our additional modelling assumption of a shared covariance matrix Σ, it can be shown (using the Gaussian pdf) that a is in fact a linear function of X:
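
a = wᵀX + w0,  with  w = Σ⁻¹(μ0 − μ1)  and  w0 = −(1/2)·μ0ᵀΣ⁻¹μ0 + (1/2)·μ1ᵀΣ⁻¹μ1 + log( P(Y=y0) / P(Y=y1) )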

The model therefore takes the form P(Y=y0|X) = σ(wᵀX + w0), with weights w ∈ R^2 and bias w0 ∈ R. This is the form used by logistic regression, and explains why the decision boundary above is linear.

In the above we have outlined the derivation of the generative logistic regression model. The parameters are typically estimated with maximum likelihood, as we have done.

Finally, we will use the above equations to directly parameterize the output Bernoulli distribution of the generative logistic regression model.

Let’s now write a function according to the following specification:

  • The inputs to the function are:
    • the prior distribution prior over the two classes
    • the (batched) class-conditional distribution class_conditionals
  • The function should use the parameters of the above distributions to compute the weights and bias terms w and w0 as above
  • The function should then return a tuple of two numpy arrays for w and w0
def get_logistic_regression_params(prior, class_conditionals):    
    Sigma = class_conditionals.covariance().numpy()
    SI = np.linalg.inv(Sigma)
    p = prior.probs.numpy()
    mu = class_conditionals.parameters['loc'] #.numpy()
    w = SI @ (mu[0] - mu[1])
    w0 = -0.5*mu[0].T@SI@mu[0] + 0.5*mu[1].T@SI@mu[1] + np.log(p[0]/p[1])
    return w, w0

w, w0 = get_logistic_regression_params(prior_binary, class_conditionals_binary)

We can now use these parameters to make a contour plot to display the predictive distribution of our logistic regression model.

Probabilistic generative models

Let’s start with generative models, using normalizing flow networks and the variational autoencoder algorithm. We shall create a synthetic dataset with a normalizing flow with randomised parameters. This dataset will then be used to train a variational autoencoder, and the trained model will be used to interpolate between the generated images. The concepts to be used will be

  • Distribution objects
  • Probabilistic layers
  • Bijectors
  • ELBO optimization
  • KL divergence regularizers.

The next figure represents the theory required for the implementation:

Let’s start by importing the required libraries.

import tensorflow as tf
import tensorflow_probability as tfp
tfd = tfp.distributions
tfb = tfp.bijectors
tfpl = tfp.layers

import numpy as np
import matplotlib.pyplot as plt

We shall create our own image dataset from contour plots of a transformed distribution using a random normalizing flow network and then use the variational autoencoder algorithm to train generative and inference networks, and synthesise new images by interpolating in the latent space.

The normalising flow

  • To construct the image dataset, we will build a normalizing flow to transform the 2-D Gaussian random variable z = (z1, z2), which has mean 0 and covariance matrix Σ=σ^2.I2, with σ=0.3.
  • This normalizing flow uses bijectors that are parameterized by the following random variables:
    • θ ∼ U[0,2π)
    • a ∼ N(3,1)

The complete normalising flow is given by the following chain of transformations:

  • f1(z)=(z1,z2−2)
  • f2(z)=(z1,z2/2),
  • f3(z)=(z1,z2+a.z1^2),
  • f4(z)=R.z, where R is a rotation matrix with angle θ,
  • f5(z)=tanh(z), where the tanh function is applied elementwise.

The transformed random variable x is given by x=f5(f4(f3(f2(f1(z))))).

  • We need to use or construct bijectors for each of the transformations fi, i=1,…,5 and use tfb.Chain and tfb.TransformedDistribution to construct the final transformed distribution.
  • Make sure to implement the log_det_jacobian methods for any subclassed bijectors that you write.
  • Display a scatter plot of samples from the base distribution.
  • Display 4 scatter plot images of the transformed distribution from your random normalizing flow, using samples of θ and a. Fix the axes of these 4 plots to the range [−1,1].

The following code block shows how to implement the above steps:

def plot_distribution(samples, ax, title, col='red'):
    ax.scatter(samples[:, 0], samples[:, 1], marker='.', c=col, alpha=0.5) 
    ax.set_xlim([-1,1])
    ax.set_ylim([-1,1])
    ax.set_title(title, size=15)
# f3(z) = (z1, z2 + a*z1^2)
class Degree2Polynomial(tfb.Bijector):

    def __init__(self, a):
        self.a = a
        super(Degree2Polynomial, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)
        
    def _forward(self, x):
        return tf.concat([x[..., :1], x[..., 1:] + self.a * tf.square(x[..., :1])], axis=-1)
    
    def _inverse(self, y):
        return tf.concat([y[..., :1], y[..., 1:] - self.a * tf.square(y[..., :1])], axis=-1)
        
    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., dtype=x.dtype)

    
# f4(z) = R.z
class Rotation(tfb.Bijector):

    def __init__(self, theta):
        self.R = tf.constant([[np.cos(theta), -np.sin(theta)], 
                             [np.sin(theta), np.cos(theta)]], dtype=tf.float32)
        super(Rotation, self).__init__(forward_min_event_ndims=1, is_constant_jacobian=True)
        
    def _forward(self, x):
        return tf.linalg.matvec(self.R, x)
    
    def _inverse(self, y):
        return tf.linalg.matvec(tf.transpose(self.R), y)
    
    def _forward_log_det_jacobian(self, x):
        return tf.constant(0., x.dtype)
def get_normalizing_flow_dist(a, theta):
    bijectors = [
                    tfb.Shift([0.,-2]),   # f1
                    tfb.Scale([1,1/2]),   # f2
                    Degree2Polynomial(a), # f3
                    Rotation(theta),      # f4
                    tfb.Tanh()            # f5
               ]
    flow_bijector = tfb.Chain(list(reversed(bijectors)))
    return tfd.TransformedDistribution(distribution=base_distribution,
                                                        bijector=flow_bijector)
nsamples = 10000
sigma = 0.3
base_distribution = tfd.MultivariateNormalDiag(loc=tf.zeros(2), scale_diag=sigma*tf.ones(2))
samples = base_distribution.sample(nsamples)
fig, ax = plt.subplots(figsize=(8,8))
plot_distribution(samples, ax, 'Base distribution', 'blue')
plt.show()
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(15,15))
axes = axes.flatten()
plt.subplots_adjust(0, 0, 1, 0.925, 0.05, 0.05)
colors = ['red', 'green', 'orange', 'magenta']
for i in range(4):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    samples = transformed_distribution.sample(nsamples)
    plot_distribution(samples, axes[i], r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), colors[i])
plt.suptitle('Transformed Distribution with Normalizing Flow', size=20)
plt.show()

Create the image dataset

  • Let’s now use the random normalizing flow network to generate an image dataset of contour plots.
  • First, let’s display a sample of 4 contour plot images from our normalizing flow network using 4 independently sampled sets of parameters, using the following get_densities function: this function calculates density values for a (batched) Distribution for use in a contour plot.
  • The dataset should consist of at least 1000 images, stored in a numpy array of shape (N, 36, 36, 3). Each image in the dataset should correspond to a contour plot of a transformed distribution from a normalizing flow with an independently sampled set of parameters. It will take a few minutes to create the dataset.
  • As well as the get_densities function, the following get_image_array_from_density_values function will help to generate the dataset. This function creates a numpy array for an image of the contour plot for a given set of density values Z. Feel free to choose your own options for the contour plots.
  • Let’s display a sample of 20 images from your generated dataset in a figure.
X, Y = np.meshgrid(np.linspace(-1, 1, 100), np.linspace(-1, 1, 100))
inputs = np.transpose(np.stack((X, Y)), [1, 2, 0])

def get_densities(transformed_distribution):
    batch_shape = transformed_distribution.batch_shape
    Z = transformed_distribution.prob(np.expand_dims(inputs, 2))
    Z = np.transpose(Z, list(range(2, 2+len(batch_shape))) + [0, 1])
    return Z
import numpy as np
from matplotlib.backends.backend_agg import FigureCanvasAgg as FigureCanvas
from matplotlib.figure import Figure

def get_image_array_from_density_values(Z):
    assert Z.shape == (100, 100)
    fig = Figure(figsize=(0.5, 0.5))
    canvas = FigureCanvas(fig)
    ax = fig.gca()
    ax.contourf(X, Y, Z, cmap='hot', levels=100)
    ax.axis('off')
    fig.tight_layout(pad=0)

    ax.margins(0)
    fig.canvas.draw()
    image_from_plot = np.frombuffer(fig.canvas.tostring_rgb(), dtype=np.uint8)
    image_from_plot = image_from_plot.reshape(fig.canvas.get_width_height()[::-1] + (3,))
    return image_from_plot
plt.figure(figsize=(5,5))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
for i in range(4):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = get_normalizing_flow_dist(a, theta)
    transformed_distribution = tfd.BatchReshape(transformed_distribution, [1])
    Z = get_densities(transformed_distribution)
    image = get_image_array_from_density_values(Z.squeeze())
    plt.subplot(2,2,i+1), plt.imshow(image), plt.axis('off')
    plt.title(r'$\theta$={:.02f}, a={:.02f}'.format(theta, a), size=10)
plt.show()
N = 1000
image_dataset = np.zeros((N, 36, 36, 3))
for i in range(N):
    a = tfd.Normal(loc=3, scale=1).sample(1)[0].numpy()
    theta = tfd.Uniform(low = 0, high = 2*np.pi).sample(1)[0].numpy()
    transformed_distribution = tfd.BatchReshape(get_normalizing_flow_dist(a, theta), [1])
    image_dataset[i,...] = get_image_array_from_density_values(get_densities(transformed_distribution).squeeze())
image_dataset = tf.convert_to_tensor(image_dataset, dtype=tf.float32)
image_dataset.shape
# TensorShape([1000, 36, 36, 3])

plt.figure(figsize=(20,4))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(N, 20)
for i in range(20):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(2,10,i+1), plt.imshow(image), plt.axis('off')
plt.show()

Create tf.data.Dataset objects

  • Let’s now split your dataset to create tf.data.Dataset objects for training and validation data.
  • Using the map method, let’s normalize the pixel values so that they lie between 0 and 1.
  • These Datasets will be used to train a variational autoencoder (VAE). Use the map method to return a tuple of input and output Tensors where the image is duplicated as both input and output.
  • Randomly shuffle the training Dataset.
  • Batch both datasets with a batch size of 20, setting drop_remainder=True.
  • Print the element_spec property for one of the Dataset objects.
n = len(image_dataset)
tf_image_dataset = tf.data.Dataset.from_tensor_slices(image_dataset)
tf_image_dataset = tf_image_dataset.shuffle(3)
tf_image_dataset = tf_image_dataset.map(lambda x : x / tf.reduce_max(x))
tf_image_dataset = tf_image_dataset.map(lambda x: (x, x))
train_sz = int(0.8*n)
training = tf_image_dataset.take(train_sz)
validation = tf_image_dataset.skip(train_sz)
training = training.batch(batch_size=20, drop_remainder=True)
validation = validation.batch(batch_size=20, drop_remainder=True)
training.element_spec
#(TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None),
# TensorSpec(shape=(20, 36, 36, 3), dtype=tf.float32, name=None))

Build the encoder and decoder networks

  • Let’s now create the encoder and decoder for the variational autoencoder algorithm.
  • Let’s design these networks, subject to the following constraints:
    • The encoder and decoder networks should be built using the Sequential class.
    • The encoder and decoder networks should use probabilistic layers where necessary to represent distributions.
    • The prior distribution should be a zero-mean, isotropic Gaussian (identity covariance matrix).
    • The encoder network should add the KL divergence loss to the model.
  • Print the model summary for the encoder and decoder networks.
from tensorflow.keras.models import Sequential, Model
from tensorflow.keras.layers import (Dense, Flatten, Reshape, Concatenate, Conv2D, UpSampling2D, BatchNormalization)
latent_dim = 2 #50
prior = tfd.MultivariateNormalDiag(loc=tf.zeros(latent_dim))

def get_kl_regularizer(prior_distribution):
    return tfpl.KLDivergenceRegularizer(prior_distribution,
                                        weight=1.0,
                                        use_exact_kl=False,
                                        test_points_fn=lambda q: q.sample(3),
                                        test_points_reduce_axis=(0,1))        

kl_regularizer = get_kl_regularizer(prior)

def get_encoder(latent_dim, kl_regularizer):
    return Sequential([
            Conv2D(filters=32, kernel_size=3, activation='relu', strides=2, padding='same', input_shape=(36,36,3)),
            BatchNormalization(),
            Conv2D(filters=64, kernel_size=3, activation='relu', strides=2, padding='same'),
            BatchNormalization(),
            Conv2D(filters=128, kernel_size=3, activation='relu', strides=3, padding='same'),
            BatchNormalization(),
            Flatten(),
            Dense(tfpl.MultivariateNormalTriL.params_size(latent_dim)),
            tfpl.MultivariateNormalTriL(latent_dim, activity_regularizer=kl_regularizer)
        ], name='encoder')      

def get_decoder(latent_dim):
    return Sequential([
        Dense(1152, activation='relu', input_shape=(latent_dim,)), 
        Reshape((3,3,128)),
        UpSampling2D(size=(3,3)),
        Conv2D(filters=64, kernel_size=3, activation='relu', padding='same'),
        UpSampling2D(size=(2,2)),
        Conv2D(filters=32, kernel_size=2, activation='relu', padding='same'),
        UpSampling2D(size=(2,2)),
        Conv2D(filters=128, kernel_size=2, activation='relu', padding='same'),
        Conv2D(filters=3, kernel_size=2, activation=None, padding='same'),
        Flatten(),   
        tfpl.IndependentBernoulli(event_shape=(36,36,3))
    ], name='decoder')    

encoder = get_encoder(latent_dim=2, kl_regularizer=kl_regularizer)
#encoder.losses
encoder.summary()

Model: "encoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
conv2d (Conv2D)              (None, 18, 18, 32)        896       
_________________________________________________________________
batch_normalization (BatchNo (None, 18, 18, 32)        128       
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 9, 9, 64)          18496     
_________________________________________________________________
batch_normalization_1 (Batch (None, 9, 9, 64)          256       
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 3, 3, 128)         73856     
_________________________________________________________________
batch_normalization_2 (Batch (None, 3, 3, 128)         512       
_________________________________________________________________
flatten (Flatten)            (None, 1152)              0         
_________________________________________________________________
dense (Dense)                (None, 5)                 5765      
_________________________________________________________________
multivariate_normal_tri_l (M multiple                  0         
=================================================================
Total params: 99,909
Trainable params: 99,461
Non-trainable params: 448
_________________________________________________________________

decoder = get_decoder(latent_dim=2)
decoder.summary()

Model: "decoder"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
dense_1 (Dense)              (None, 1152)              3456      
_________________________________________________________________
reshape (Reshape)            (None, 3, 3, 128)         0         
_________________________________________________________________
up_sampling2d (UpSampling2D) (None, 9, 9, 128)         0         
_________________________________________________________________
conv2d_3 (Conv2D)            (None, 9, 9, 64)          73792     
_________________________________________________________________
up_sampling2d_1 (UpSampling2 (None, 18, 18, 64)        0         
_________________________________________________________________
conv2d_4 (Conv2D)            (None, 18, 18, 32)        8224      
_________________________________________________________________
up_sampling2d_2 (UpSampling2 (None, 36, 36, 32)        0         
_________________________________________________________________
conv2d_5 (Conv2D)            (None, 36, 36, 128)       16512     
_________________________________________________________________
conv2d_6 (Conv2D)            (None, 36, 36, 3)         1539      
_________________________________________________________________
flatten_1 (Flatten)          (None, 3888)              0         
_________________________________________________________________
independent_bernoulli (Indep multiple                  0         
=================================================================
Total params: 103,523
Trainable params: 103,523
Non-trainable params: 0
____________________________
def reconstruction_loss(batch_of_images, decoding_dist):
    return -tf.reduce_mean(decoding_dist.log_prob(batch_of_images))

Train the variational autoencoder

  • Let’s now train the variational autoencoder. Build the VAE using the Model class and the encoder and decoder models. Print the model summary.
  • Compile the VAE with the negative log likelihood loss and train with the fit method, using the training and validation Datasets.
  • Plot the learning curves for loss vs epoch for both training and validation sets.
vae = Model(inputs=encoder.inputs, outputs=decoder(encoder.outputs))
optimizer = tf.keras.optimizers.Adam(learning_rate=0.0005)
vae.compile(optimizer=optimizer, loss=reconstruction_loss)
history = vae.fit(training, validation_data=validation, epochs=20)

Epoch 1/20
40/40 [==============================] - 34s 777ms/step - loss: 1250.2296 - val_loss: 1858.7103
Epoch 2/20
40/40 [==============================] - 29s 731ms/step - loss: 661.8687 - val_loss: 1605.1261
Epoch 3/20
40/40 [==============================] - 29s 720ms/step - loss: 545.2802 - val_loss: 1245.0518
Epoch 4/20
40/40 [==============================] - 28s 713ms/step - loss: 489.1101 - val_loss: 1024.5863
Epoch 5/20
40/40 [==============================] - 29s 718ms/step - loss: 453.3464 - val_loss: 841.4725
Epoch 6/20
40/40 [==============================] - 29s 733ms/step - loss: 438.8413 - val_loss: 742.0212
Epoch 7/20
40/40 [==============================] - 30s 751ms/step - loss: 433.2563 - val_loss: 657.4024
Epoch 8/20
40/40 [==============================] - 30s 751ms/step - loss: 417.5353 - val_loss: 602.7039
Epoch 9/20
40/40 [==============================] - 29s 726ms/step - loss: 409.8351 - val_loss: 545.5004
Epoch 10/20
40/40 [==============================] - 30s 741ms/step - loss: 406.3284 - val_loss: 507.9868
Epoch 11/20
40/40 [==============================] - 30s 741ms/step - loss: 402.9056 - val_loss: 462.0777
Epoch 12/20
40/40 [==============================] - 29s 733ms/step - loss: 397.4801 - val_loss: 444.4444
Epoch 13/20
40/40 [==============================] - 30s 741ms/step - loss: 398.2078 - val_loss: 423.1287
Epoch 14/20
40/40 [==============================] - 29s 723ms/step - loss: 395.5187 - val_loss: 411.3030
Epoch 15/20
40/40 [==============================] - 30s 739ms/step - loss: 397.3987 - val_loss: 407.5134
Epoch 16/20
40/40 [==============================] - 29s 721ms/step - loss: 399.3271 - val_loss: 402.7288
Epoch 17/20
40/40 [==============================] - 29s 736ms/step - loss: 393.4259 - val_loss: 401.4711
Epoch 18/20
40/40 [==============================] - 29s 726ms/step - loss: 390.5508 - val_loss: 399.1924
Epoch 19/20
40/40 [==============================] - 29s 736ms/step - loss: 389.3187 - val_loss: 401.1656
Epoch 20/20
40/40 [==============================] - 29s 728ms/step - loss: 389.4718 - val_loss: 393.5178
nepochs = 20
plt.figure(figsize=(8,5))
plt.plot(range(nepochs), history.history['loss'], label='train-loss')
plt.plot(range(nepochs), history.history['val_loss'], label='valid-loss')
plt.legend()
plt.xlabel('epochs')
plt.ylabel('loss')
plt.show()

Use the encoder and decoder networks

  • Let’s now put our encoder and decoder networks into practice!
  • Randomly sample 1000 images from the dataset, and pass them through the encoder. Display the embeddings in a scatter plot (project to 2 dimensions if the latent space has dimension higher than two).
  • Randomly sample 4 images from the dataset and for each image, display the original and reconstructed image from the VAE in a figure.
    • Use the mean of the output distribution to display the images.
  • Randomly sample 6 latent variable realisations from the prior distribution, and display the images in a figure.
    • Again use the mean of the output distribution to display the images.
def reconstruct(encoder, decoder, batch_of_images):
    approx_distribution = encoder(batch_of_images)
    decoding_dist = decoder(approx_distribution.mean())
    return decoding_dist.mean()

embedding = encoder(image_dataset / 255).mean()
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(embedding[:,0], embedding[:,1], c='red', s=50, edgecolor='k')
plt.title('Embedding', size=20)
plt.show()
plt.figure(figsize=(6,12))
plt.subplots_adjust(0, 0, 1, 0.95, 0.05, 0.08)
indices = np.random.choice(len(image_dataset), 4)
for i in range(4):
    image = image_dataset[indices[i]].numpy()
    image = image / image.max()
    plt.subplot(4,2,2*i+1), plt.imshow(image), plt.axis('off')
    reconstructions = reconstruct(encoder, decoder, np.expand_dims(image, axis=0))
    plt.subplot(4,2,2*i+2), plt.imshow(reconstructions[0].numpy()), plt.axis('off')
plt.suptitle('original (left column) vs. VAE-reconstructed (right column)', size=15)
plt.show()
nsample = 6
samples = np.random.uniform(-10, 10, (nsample, latent_dim)) #prior.sample(6)
fig, ax = plt.subplots(figsize=(8,8))
plt.scatter(samples[:,0], samples[:,1], color='blue')
for i in range(nsample):
    plt.text(samples[i,0] + 0.05, samples[i,1] + 0.05, 'embedding {}'.format(i), fontsize=15)
plt.title('Embeddings', size=20)
plt.show()
reconstructions = decoder(samples).mean()
#print(samples.shape, reconstructions.shape)
plt.figure(figsize=(8,6))
plt.subplots_adjust(0, 0, 1, 0.9, 0.05, 0.08)
for i in range(nsample):
    plt.subplot(2,3,i+1), plt.imshow(reconstructions[i]), plt.title('image {}'.format(i)), plt.axis('off')
plt.suptitle('VAE-reconstructions', size=20)
plt.show()

The following animation of latent-space interpolation shows how the decoder’s generations vary as we move between points in the latent space.
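Such an animation can be generated by decoding points along a line segment between two latent codes; the following is a minimal sketch (assuming the decoder and latent_dim defined above), again using the mean of the output distribution to display the images.

# a minimal sketch of latent-space interpolation: decode points on the line
# segment between two random latent codes z0 and z1
z0, z1 = np.random.randn(latent_dim), np.random.randn(latent_dim)
alphas = np.linspace(0, 1, 10)
z = np.stack([(1 - a) * z0 + a * z1 for a in alphas]).astype('float32')
images = decoder(z).mean()   # mean of the output distribution, as before
plt.figure(figsize=(20, 2))
for i in range(len(alphas)):
    plt.subplot(1, len(alphas), i + 1), plt.imshow(images[i]), plt.axis('off')
plt.show()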

To be continued…

Machine learning with H2O in R / Python

In this blog, we shall discuss how to use H2O to build a few supervised machine learning models. H2O is a Java-based software for data modeling and general computing, whose primary purpose is to serve as a distributed, parallel, in-memory processing engine. It needs to be installed first (instructions) and by default an H2O instance will run on localhost:54321. Additionally, one needs to install the R / python clients to communicate with the H2O instance. Every new R / python session first needs to initialize a connection between the client and the H2O cluster.
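For example, with the python client, a session can be initialized as follows (a minimal sketch; the host and port shown in the comment are the defaults mentioned above):

import h2o
# connect to a running H2O instance (or start a new local one); by default
# the client looks for the cluster at localhost:54321
h2o.init()
# h2o.init(ip='localhost', port=54321)   # the defaults, made explicit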

The problems to be described in this blog appeared in the exercises / projects in the coursera course “Practical Machine Learning on H2O“, by H2O. The problem statements / descriptions / steps are taken from the course itself. We shall use the concepts from the course, in order to, e.g.,

  • to build a few machine learning / deep learning models using different algorithms (such as Gradient Boosting, Random Forest, Neural Net, Elastic Net GLM etc.),
  • to review the classic bias-variance tradeoff (overfitting)
  • for hyper-parameter tuning using Grid Search
  • to use AutoML to automatically find a bunch of good performing models
  • to use Stacked Ensembles of models to improve performance.

Problem 1

In this problem we shall create an artificial data set, then run random forest / GBM on it with H2O, to create two supervised models for classification, one that is reasonable and another one that shows clear over-fitting. We shall use R client (package) for H2O for this problem.

  1. Let’s first create a data set to predict an employee’s job satisfaction in an organization. Let’s say an employee’s job satisfaction depends on the following factors (there are several other factors in general, but we shall limit ourselves to the following few):
    • work environment
    • pay
    • flexibility
    • relationship with manager
    • age
set.seed(321)

# Let's say an employee's job satisfaction depends on the work environment, pay, flexibility, relationship with manager and age.

N <- 1000                                         # number of samples
d <- data.frame(id = 1:N)
d$workEnvironment <- sample(1:5, N, replace=TRUE) # on a scale of 1-5, 1 being bad and 5 being good
v <- round(rnorm(N, mean=60000, sd=20000))        # 68% are 40-80k
v <- pmax(v, 20000)
v <- pmin(v, 100000) #table(v)
d$pay <- v
d$flexibility <- sample(1:5, N, replace=TRUE)     # on a scale of 1-5, 1 being bad and 5 being good
d$managerRel <- sample(1:5, N, replace=TRUE)      # on a scale of 1-5, 1 being bad and 5 being good
d$age <- round(runif(N, min=20, max=60))
head(d)

#  id workEnvironment   pay flexibility managerRel age
#1  1               2 20000           2          2  21
#2  2               5 75817           1          2  31
#3  3               5 45649           5          3  25
#4  4               1 47157           1          5  55
#5  5               2 69729           2          4  33
#6  6               1 75101           2          2  39

v <- 125 * (d$pay/1000)^2 # e.g., job satisfaction score is proportional to square of pay (hypothetically)
v <- v + 250 / log(d$age) # e.g., inversely proportional to log of age
v <- v + 5 * d$flexibility
v <- v + 200 * d$workEnvironment
v <- v + 1000 * d$managerRel^3
v <- v + runif(N, 0, 5000)
v <- 100 * (v - 0) / (max(v) - min(v)) # scale the score to roughly 0-100 (exact min-max normalization only if min(v) = 0)
d$jobSatScore <- round(v) # Round to nearest integer (percentage)

2. Let’s start h2o, and import the data.

library(h2o)
h2o.init()
as.h2o(d, destination_frame = "jobsatisfaction")
jobsat <- h2o.getFrame("jobsatisfaction")

#  |===========================================================================================================| 100%
#  id workEnvironment   pay flexibility managerRel age jobSatScore
#1  1               2 20000           2          2  21           5
#2  2               5 75817           1          2  31          55
#3  3               5 45649           5          3  25          22
#4  4               1 47157           1          5  55          30
#5  5               2 69729           2          4  33          51
#6  6               1 75101           2          2  39          54

3. Let’s split the data. Here we plan to use cross-validation.

parts <- h2o.splitFrame(
  jobsat,
  ratios = 0.8,
  destination_frames=c("jobsat_train", "jobsat_test"),
  seed = 321)
train <- h2o.getFrame("jobsat_train")
test <- h2o.getFrame("jobsat_test")   
nrow(train)
# 794 rows
nrow(test)
# 206 rows

y <- "jobSatScore"
x <- setdiff(names(train), c("id", y))

4. Let’s choose the gradient boosting model (gbm), and create a model. It’s a regression model since the output variable is treated as continuous.

# the reasonable model with 10-fold cross-validation
m_res <- h2o.gbm(x, y, train,
              model_id = "model10foldsreasonable",
              ntrees = 20,
              nfolds = 10,
              seed = 123)
> h2o.performance(m_res, train = TRUE) # RMSE 2.840688
#H2ORegressionMetrics: gbm
#** Reported on training data. **

#MSE:  8.069509
#RMSE:  2.840688
#MAE:  2.266134
#RMSLE:  0.1357181
#Mean Residual Deviance :  8.069509

> h2o.performance(m_res, xval = TRUE)  # RMSE 2.973807
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

#MSE:  8.84353
#RMSE:  2.973807
#MAE:  2.320899
#RMSLE:  0.1384746
#Mean Residual Deviance :  8.84353

> h2o.performance(m_res, test)         # RMSE 3.299601
#H2ORegressionMetrics: gbm

#MSE:  10.88737
#RMSE:  3.299601
#MAE:  2.524492
#RMSLE:  0.1409274
#Mean Residual Deviance :  10.88737

5. Let’s try some alternative parameters, to build a different model, and show how the results differ.

# overfitting model with 10-fold cross-validation
m_ovf <- h2o.gbm(x, y, train,
              model_id = "model10foldsoverfitting",
              ntrees = 2000,
              max_depth = 20,
              nfolds = 10,
              seed = 123)

> h2o.performance(m_ovf, train = TRUE) # RMSE 0.004474786
#H2ORegressionMetrics: gbm
#** Reported on training data. **

#MSE:  2.002371e-05
#RMSE:  0.004474786
#MAE:  0.0007455944
#RMSLE:  5.032019e-05
#Mean Residual Deviance :  2.002371e-05

> h2o.performance(m_ovf, xval = TRUE)  # RMSE 0.6801615
#H2ORegressionMetrics: gbm
#** Reported on cross-validation data. **
#** 10-fold cross-validation on training data (Metrics computed for combined holdout predictions) **

#MSE:  0.4626197
#RMSE:  0.6801615
#MAE:  0.4820542
#RMSLE:  0.02323415
#Mean Residual Deviance :  0.4626197

> h2o.performance(m_ovf, test)         # RMSE 0.4969761
#H2ORegressionMetrics: gbm

#MSE:  0.2469853
#RMSE:  0.4969761
#MAE:  0.3749822
#RMSLE:  0.01698435
#Mean Residual Deviance :  0.2469853

Problem 2

Predict Chocolate Makers Location with Deep Learning Model with H2O

The data is available here: http://coursera.h2o.ai/cacao.882.csv

This is a classification problem. We need to predict “Maker Location”. In other words, using the rating and the other fields, how accurately can we identify whether a chocolate is Belgian, French, and so on. We shall use the python client (library) for H2O for this problem.

  1. Let’s start h2o, load the data set, and split it. By the end of this stage we should have
    three variables, pointing to three data frames on h2o: train, valid, test. However, if we choose to use
    cross-validation, we will only have two: train and test.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

df = pd.read_csv('http://coursera.h2o.ai/cacao.882.csv')
print(df.shape)
# (1795, 9)
df.head()

   Maker     Origin       REF   Review Date  Cocoa Percent  Maker Location  Rating  Bean Type  Bean Origin
0  A. Morin  Agua Grande  1876  2016         63%            France          3.75               Sao Tome
1  A. Morin  Kpime        1676  2015         70%            France          2.75               Togo
2  A. Morin  Atsane       1676  2015         70%            France          3.00               Togo
3  A. Morin  Akata        1680  2015         70%            France          3.50               Togo
4  A. Morin  Quilla       1704  2015         70%            France          3.50               Peru
print(df['Maker Location'].unique())

# ['France' 'U.S.A.' 'Fiji' 'Ecuador' 'Mexico' 'Switzerland' 'Netherlands'
# 'Spain' 'Peru' 'Canada' 'Italy' 'Brazil' 'U.K.' 'Australia' 'Wales'
# 'Belgium' 'Germany' 'Russia' 'Puerto Rico' 'Venezuela' 'Colombia' 'Japan'
# 'New Zealand' 'Costa Rica' 'South Korea' 'Amsterdam' 'Scotland'
# 'Martinique' 'Sao Tome' 'Argentina' 'Guatemala' 'South Africa' 'Bolivia'
# 'St. Lucia' 'Portugal' 'Singapore' 'Denmark' 'Vietnam' 'Grenada' 'Israel'
# 'India' 'Czech Republic' 'Domincan Republic' 'Finland' 'Madagascar'
# 'Philippines' 'Sweden' 'Poland' 'Austria' 'Honduras' 'Nicaragua'
# 'Lithuania' 'Niacragua' 'Chile' 'Ghana' 'Iceland' 'Eucador' 'Hungary'
# 'Suriname' 'Ireland']
print(len(df['Maker Location'].unique()))
# 60

loc_table = df['Maker Location'].value_counts()
print(loc_table)
#U.S.A.               764
#France               156
#Canada               125
#U.K.                  96
#Italy                 63
#Ecuador               54
#Australia             49
#Belgium               40
#Switzerland           38
#Germany               35
#Austria               26
#Spain                 25
#Colombia              23
#Hungary               22
#Venezuela             20
#Madagascar            17
#Japan                 17
#New Zealand           17
#Brazil                17
#Peru                  17
#Denmark               15
#Vietnam               11
#Scotland              10
#Guatemala             10
#Costa Rica             9
#Israel                 9
#Argentina              9
#Poland                 8
#Honduras               6
#Lithuania              6
#Sweden                 5
#Nicaragua              5
#Domincan Republic      5
#South Korea            5
#Netherlands            4
#Amsterdam              4
#Puerto Rico            4
#Fiji                   4
#Sao Tome               4
#Mexico                 4
#Ireland                4
#Portugal               3
#Singapore              3
#Iceland                3
#South Africa           3
#Grenada                3
#Chile                  2
#St. Lucia              2
#Bolivia                2
#Finland                2
#Martinique             1
#Eucador                1
#Wales                  1
#Czech Republic         1
#Suriname               1
#Ghana                  1
#India                  1
#Niacragua              1
#Philippines            1
#Russia                 1
#Name: Maker Location, dtype: int64

loc_table.hist()

As can be seen from the above table, some of the locations have too few records, which will result in poor accuracy for those categories in the model to be learnt, once the dataset is split into train, validation and test datasets. Let’s get rid of the locations that have a small number (< 40) of examples in the dataset, to make the results more easily comprehensible, by reducing the number of categories in the output variable.

## keep only the countries with at least 40 examples present in the dataset
loc_gt_40_recs = loc_table[loc_table >= 40].index.tolist()
df_sub = df[df['Maker Location'].isin(loc_gt_40_recs)]

# now connect to H2O
h2o.init() # h2o.clusterStatus()

H2O cluster uptime:        1 day 14 hours 48 mins
H2O cluster version:       3.13.0.3978
H2O cluster version age:   4 years and 9 days !!!
H2O cluster name:          H2O_started_from_R_Sandipan.Dey_kpl973
H2O cluster total nodes:   1
H2O cluster free memory:   2.530 Gb
H2O cluster total cores:   4
H2O cluster allowed cores: 4
H2O cluster status:        locked, healthy
H2O connection url:        http://localhost:54321
H2O connection proxy:      None
H2O internal security:     False
H2O API Extensions:        Algos, AutoML, Core V3, Core V4
Python version:            3.7.6 final
h2o_df = h2o.H2OFrame(df_sub.values, destination_frame = "cacao_882", 
                      column_names=[x.replace(' ', '_') for x in df.columns.tolist()])
#h2o_df.head()
#h2o_df.summary()

df_cacao_882 = h2o.get_frame('cacao_882') # df_cacao_882.as_data_frame()
#df_cacao_882.head()
df_cacao_882.describe()

         Maker     Origin          REF                 Review_Date        Cocoa_Percent  Maker_Location  Rating              Bean_Type  Bean_Origin
type     enum      enum            int                 int                enum           enum            real                enum       enum
mins                               5.0                 2006.0                                            1.0
mean                               1025.8849294729039  2012.273942093541                                3.1818856718633928
maxs                               1952.0              2017.0                                            5.0
sigma                              553.7812013716441   2.978615633185091                                0.4911459825968248
zeros                              0                   0                                                 0
missing  0         0               0                   0                  0              0               0                   0          0
0        A. Morin  Agua Grande     1876.0              2016.0             63%            France          3.75                           Sao Tome
1        A. Morin  Kpime           1676.0              2015.0             70%            France          2.75                           Togo
2        A. Morin  Atsane          1676.0              2015.0             70%            France          3.0                            Togo
3        A. Morin  Akata           1680.0              2015.0             70%            France          3.5                            Togo
4        A. Morin  Quilla          1704.0              2015.0             70%            France          3.5                            Peru
5        A. Morin  Carenero        1315.0              2014.0             70%            France          2.75                Criollo    Venezuela
6        A. Morin  Cuba            1315.0              2014.0             70%            France          3.5                            Cuba
7        A. Morin  Sur del Lago    1315.0              2014.0             70%            France          3.5                 Criollo    Venezuela
8        A. Morin  Puerto Cabello  1319.0              2014.0             70%            France          3.75                Criollo    Venezuela
9        A. Morin  Pablino         1319.0              2014.0             70%            France          4.0                            Peru
df_cacao_882['Maker_Location'].table()
#Maker_Location	Count
#Australia	 49
#Belgium	 40
#Canada	        125
#Ecuador	 54
#France	        156
#Italy	         63
#U.K.	         96
#U.S.A.	        764

train, valid, test = df_cacao_882.split_frame(ratios = [0.8, 0.1], 
                                              destination_frames = ['train', 'valid', 'test'], 
                                              seed = 321)
print("%d/%d/%d" %(train.nrows, valid.nrows, test.nrows))
# 1082/138/127

2. Let’s set x to be the list of columns we shall use to train on, and y to be the column we shall learn. Here it’s going to be a multi-class classification problem.

ignore_fields = ['Review_Date', 'Bean_Type', 'Maker_Location']
# Specify the response and predictor columns
y = 'Maker_Location' # multinomial Classification
x = [i for i in train.names if not i in ignore_fields]

3. Let’s now create a baseline deep learning model. It is recommended to use all default settings (remembering to
specify either nfolds or validation_frame) for the baseline model.

from h2o.estimators.deeplearning import H2ODeepLearningEstimator

model = H2ODeepLearningEstimator() 

%time model.train(x = x, y = y, training_frame = train, validation_frame = valid)
# deeplearning Model Build progress: |██████████████████████████████████████| 100%
# Wall time: 6.44 s

model.model_performance(train).mean_per_class_error()
# 0.05118279569892473
model.model_performance(valid).mean_per_class_error()
# 0.26888404593884047
perf_test = model.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
# Mean class error 0.2149184149184149
print('log loss', perf_test.logloss())
# log loss 0.48864148412056846
print('MSE', perf_test.mse())
# MSE 0.11940531127368789
print('RMSE', perf_test.rmse())
# RMSE 0.3455507361787671
perf_test.hit_ratio_table()
Top-8 Hit Ratios: 
k  hit_ratio
1  0.8897638
2  0.9291338
3  0.9527559
4  0.9685039
5  0.9763779
6  0.9921259
7  0.9999999
8  0.9999999
perf_test.confusion_matrix().as_data_frame()

   Australia  Belgium  Canada  Ecuador  France  Italy  U.K.  U.S.A.  Error     Rate
0  3.0        0.0      0.0     0.0      0.0     0.0    0.0   2.0     0.400000  2 / 5
1  0.0        2.0      0.0     0.0      0.0     1.0    0.0   0.0     0.333333  1 / 3
2  0.0        0.0      12.0    0.0      0.0     0.0    0.0   1.0     0.076923  1 / 13
3  0.0        0.0      0.0     3.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
4  0.0        0.0      0.0     0.0      8.0     2.0    0.0   1.0     0.272727  3 / 11
5  0.0        0.0      0.0     0.0      0.0     10.0   0.0   0.0     0.000000  0 / 10
6  0.0        0.0      0.0     1.0      0.0     2.0    4.0   4.0     0.636364  7 / 11
7  0.0        0.0      0.0     0.0      0.0     0.0    0.0   71.0    0.000000  0 / 71
8  3.0        2.0      12.0    4.0      8.0     15.0   4.0   79.0    0.110236  14 / 127
model.plot()

4. Now, let’s create a tuned model that gives superior performance. However, we should use no more than 10 times
the running time of the baseline model, so again we should time the model training.

model_tuned = H2ODeepLearningEstimator(epochs=200, 
                                       distribution="multinomial",
                                       activation="RectifierWithDropout",
                                       stopping_rounds=5, 
                                       stopping_tolerance=0, 
                                       stopping_metric="logloss",
                                       input_dropout_ratio=0.2,
                                       l1=1e-5,
                                       hidden=[200,200,200])

%time model_tuned.train(x, y, training_frame = train, validation_frame = valid)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 30.8 s

model_tuned.model_performance(train).mean_per_class_error()
#0.0
model_tuned.model_performance(valid).mean_per_class_error()
#0.07696485401964853
perf_test = model_tuned.model_performance(test)
print('Mean class error', perf_test.mean_per_class_error())
#Mean class error 0.05909090909090909
print('log loss', perf_test.logloss())
#log loss 0.14153784501504524
print('MSE', perf_test.mse())
#MSE 0.03497231075826773
print('RMSE', perf_test.rmse())
#RMSE 0.18700885208531637

perf_test.hit_ratio_table()
Top-8 Hit Ratios: 
k  hit_ratio
1  0.9606299
2  0.984252
3  0.984252
4  0.992126
5  0.992126
6  0.992126
7  1.0
8  1.0
perf_test.confusion_matrix().as_data_frame()
   Australia  Belgium  Canada  Ecuador  France  Italy  U.K.  U.S.A.  Error     Rate
0  5.0        0.0      0.0     0.0      0.0     0.0    0.0   0.0     0.000000  0 / 5
1  0.0        3.0      0.0     0.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
2  0.0        0.0      13.0    0.0      0.0     0.0    0.0   0.0     0.000000  0 / 13
3  0.0        0.0      0.0     3.0      0.0     0.0    0.0   0.0     0.000000  0 / 3
4  0.0        0.0      0.0     0.0      11.0    0.0    0.0   0.0     0.000000  0 / 11
5  0.0        0.0      0.0     0.0      1.0     8.0    0.0   1.0     0.200000  2 / 10
6  0.0        0.0      0.0     0.0      0.0     0.0    8.0   3.0     0.272727  3 / 11
7  0.0        0.0      0.0     0.0      0.0     0.0    0.0   71.0    0.000000  0 / 71
8  5.0        3.0      13.0    3.0      12.0    8.0    8.0   75.0    0.039370  5 / 127
model_tuned.plot()

As can be seen from the above plot, the early-stopping strategy stopped the model from overfitting, and the tuned model achieves better accuracy on the test dataset.

5. Let’s save both the models, to the local disk, using save_model(), to export the binary version of the model. (Do not export a POJO.)

h2o.save_model(model, 'base_model')
h2o.save_model(model_tuned, 'tuned_model')

We may want to include a seed in the model function above to get reproducible results.
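For example (a sketch, not the settings used in the runs above):

# a hypothetical re-run of the baseline with a fixed seed; for deep learning,
# reproducible=True (single-threaded, hence slower) is also needed for exact repeatability
model = H2ODeepLearningEstimator(seed=123, reproducible=True)
model.train(x=x, y=y, training_frame=train, validation_frame=valid)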

Problem 3

Predict Price of a house with Stacked Ensemble model with H2O

The data is available at http://coursera.h2o.ai/house_data.3487.csv. This is a regression problem. We have to predict the “price” of a house given different feature values. We shall use python client for H2O again for this problem.

The data needs to be split into train and test, using 0.9 for the ratio, and a seed of 123. That should give 19,462 training rows and 2,151 test rows. The target is an RMSE below $123,000.

  1. Let’s start h2o, load the chosen dataset and follow the data manipulation steps. For example, we can split date into year, month and day columns; we can then optionally combine them into a single numeric date column. At the end of this step we shall have train, test, x and y variables, and possibly valid also. The code snippet below shows how to do this.
import h2o
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import random
from time import time

h2o.init()

url = "http://coursera.h2o.ai/house_data.3487.csv"
house_df = h2o.import_file(url, destination_frame = "house_data")
# Parse progress: |█████████████████████████████████████████████████████████| 100%

Preprocessing

house_df['year'] = house_df['date'].substring(0,4).asnumeric()
house_df['month'] = house_df['date'].substring(4,6).asnumeric()
house_df['day'] = house_df['date'].substring(6,8).asnumeric()
house_df = house_df.drop('date')
house_df.head()
id           price      bedrooms  bathrooms  sqft_living  sqft_lot  floors  waterfront  view  condition  grade  sqft_above  sqft_basement  yr_built  yr_renovated  zipcode  lat      long      sqft_living15  sqft_lot15  year  month  day
7.1293e+09   221900     3         1          1180         5650      1       0           0     3          7      1180        0              1955      0             98178    47.5112  -122.257  1340           5650        2014  10     13
6.4141e+09   538000     3         2.25       2570         7242      2       0           0     3          7      2170        400            1951      1991          98125    47.721   -122.319  1690           7639        2014  12     9
5.6315e+09   180000     2         1          770          10000     1       0           0     3          6      770         0              1933      0             98028    47.7379  -122.233  2720           8062        2015  2      25
2.4872e+09   604000     4         3          1960         5000      1       0           0     5          7      1050        910            1965      0             98136    47.5208  -122.393  1360           5000        2014  12     9
1.9544e+09   510000     3         2          1680         8080      1       0           0     3          8      1680        0              1987      0             98074    47.6168  -122.045  1800           7503        2015  2      18
7.23755e+09  1.225e+06  4         4.5        5420         101930    1       0           0     3          11     3890        1530           2001      0             98053    47.6561  -122.005  4760           101930      2014  5      12
1.3214e+09   257500     3         2.25       1715         6819      2       0           0     3          7      1715        0              1995      0             98003    47.3097  -122.327  2238           6819        2014  6      27
2.008e+09    291850     3         1.5        1060         9711      1       0           0     3          7      1060        0              1963      0             98198    47.4095  -122.315  1650           9711        2015  1      15
2.4146e+09   229500     3         1          1780         7470      1       0           0     3          7      1050        730            1960      0             98146    47.5123  -122.337  1780           8113        2015  4      15
3.7935e+09   323000     3         2.5        1890         6560      2       0           0     3          7      1890        0              2003      0             98038    47.3684  -122.031  2390           7570        2015  3      12
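The optional step of combining the date parts into a single numeric column, mentioned above, could look like the following sketch (the column name datenum is an assumption; it is not used in the models below):

# optionally combine year and month into a single numeric date column,
# e.g. 2014-10 -> 201410 (column name 'datenum' is hypothetical)
house_df['datenum'] = house_df['year'] * 100 + house_df['month']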
house_df.describe()


(output not shown: describe() lists the type, mins, mean, maxs, sigma, zeros and missing statistics for each column, along with the first 10 rows of the frame)
plt.hist(house_df.as_data_frame()['price'].tolist(), bins=np.linspace(0,10**6,1000))
plt.show()

We shall use cross-validation and not a validation dataset.

train, test = house_df.split_frame(ratios=[0.9], destination_frames = ['train', 'test'], seed=123)
print("%d/%d" %(train.nrows, test.nrows))
# 19462/2151
ignore_fields = ['id', 'price'] 
x = [i for i in train.names if not i in ignore_fields]
y = 'price'

2. Let’s now train at least four different models on the preprocessed dataset, using at least three different supervised algorithms. Let’s save all the models.

from h2o.estimators.gbm import H2OGradientBoostingEstimator
from h2o.estimators.random_forest import H2ORandomForestEstimator
from h2o.estimators.glm import H2OGeneralizedLinearEstimator
from h2o.estimators.deeplearning import H2ODeepLearningEstimator
from h2o.estimators.stackedensemble import H2OStackedEnsembleEstimator

nfolds = 5 # for cross-validation

Let’s first fit a GLM model. The best performing α hyperparameter value (for controlling L1 vs. L2 regularization) for GLM will be found using GridSearch, as shown in the below code snippet.

g= h2o.grid.H2OGridSearch(
    H2OGeneralizedLinearEstimator(family="gaussian",
    nfolds=nfolds,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    lambda_search=True),
hyper_params={
    "alpha":[x * 0.01 for x in range(0,100)],
},
search_criteria={
    "strategy":"RandomDiscrete",
    "max_models":8,
    "stopping_metric": "rmse",
    "max_runtime_secs":60
}
)
g.train(x, y, train)
g

#glm Grid Build progress: |████████████████████████████████████████████████| 100%
#                     alpha  \
#0                   [0.61]   
#1                   [0.78]   
#2                   [0.65]   
#3                   [0.13]   
#4    [0.35000000000000003]   
#5                   [0.05]   
#6                   [0.32]   
#7                   [0.55]   

#                                              model_ids     residual_deviance  
#0  Grid_GLM_train_model_python_1628864392402_41_model_3  2.626981989511134E15  
#1  Grid_GLM_train_model_python_1628864392402_41_model_6  2.626981989511134E15  
#2  Grid_GLM_train_model_python_1628864392402_41_model_5  2.626981989511134E15  
#3  Grid_GLM_train_model_python_1628864392402_41_model_2  2.626981989511134E15  
#4  Grid_GLM_train_model_python_1628864392402_41_model_4  2.626981989511134E15  
#5  Grid_GLM_train_model_python_1628864392402_41_model_7  2.626981989511134E15  
#6  Grid_GLM_train_model_python_1628864392402_41_model_0  2.626981989511134E15  
#7  Grid_GLM_train_model_python_1628864392402_41_model_1  2.626981989511134E15  

Model 1

model_GLM = H2OGeneralizedLinearEstimator(
    family='gaussian', #'gamma',
    model_id='glm_house',
    nfolds=nfolds,
    alpha=0.61,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True)

%time model_GLM.train(x, y, train)
#glm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 259 ms

model_GLM.cross_validation_metrics_summary().as_data_frame()
                           mean           sd             cv_1_valid     cv_2_valid     cv_3_valid     cv_4_valid     cv_5_valid
0  mae                     230053.23      715.8795       229225.16      230969.69      228503.45      230529.47      231038.42
1  mean_residual_deviance  1.31780157E11  4.5671977E9    1.32968604E11  1.41431144E11  1.31364495E11  1.32024402E11  1.21112134E11
2  mse                     1.31780157E11  4.5671977E9    1.32968604E11  1.41431144E11  1.31364495E11  1.32024402E11  1.21112134E11
3  null_deviance           5.25455325E14  1.80834544E13  5.3056184E14   5.636807E14    5.23549568E14  5.26203388E14  4.83281095E14
4  r2                      0.023522535    4.801036E-4    0.024299357    0.023168933   0.022531934    0.023340257    0.024272196
5  residual_deviance       5.12943247E14  1.7808912E13   5.17646773E14  5.5059142E14   5.11270625E14  5.13838982E14  4.71368433E14
6  rmse                    362905.53      6314.0225      364648.6       376073.3       362442.4       363351.62      348011.7
7  rmsle                   0.53911585     0.0047404445   0.54277176     0.5389013      0.5275475      0.53846484     0.54789394
model_GLM.model_performance(test)
#ModelMetricsRegressionGLM: glm
#** Reported on test data. **

#MSE: 128806123545.59714
#RMSE: 358895.7000934911
#MAE: 233890.6933813204
#RMSLE: 0.5456714021880726
#R^2: 0.03102347771355851
#Mean Residual Deviance: 128806123545.59714
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2129
#Null deviance: 285935013037402.7
#Residual deviance: 277061971746579.44
#AIC: 61176.23965800522

As can be seen from above, GLM could not achieve the target of RMSE below $123k on either the cross-validation or the test dataset.

The below models (GBM, DRF and DL) and the corresponding parameters were found with the AutoML leaderboard and
GridSearch, along with some manual tuning.

from h2o.automl import H2OAutoML
model_auto = H2OAutoML(max_runtime_secs=60, seed=123)
model_auto.train(x, y, train)
# AutoML progress: |████████████████████████████████████████████████████████| 100%
# Parse progress: |█████████████████████████████████████████████████████████| 100%

model_auto.leaderboard
model_id                                   mean_residual_deviance  rmse    mae      rmsle
GBM_grid_0_AutoML_20210814_005121_model_0  2.01725e+10             142030  77779.1  0.184269
GBM_grid_0_AutoML_20210814_005121_model_1  2.6037e+10              161360  93068.1  0.218365
DRF_0_AutoML_20210814_005121               3.27251e+10             180901  102782   0.243474
XRT_0_AutoML_20210814_005121               3.53492e+10             188014  104259   0.246899
GBM_grid_0_AutoML_20210813_201225_model_0  5.99803e+10             244909  153548   0.351959
GBM_grid_0_AutoML_20210813_201225_model_2  6.09613e+10             246903  152570   0.349919
GBM_grid_0_AutoML_20210813_201225_model_1  6.09941e+10             246970  153096   0.350852
GBM_grid_0_AutoML_20210813_201225_model_3  6.22174e+10             249434  153105   0.350598
DeepLearning_0_AutoML_20210813_201225      6.39672e+10             252917  163993   0.378761
DRF_0_AutoML_20210813_201225               6.76936e+10             260180  158078   0.360337
model_auto.leader.model_performance(test)
# model_auto.leader.explain(test)

#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 17456681023.716145
#RMSE: 132123.73376390839
#MAE: 77000.00253466706
#RMSLE: 0.1899899418603569
#Mean Residual Deviance: 17456681023.716145

model = h2o.get_model(model_auto.leaderboard[4, 'model_id']) # get model by model_id
print(model.params['model_id']['actual']['name'])
print(model.model_performance(test).rmse())
[(k, v) for (k, v) in model.params.items() if v['default'] != v['actual'] and \
                     not k in ['model_id', 'training_frame', 'validation_frame', 'nfolds',             
                               'keep_cross_validation_predictions', 'seed', 
                               'response_column', 'fold_assignment', 'ignored_columns']]

# GBM_grid_0_AutoML_20210813_201225_model_0
# 235011.60404473927
# [('score_tree_interval', {'default': 0, 'actual': 5}),
#  ('ntrees', {'default': 50, 'actual': 60}),
#  ('max_depth', {'default': 5, 'actual': 6}),
#  ('min_rows', {'default': 10.0, 'actual': 1.0}),
#  ('stopping_tolerance', {'default': 0.001, 'actual': 0.008577452408351779}),
#  ('seed', {'default': -1, 'actual': 123}),
#  ('distribution', {'default': 'AUTO', 'actual': 'gaussian'}),
#  ('sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate', {'default': 1.0, 'actual': 0.8}),
#  ('col_sample_rate_per_tree', {'default': 1.0, 'actual': 0.8})]

Model 2

model_GBM = H2OGradientBoostingEstimator(
    model_id='gbm_house',
    nfolds=nfolds,
    ntrees=500,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)

%time model_GBM.train(x, y, train)
#gbm Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 54.9 s

model_GBM.cross_validation_metrics_summary().as_data_frame()
                           mean           sd            cv_1_valid     cv_2_valid     cv_3_valid     cv_4_valid     cv_5_valid
0  mae                     64136.496      912.2387      62751.688      66573.63       63946.31       63873.707      63537.137
1  mean_residual_deviance  1.38268457E10  1.43582912E9  1.24595825E10  1.75283814E10  1.2894718E10   1.43893801E10  1.18621655E10
2  mse                     1.38268457E10  1.43582912E9  1.24595825E10  1.75283814E10  1.2894718E10   1.43893801E10  1.18621655E10
3  r2                      0.8979097      0.0075696795  0.90857375     0.87893564     0.9040519      0.89355356     0.90443367
4  residual_deviance       1.38268457E10  1.43582912E9  1.24595825E10  1.75283814E10  1.2894718E10   1.43893801E10  1.18621655E10
5  rmse                    117288.305     5928.7188     111622.5       132394.8       113554.914     119955.74      108913.57
6  rmsle                   0.16441989     0.0025737707  0.16231671     0.17041409     0.15941188     0.16528262     0.16467415

As can be seen from the above table (the rmse row, mean column), the mean RMSE for cross-validation is 117288.305, which is below $123k.

model_GBM.model_performance(test)

#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

As can be seen from above, GBM could achieve the target of RMSE below $123k on test dataset.

Now, let’s try a random forest model, finding the best parameters with Grid Search:

g= h2o.grid.H2OGridSearch(
    H2ORandomForestEstimator(
    nfolds=nfolds,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123),
hyper_params={
    "ntrees": [20, 25, 30],
    "stopping_tolerance": [0.005, 0.006, 0.0075],
    "max_depth": [20, 50, 100],
    "min_rows": [5, 7, 10]
},
search_criteria={
    "strategy":"RandomDiscrete",
    "max_models":10,
    "stopping_metric": "rmse",
    "max_runtime_secs":60
}
)
g.train(x, y, train)
#drf Grid Build progress: |████████████████████████████████████████████████| 100%
g
#    max_depth min_rows ntrees stopping_tolerance  \
#0         100      5.0     20              0.006   
#1         100      5.0     20              0.005   
#2         100      5.0     20              0.005   
#3         100      7.0     30              0.006   
#4          50     10.0     25              0.006   
#5          50     10.0     20              0.005   

#                                              model_ids      residual_deviance  
#0  Grid_DRF_train_model_python_1628864392402_40_model_0  2.0205038467456142E10  
#1  Grid_DRF_train_model_python_1628864392402_40_model_5  2.0205038467456142E10  
#2  Grid_DRF_train_model_python_1628864392402_40_model_1  2.0205038467456142E10  
#3  Grid_DRF_train_model_python_1628864392402_40_model_3   2.099520493338354E10  
#4  Grid_DRF_train_model_python_1628864392402_40_model_2   2.260686283035833E10  
#5  Grid_DRF_train_model_python_1628864392402_40_model_4   2.279037520277947E10  

Model 3

model_RF = H2ORandomForestEstimator(
    model_id='rf_house',
    nfolds=nfolds,
    ntrees=20,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    seed=123)

%time model_RF.train(x, y, train)
#drf Model Build progress: |███████████████████████████████████████████████| 100%
#Wall time: 13.2 s

model_RF.cross_validation_metrics_summary().as_data_frame()
                           mean           sd            cv_1_valid     cv_2_valid     cv_3_valid     cv_4_valid     cv_5_valid
0  mae                     72734.0        1162.9153     73242.26       75062.21       73461.65       71646.195      70257.7
1  mean_residual_deviance  1.8545494E10   2.2018921E9   1.79095654E10  2.45911347E10  1.74433321E10  1.71117425E10  1.56716954E10
2  mse                     1.8545494E10   2.2018921E9   1.79095654E10  2.45911347E10  1.74433321E10  1.71117425E10  1.56716954E10
3  r2                      0.8632202      0.011770816   0.8685827      0.8301549      0.8702062      0.8734147      0.8737426
4  residual_deviance       1.8545494E10   2.2018921E9   1.79095654E10  2.45911347E10  1.74433321E10  1.71117425E10  1.56716954E10
5  rmse                    135742.78      7726.2373     133826.62      156815.61      132073.2       130811.86      125186.64
6  rmsle                   0.18275535     0.0020155373  0.18441868     0.18689767     0.17945778     0.1833288      0.17967385
model_RF.model_performance(test)
#ModelMetricsRegression: drf
#** Reported on test data. **

#MSE: 16405336914.530426
#RMSE: 128083.3202041953
#MAE: 71572.37981480274
#RMSLE: 0.17712324625977907
#Mean Residual Deviance: 16405336914.530426

As can be seen from above, DRF just missed the target of RMSE below $123k, both on the cross-validation and on the test dataset.

Now, let’s try to fit a deep learning model, again tuning the parameters with Grid Search.

g= h2o.grid.H2OGridSearch(
    H2ODeepLearningEstimator(
    nfolds=nfolds,
    fold_assignment="Modulo",
    keep_cross_validation_predictions=True,
    reproducible=True,
    seed=123),
hyper_params={
    "epochs": [20, 25],
    "hidden": [[20, 20, 20], [25, 25, 25]],
    "stopping_rounds": [0, 5],
    "stopping_tolerance": [0.006]
},
search_criteria={
    "strategy":"RandomDiscrete",
    "max_models":10,
    "stopping_metric": "rmse",
    "max_runtime_secs":60
}
)
g.train(x, y, train)
g
#deeplearning Grid Build progress: |███████████████████████████████████████| 100%

#                 epochs        hidden stopping_rounds stopping_tolerance  \
#0     16.79120554889533  [25, 25, 25]               0              0.006   
#1    3.1976799968879086  [25, 25, 25]               0              0.006   

#                                                       model_ids  \
#0  Grid_DeepLearning_train_model_python_1628864392402_55_model_0   
#1  Grid_DeepLearning_train_model_python_1628864392402_55_model_1   

#       residual_deviance  
#0  1.6484562934855278E10  
#1  2.1652538389322113E10 

Model 4

model_DL = H2ODeepLearningEstimator(epochs=30, 
                                       model_id='dl_house',
                                       nfolds=nfolds,
                                       stopping_rounds=7, 
                                       stopping_tolerance=0.006, 
                                       hidden=[30, 30, 30],
                                       reproducible=True,
                                       fold_assignment="Modulo",
                                       keep_cross_validation_predictions=True,
                                       seed=123
                                   )
%time model_DL.train(x, y, train)
#deeplearning Model Build progress: |██████████████████████████████████████| 100%
#Wall time: 55.7 s

model_DL.cross_validation_metrics_summary().as_data_frame()
                           mean           sd            cv_1_valid     cv_2_valid     cv_3_valid     cv_4_valid     cv_5_valid
0  mae                     72458.19       1241.8936     71992.18       73569.984      75272.75       70553.38       70902.65
1  mean_residual_deviance  1.48438886E10  5.5005555E8   1.42477005E10  1.59033723E10  1.54513889E10  1.48586271E10  1.37583514E10
2  mse                     1.48438886E10  5.5005555E8   1.42477005E10  1.59033723E10  1.54513889E10  1.48586271E10  1.37583514E10
3  r2                      0.8899759      0.0023493338  0.89545286     0.8901592      0.885028       0.89008224     0.88915724
4  residual_deviance       1.48438886E10  5.5005555E8   1.42477005E10  1.59033723E10  1.54513889E10  1.48586271E10  1.37583514E10
5  rmse                    121793.58      2259.6975     119363.734     126108.58      124303.62      121895.97      117296.0
6  rmsle                   0.18431115     0.0011469581  0.18251595     0.18650953     0.18453318     0.18555655     0.18244053

As can be seen from the above table (the rmse row, mean column), the mean RMSE for cross-validation is 121793.58, which is below $123k.

model_DL.model_performance(test)

#ModelMetricsRegression: deeplearning
#** Reported on test data. **

#MSE: 14781990070.095192
#RMSE: 121581.20771770278
#MAE: 72522.60487846025
#RMSLE: 0.1834924698171073
#Mean Residual Deviance: 14781990070.095192

As can be seen from above, the deep learning model could achieve the target of RMSE below $123k on test dataset.

3. Finally, let’s train a stacked ensemble of the models created in earlier steps. We may need to repeat steps two and three until the best model (which is usually the ensemble model, but does not have to be) has the minimum required performance on the cross-validation dataset. Note: only one model has to achieve the minimum required performance. If multiple models achieve it, we need to choose the best performing one.

models = [model_GBM.model_id, model_RF.model_id, model_DL.model_id] #model_GLM.model_id,
model_SE = H2OStackedEnsembleEstimator(model_id = 'se_gbm_dl_house', base_models=models)

%time model_SE.train(x, y, train)
#stackedensemble Model Build progress: |███████████████████████████████████| 100%
#Wall time: 2.67 s
#model_SE.model_performance(test)
#ModelMetricsRegressionGLM: stackedensemble
#** Reported on test data. **

#MSE: 130916347835.45828
#RMSE: 361823.6418967924
#MAE: 236448.3672215734
#RMSLE: 0.5514878971097109
#R^2: 0.015148783736682492
#Mean Residual Deviance: 130916347835.45828
#Null degrees of freedom: 2150
#Residual degrees of freedom: 2147
#Null deviance: 285935013037402.7
#Residual deviance: 281601064194070.75
#AIC: 61175.193832813566

As can be seen from above, the stacked ensemble model could not reach the required performance, neither on the cross-validation, nor on the test dataset.

4. Now let’s get the performance on the test data of the chosen model/ensemble, and confirm that this also reaches the minimum target on the test data.

Best Model

The model that performs best in terms of mean cross-validation RMSE and RMSE on the test dataset (both of them are below the minimum target $123k) is the gradient boosting model (GBM), which is Model 2 above.

model_GBM.model_performance(test)
#ModelMetricsRegression: gbm
#** Reported on test data. **

#MSE: 14243079402.729088
#RMSE: 119344.37315068142
#MAE: 65050.344749203745
#RMSLE: 0.16421689257411975
#Mean Residual Deviance: 14243079402.729088

# save the models
h2o.save_model(model_GBM, 'best_model (GBM)') # the final best model
h2o.save_model(model_SE, 'SE_model')
h2o.save_model(model_GBM, 'GBM_model')
h2o.save_model(model_RF, 'RF_model')
h2o.save_model(model_GLM, 'GLM_model')
h2o.save_model(model_DL, 'DL_model')

Neural Translation – Machine Translation with Neural Nets (BiLSTM) with Keras / Python

In this blog, we shall discuss how to build a neural network to translate from English to German. This problem appeared as the Capstone project for the coursera course “Tensorflow 2: Customising your model“, a part of the specialization “Tensorflow2 for Deep Learning“, by the Imperial College, London. The problem statement / description / steps are taken from the course itself. We shall use the concepts from the course, including building more flexible model architectures, freezing layers, data processing pipeline and sequence modelling.

Image taken from the Capstone project

Here we shall use a language dataset from http://www.manythings.org/anki/ to build a neural translation model. This dataset consists of over 200k pairs of sentences in English and German. In order to make the training quicker, we will restrict our dataset to 20k pairs. The below figure shows a few sentence pairs taken from the file.

Our goal is to develop a neural translation model from English to German, making use of a pre-trained English word embedding module.

1. Text preprocessing

We need to start with preprocessing the above input file. Here are the steps that we need to follow:

  • First let’s create separate lists of English and German sentences.
  • Add a special “<start>” and “<end>” token to the beginning and end of every German sentence.
  • Use the Tokenizer class from the tf.keras.preprocessing.text module to tokenize the German sentences, ensuring that no character filters are applied (a minimal sketch of these steps is shown right after this list).
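The following is a minimal sketch of the preprocessing steps above; the file name deu.txt and the variable names (pairs, english_sentences, german_sentences) are assumptions for illustration, not taken from the original notebook.

import tensorflow as tf

# read the first 20k tab-separated (English, German) sentence pairs
# ('deu.txt' is a hypothetical path to the manythings.org/anki file)
with open('deu.txt', 'r', encoding='utf-8') as f:
    pairs = [line.strip().split('\t')[:2] for line in f][:20000]
english_sentences = [en for en, de in pairs]
german_sentences = ['<start> ' + de + ' <end>' for en, de in pairs]

# filters='' ensures that no character filters are applied during tokenization
tokenizer = tf.keras.preprocessing.text.Tokenizer(filters='')
tokenizer.fit_on_texts(german_sentences)
tokenized_german_sentences = tokenizer.texts_to_sequences(german_sentences)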

The next figure shows 5 randomly chosen examples of (preprocessed) English and German sentence pairs. For the German sentence, the text (with start and end tokens) as well as the tokenized sequence are shown.

  • Pad the end of the tokenized German sequences with zeros, and batch the complete set of sequences into a single numpy array, using the following code snippet.

padded_tokenized_german_sentences = tf.keras.preprocessing.sequence.pad_sequences(tokenized_german_sentences, 
                                                                                  maxlen=14, padding='post', value=0) 
padded_tokenized_german_sentences.shape
#(20000, 14)

As can be seen from the next code block, the maximum length of a German sentence is 14, whereas there are 5743 unique words in the German sentences from the subset of the corpus. The index of the <start> token is 1.

max([len(tokenized_german_sentences[i]) for i in range(20000)])
# 14
len(tokenizer.index_word)
# 5743
tokenizer.word_index['<start>']
# 1

2. Preparing the data with tf.data.Dataset

Loading the embedding layer

As part of the dataset preprocessing for this project we shall use a pre-trained English word-embedding module from TensorFlow Hub. The URL for the module is https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1.

This embedding takes a batch of text tokens in a 1-D tensor of strings as input. It then embeds the separate tokens into a 128-dimensional space.

Although this model can also be used as a sentence embedding module (e.g., where the module will process each token by removing punctuation and splitting on spaces, and then average the word embeddings over a sentence to give a single embedding vector), we will use it only as a word embedding module here, and will pass each word in the input sentence as a separate token.
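The module can be loaded with the tensorflow_hub library, e.g. as in the following sketch (the loading options shown are typical for this module, but are an assumption here):

import tensorflow as tf
import tensorflow_hub as hub

# load the pre-trained NNLM word embedding module as a Keras layer; each input
# token (a string) is mapped to a 128-dimensional embedding vector
embedding_layer = hub.KerasLayer(
    "https://tfhub.dev/google/tf2-preview/nnlm-en-dim128-with-normalization/1",
    output_shape=[128], input_shape=[], dtype=tf.string)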

The following code snippet shows how an English sentence with 7 words is mapped into a 7×128 tensor in the embedding space.


embedding_layer(tf.constant(["these", "aren't", "the", "droids", "you're", "looking", "for"])).shape
# TensorShape([7, 128])

Now, let’s prepare the training and validation Datasets as follows:

  • Create a random training and validation set split of the data, reserving e.g. 20% of the data for validation (each English dataset example is a single sentence string, and each German dataset example is a sequence of padded integer tokens).
  • Load the training and validation sets into a tf.data.Dataset object, passing in a tuple of English and German data for both training and validation sets, using the following code snippet.

def make_Dataset(input_array, target_array):
    return tf.data.Dataset.from_tensor_slices((input_array, target_array)) 

train_data = make_Dataset(input_train, target_train)
valid_data = make_Dataset(input_valid, target_valid)
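Note that the split producing input_train, input_valid, target_train and target_valid (the first bullet above) is not shown; below is a minimal sketch using scikit-learn (an assumption; any random splitter would do, and english_sentences refers to the list from the preprocessing sketch earlier).

import numpy as np
from sklearn.model_selection import train_test_split

# reserve 20% of the (English sentence, padded German token sequence) pairs for validation
input_train, input_valid, target_train, target_valid = train_test_split(
    np.array(english_sentences), padded_tokenized_german_sentences,
    test_size=0.2, random_state=42)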

  • Create a function to map over the datasets that splits each English sentence at spaces. Apply this function to both Dataset objects using the map method, using the following code snippet.

def str_split(e, g):
    e = tf.strings.split(e)
    return e, g

train_data = train_data.map(str_split)
valid_data = valid_data.map(str_split)

  • Create a function to map over the datasets that embeds each sequence of English words using the loaded embedding layer/model. Apply this function to both Dataset objects using the map method, using the following code snippet.

def embed_english(x, y):
    return embedding_layer(x), y

train_data = train_data.map(embed_english)
valid_data = valid_data.map(embed_english)

  • Create a function to filter out dataset examples where the English sentence is more than 13 (embedded) tokens in length. Apply this function to both Dataset objects using the filter method, using the following code snippet.

def remove_long_sentence(e, g):
    return tf.shape(e)[0] <= 13

train_data = train_data.filter(remove_long_sentence)
valid_data = valid_data.filter(remove_long_sentence)

  • Create a function to map over the datasets that pads each English sequence of embeddings with some distinct padding value before the sequence, so that each sequence is length 13. Apply this function to both Dataset objects using the map method, as shown in the next code block. 

def pad_english(e, g):
    return tf.pad(e, paddings = [[13-tf.shape(e)[0],0], [0,0]], mode='CONSTANT', constant_values=0), g

train_data = train_data.map(pad_english)
valid_data = valid_data.map(pad_english)

  • Batch both training and validation Datasets with a batch size of 16.

train_data = train_data.batch(16)
valid_data = valid_data.batch(16)

  • Let’s now print the element_spec property for the training and validation Datasets. Also, let’s print the shape of an English data example from the training Dataset and a German data example Tensor from the validation Dataset.

train_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
# TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

valid_data.element_spec
#(TensorSpec(shape=(None, None, 128), dtype=tf.float32, name=None),
 #TensorSpec(shape=(None, 14), dtype=tf.int32, name=None))

for e, g in train_data.take(1):
    print(e.shape)
#(16, 13, 128)

for e, g in valid_data.take(1):
    print(g)
#tf.Tensor(
#[[   1   11  152    6  458    3    2    0    0    0    0    0    0    0]
# [   1   11  333  429    3    2    0    0    0    0    0    0    0    0]
# [   1   11   59   12    3    2    0    0    0    0    0    0    0    0]
# [   1  990   25   42  444    7    2    0    0    0    0    0    0    0]
# [   1    4   85 1365    3    2    0    0    0    0    0    0    0    0]
# [   1  131    8   22    5  583    3    2    0    0    0    0    0    0]
# [   1    4   85 1401    3    2    0    0    0    0    0    0    0    0]
# [   1   17  381   80    3    2    0    0    0    0    0    0    0    0]
# [   1 2998   13   33    7    2    0    0    0    0    0    0    0    0]
# [   1  242    6  479    3    2    0    0    0    0    0    0    0    0]
# [   1   35   17   40    7    2    0    0    0    0    0    0    0    0]
# [   1   11   30  305   46   47 1913  471    3    2    0    0    0    0]
# [   1    5   48 1184    3    2    0    0    0    0    0    0    0    0]
# [   1    5  287   12  834 5268    3    2    0    0    0    0    0    0]
# [   1    5    6  523    3    2    0    0    0    0    0    0    0    0]
# [   1   13  109   28   29   44  491    3    2    0    0    0    0    0]], shape=(16, 14), dtype=int32)

The custom translation model

The following is a schematic of the custom translation model architecture we shall develop now.

Image taken from the Capstone project

The custom model consists of an encoder RNN and a decoder RNN. The encoder takes words of an English sentence as input, and uses a pre-trained word embedding to embed the words into a 128-dimensional space. To indicate the end of the input sentence, a special end token (in the same 128-dimensional space) is passed in as an input. This token is a TensorFlow Variable that is learned in the training phase (unlike the pre-trained word embedding, which is frozen).

The decoder RNN takes the internal state of the encoder network as its initial state. A start token is passed in as the first input, which is embedded using a learned German word embedding. The decoder RNN then makes a prediction for the next German word, which during inference is then passed in as the following input, and this process is repeated until the special <end> token is emitted from the decoder.

Create the custom layer

Let’s create a custom layer to add the learned end token embedding to the encoder model:

Image taken from the capstone project

Now let’s first build the custom layer, which will be later used to create the encoder.

  • Using layer subclassing, create a custom layer that takes a batch of English data examples from one of the Datasets, and adds a learned embedded ‘end’ token to the end of each sequence.
  • This layer should create a TensorFlow Variable (that will be learned during training) that is 128-dimensional (the size of the embedding space).

from tensorflow.keras.models import  Sequential, Model
from tensorflow.keras.layers import Layer, Concatenate, Input, Masking, LSTM, Embedding, Dense
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.losses import SparseCategoricalCrossentropy

class CustomLayer(Layer):

    def __init__(self, **kwargs):
        super(CustomLayer, self).__init__(**kwargs)
        self.embed = tf.Variable(initial_value=tf.zeros(shape=(1,128)), trainable=True, dtype='float32')
        
    def call(self, inputs):
        x = tf.tile(self.embed, [tf.shape(inputs)[0], 1])
        x = tf.expand_dims(x, axis=1)
        return tf.concat([inputs, x], axis=1)
        #return Concatenate(axis=1)([inputs, x])

  • Let’s extract a batch of English data examples from the training Dataset and print the shape. Test the custom layer by calling the layer on the English data batch Tensor and print the resulting Tensor shape (the layer should increase the sequence length by one).

custom_layer = CustomLayer()
e, g = next(iter(train_data.take(1)))
print(e.shape)
# (16, 13, 128)
o = custom_layer(e)
o.shape
# TensorShape([16, 14, 128])

Build the encoder network

The encoder network follows the schematic diagram above. Now let’s build the RNN encoder model.

  • Using the keras functional API, build the encoder network according to the following spec:
    • The model will take a batch of sequences of embedded English words as input, as given by the Dataset objects.
    • The next layer in the encoder will be the custom layer you created previously, to add a learned end token embedding to the end of the English sequence.
    • This is followed by a Masking layer, with the mask_value set to the distinct padding value you used when you padded the English sequences with the Dataset preprocessing above.
    • The final layer is an LSTM layer with 512 units, which also returns the hidden and cell states.
    • The encoder is a multi-output model. There should be two output Tensors of this model: the hidden state and cell states of the LSTM layer. The output of the LSTM layer is unused.

inputs = Input(batch_shape = (None, 13, 128), name='input')
x = CustomLayer(name='custom_layer')(inputs)
x = Masking(mask_value=0, name='masking_layer')(x)
x, h, c = LSTM(units=512, return_state=True, name='lstm')(x)
encoder_model = Model(inputs = inputs, outputs = [h, c], name='encoder')

encoder_model.summary()

# Model: "encoder"
# _________________________________________________________________
# Layer (type)                 Output Shape              Param #   
# =================================================================
# input (InputLayer)           [(None, 13, 128)]         0         
# _________________________________________________________________
# custom_layer (CustomLayer)   (None, 14, 128)           128       
# _________________________________________________________________
# masking_layer (Masking)      (None, 14, 128)           0         
# _________________________________________________________________
# lstm (LSTM)                  [(None, 512), (None, 512) 1312768   
# =================================================================
# Total params: 1,312,896
# Trainable params: 1,312,896
# Non-trainable params: 0
# _________________________________________________________________

Build the decoder network

The decoder network follows the schematic diagram below.

image taken from the capstone project

Now let’s build the RNN decoder model.

  • Using Model subclassing, build the decoder network according to the following spec:
    • The initializer should create the following layers:
      • An Embedding layer with vocabulary size set to the number of unique German tokens, embedding dimension 128, and set to mask zero values in the input.
      • An LSTM layer with 512 units, that returns its hidden and cell states, and also returns sequences.
      • A Dense layer with number of units equal to the number of unique German tokens, and no activation function.
    • The call method should include the usual inputs argument, as well as the additional keyword arguments hidden_state and cell_state. The default value for these keyword arguments should be None.
    • The call method should pass the inputs through the Embedding layer, and then through the LSTM layer. If the hidden_state and cell_state arguments are provided, these should be used for the initial state of the LSTM layer. 
    • The call method should pass the LSTM output sequence through the Dense layer, and return the resulting Tensor, along with the hidden and cell states of the LSTM layer.

class Decoder(Model):
    
    def __init__(self, **kwargs):
        super(Decoder, self).__init__(**kwargs)
        self.embed = Embedding(input_dim=len(tokenizer.index_word)+1, output_dim=128, mask_zero=True, name='embedding_layer')
        self.lstm = LSTM(units = 512, return_state = True, return_sequences = True, name='lstm_layer')
        self.dense = Dense(len(tokenizer.index_word)+1, name='dense_layer')
        
    def call(self, inputs, hidden_state = None, cell_state = None):
        x = self.embed(inputs)
        # use the encoder's final states to initialize the LSTM, when provided
        if hidden_state is not None and cell_state is not None:
            x, hidden_state, cell_state = self.lstm(x, initial_state = [hidden_state, cell_state])
        else:
            x, hidden_state, cell_state = self.lstm(x)
        x = self.dense(x)
        return x, hidden_state, cell_state

decoder_model = Decoder(name='decoder')
e, g_in = next(iter(train_data.take(1)))
h, c = encoder_model(e)
g_out, h, c = decoder_model(g_in, h, c)

print(g_out.shape, h.shape, c.shape)
# (16, 14, 5744) (16, 512) (16, 512)

decoder_model.summary()

#Model: "decoder"
#_________________________________________________________________
#Layer (type)                 Output Shape              Param #   
#=================================================================
#embedding_layer (Embedding)  multiple                  735232    
#_________________________________________________________________
#lstm_layer (LSTM)            multiple                  1312768   
#_________________________________________________________________
#dense_layer (Dense)          multiple                  2946672   
#=================================================================
#Total params: 4,994,672
#Trainable params: 4,994,672
#Non-trainable params: 0

Create a custom training loop

Let's now write a custom training loop to train the custom neural translation model.

  • Define a function that takes a Tensor batch of German data (as extracted from the training Dataset), and returns a tuple containing German inputs and outputs for the decoder model (refer to the schematic diagram above; a minimal sketch of this helper is given after this list).
  • Define a function that computes the forward and backward pass for your translation model. This function should take an English input, German input and German output as arguments, and should do the following:
    • Pass the English input into the encoder, to get the hidden and cell states of the encoder LSTM.
    • These hidden and cell states are then passed into the decoder, along with the German inputs, which returns a sequence of outputs (the hidden and cell state outputs of the decoder LSTM are unused in this function).
    • The loss should then be computed between the decoder outputs and the German output function argument.
    • The function returns the loss and gradients with respect to the encoder and decoder’s trainable variables.
    • Decorate the function with @tf.function
  • Define and run a custom training loop for a number of epochs (for you to choose) that does the following:
    • Iterates through the training dataset, and creates decoder inputs and outputs from the German sequences.
    • Updates the parameters of the translation model using the gradients of the function above and an optimizer object.
    • Every epoch, compute the validation loss on a number of batches from the validation dataset, and save the epoch training and validation losses.
  • Plot the learning curves for loss vs epoch for both training and validation sets.
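The get_german_decoder_data helper used in the training loop below is not defined in the post; the following is a minimal sketch, assuming each German sequence in a batch is a padded row of token IDs that begins with the "<start>" token and ends with the "<end>" token:

def get_german_decoder_data(g):
    # decoder inputs: all tokens but the last; decoder outputs: all tokens but the
    # first, so the decoder is trained to predict the next German token at each step
    g_in = g[:, :-1]
    g_out = g[:, 1:]
    return g_in, g_out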

@tf.function
def forward_backward(encoder_model, decoder_model, e, g_in, g_out, loss):
    with tf.GradientTape() as tape:
        # forward pass: the encoder's final states initialize the decoder
        h, c = encoder_model(e)
        d_g_out, _, _ = decoder_model(g_in, h, c)
        cur_loss = loss(g_out, d_g_out)
    # backward pass: gradients w.r.t. both the encoder's and the decoder's variables
    grads = tape.gradient(cur_loss, encoder_model.trainable_variables + decoder_model.trainable_variables)
    return cur_loss, grads

def train_encoder_decoder(encoder_model, decoder_model, num_epochs, train_data, valid_data, valid_steps, 
                          optimizer, loss, grad_fn):
    train_losses = []
    val_losses = []
    for epoch in range(num_epochs):
        train_epoch_loss_avg = tf.keras.metrics.Mean()
        val_epoch_loss_avg = tf.keras.metrics.Mean()
        for e, g in train_data:
            g_in, g_out = get_german_decoder_data(g)
            train_loss, grads = grad_fn(encoder_model, decoder_model, e, g_in, g_out, loss)
            optimizer.apply_gradients(zip(grads, encoder_model.trainable_variables + decoder_model.trainable_variables))
            train_epoch_loss_avg.update_state(train_loss)    
        for e_v, g_v in valid_data.take(valid_steps):
            g_v_in, g_v_out = get_german_decoder_data(g_v)
            val_loss, _ = grad_fn(encoder_model, decoder_model, e_v, g_v_in, g_v_out, loss)
            val_epoch_loss_avg.update_state(val_loss)        
        print(f'epoch: {epoch}, train loss: {train_epoch_loss_avg.result()}, validation loss: {val_epoch_loss_avg.result()}')    
        train_losses.append(train_epoch_loss_avg.result())
        val_losses.append(val_epoch_loss_avg.result())
    return train_losses, val_losses

optimizer_obj = Adam(learning_rate = 1e-3)
loss_obj = SparseCategoricalCrossentropy(from_logits=True)
train_loss_results, valid_loss_results = train_encoder_decoder(encoder_model, decoder_model, 20, train_data, valid_data, 20,
                                                          optimizer_obj, loss_obj, forward_backward)

#epoch: 0, train loss: 4.4570465087890625, validation loss: 4.1102800369262695
#epoch: 1, train loss: 3.540217399597168, validation loss: 3.36271333694458
#epoch: 2, train loss: 2.756622076034546, validation loss: 2.7144060134887695
#epoch: 3, train loss: 2.049957275390625, validation loss: 2.1480133533477783
#epoch: 4, train loss: 1.4586931467056274, validation loss: 1.7304519414901733
#epoch: 5, train loss: 1.0423369407653809, validation loss: 1.4607685804367065
#epoch: 6, train loss: 0.7781839370727539, validation loss: 1.314332127571106
#epoch: 7, train loss: 0.6160411238670349, validation loss: 1.2391613721847534
#epoch: 8, train loss: 0.5013922452926636, validation loss: 1.1840368509292603
#epoch: 9, train loss: 0.424654096364975, validation loss: 1.1716119050979614
#epoch: 10, train loss: 0.37027251720428467, validation loss: 1.1612160205841064
#epoch: 11, train loss: 0.3173922598361969, validation loss: 1.1330692768096924
#epoch: 12, train loss: 0.2803193926811218, validation loss: 1.1394184827804565
#epoch: 13, train loss: 0.24854864180088043, validation loss: 1.1354353427886963
#epoch: 14, train loss: 0.22135266661643982, validation loss: 1.1059410572052002
#epoch: 15, train loss: 0.2019050121307373, validation loss: 1.1111358404159546
#epoch: 16, train loss: 0.1840481162071228, validation loss: 1.1081823110580444
#epoch: 17, train loss: 0.17126116156578064, validation loss: 1.125329852104187
#epoch: 18, train loss: 0.15828527510166168, validation loss: 1.0979799032211304
#epoch: 19, train loss: 0.14451280236244202, validation loss: 1.0899451971054077

import matplotlib.pyplot as plt
plt.figure(figsize=(10,6))
plt.xlabel("Epochs", fontsize=14)
plt.ylabel("Loss", fontsize=14)
plt.title('Loss vs epochs')
plt.plot(train_loss_results, label='train')
plt.plot(valid_loss_results, label='valid')
plt.legend()
plt.show()

The following figure shows how the training and validation loss decrease with epochs (the model is trained for 20 epochs).

Use the model to translate

Now it’s time to put the model into practice! Let’s run the translation for five randomly sampled English sentences from the dataset. For each sentence, the process is as follows:

  • Preprocess and embed the English sentence according to the model requirements.
  • Pass the embedded sentence through the encoder to get the encoder hidden and cell states.
  • Starting with the special "<start>" token, use this token and the final encoder hidden and cell states to get the one-step prediction from the decoder, as well as the decoder’s updated hidden and cell states.
  • Create a loop to get the next step prediction and updated hidden and cell states from the decoder, using the most recent hidden and cell states. Terminate the loop when the "<end>" token is emitted, or when the sentence has reached a maximum length.
  • Decode the output token sequence into German text and print the English text and the model’s German translation.


indices = np.random.choice(len(english_sentences), 5)
test_data = tf.data.Dataset.from_tensor_slices(np.array([english_sentences[i] for i in indices]))
test_data = test_data.map(tf.strings.split)
test_data = test_data.map(embedding_layer)
test_data = test_data.filter(lambda x: tf.shape(x)[0] <= 13)
test_data = test_data.map(lambda x: tf.pad(x, paddings = [[13-tf.shape(x)[0],0], [0,0]], mode='CONSTANT', constant_values=0))
print(test_data.element_spec)
# TensorSpec(shape=(None, 128), dtype=tf.float32, name=None)

start_token = np.array(tokenizer.texts_to_sequences(['<start>']))
end_token = np.array(tokenizer.texts_to_sequences(['<end>']))
for e, i in zip(test_data.take(5), indices):
    h, c = encoder_model(tf.expand_dims(e, axis=0))
    g_t = ['<start>']
    g_in = start_token
    g_out, h, c = decoder_model(g_in, h, c)
    g_out = tf.argmax(g_out, axis=2)
    # keep emitting one token at a time until the '<end>' token is predicted,
    # or a maximum output length is reached (the cap of 20 tokens is an assumption)
    while g_out != end_token and len(g_t) < 20: 
        g_out, h, c = decoder_model(g_in, h, c)
        g_out = tf.argmax(g_out, axis=2)
        g_in = g_out
        g_t.append(tokenizer.index_word.get(tf.squeeze(g_out).numpy(), 'UNK'))
    print(f'English Text: {english_sentences[i]}')
    print(f'German Translation: {" ".join(g_t)}')
    print()
    print()

# English Text: i'll see tom .
# German Translation: <start> ich werde tom folgen . <end>

# English Text: you're not alone .
# German Translation: <start> keine nicht allein . <end>

# English Text: what a hypocrite !
# German Translation: <start> fuer ein idiot ! <end>

# English Text: he kept talking .
# German Translation: <start> sie hat ihn erwuergt . <end>

# English Text: tom's in charge .
# German Translation: <start> tom ist im bett . <end>

The above output shows the sample English sentences and their German translations predicted by the model.

The following animation (click and open in a new tab) shows how the predicted German translation improves (as the loss decreases) for a few sample English sentences, as the deep learning model is trained for more and more epochs.

NLP with Bangla: semantic similarity with word2vec, Deep learning (RNN) to generate Bangla song-like texts and to do sentiment analysis on astrological prediction dataset, creating a simple Bangla ChatBot using RASA NLU with Python

In this blog, we shall discuss a few NLP techniques for the Bangla language. We shall start with a demonstration of how to train a word2vec model on the Bangla wiki corpus with tensorflow, and how to visualize the semantic similarity between words using t-SNE. Next, we shall demonstrate how to train a character / word LSTM on selected Tagore songs to generate Tagore-like songs with keras. Next, we shall create a sentiment analysis dataset by crawling the daily astrological prediction pages of a leading Bangla newspaper and manually labeling the sentiment of each of the predictions corresponding to each moon-sign. We shall then train an LSTM sentiment analysis model to predict the sentiment of a moon-sign prediction. Finally, we shall use RASA NLU (natural language understanding) to build a very simple chatbot in Bangla.

Word2vec model with Bangla wiki corpus with tensorflow

  • Let’s start by importing the required libraries
import collections
import math, random
import numpy as np
import tensorflow as tf
from matplotlib import pylab
  • Download the Bangla wikipedia corpus from Kaggle. The first few lines from the corpus are shown below:

id,text,title,url

1528,

“রবীন্দ্রনাথ ঠাকুর”

রবীন্দ্রনাথ ঠাকুর (৭ই মে, ১৮৬১ – ৭ই আগস্ট, ১৯৪১) (২৫ বৈশাখ, ১২৬৮ – ২২ শ্রাবণ, ১৩৪৮ বঙ্গাব্দ) ছিলেন অগ্রণী বাঙালি কবি, ঔপন্যাসিক, সংগীতস্রষ্টা, নাট্যকার, চিত্রকর, ছোটগল্পকার, প্রাবন্ধিক, অভিনেতা, কণ্ঠশিল্পী ও দার্শনিক। তাঁকে বাংলা ভাষার সর্বশ্রেষ্ঠ সাহিত্যিক মনে করা হয়। রবীন্দ্রনাথকে গুরুদেব, কবিগুরু ও বিশ্বকবি অভিধায় ভূষিত করা হয়। রবীন্দ্রনাথের ৫২টি কাব্যগ্রন্থ, ৩৮টি নাটক, ১৩টি উপন্যাস ও ৩৬টি প্রবন্ধ ও অন্যান্য গদ্যসংকলন তাঁর জীবদ্দশায় বা মৃত্যুর অব্যবহিত পরে প্রকাশিত হয়। তাঁর সর্বমোট ৯৫টি ছোটগল্প ও ১৯১৫টি গান যথাক্রমে “”গল্পগুচ্ছ”” ও “”গীতবিতান”” সংকলনের অন্তর্ভুক্ত হয়েছে। রবীন্দ্রনাথের যাবতীয় প্রকাশিত ও গ্রন্থাকারে অপ্রকাশিত রচনা ৩২ খণ্ডে “”রবীন্দ্র রচনাবলী”” নামে প্রকাশিত হয়েছে। রবীন্দ্রনাথের যাবতীয় পত্রসাহিত্য উনিশ খণ্ডে “”চিঠিপত্র”” ও চারটি পৃথক গ্রন্থে প্রকাশিত। এছাড়া তিনি প্রায় দুই হাজার ছবি এঁকেছিলেন। রবীন্দ্রনাথের রচনা বিশ্বের বিভিন্ন ভাষায় অনূদিত হয়েছে। ১৯১৩ সালে “”গীতাঞ্জলি”” কাব্যগ্রন্থের ইংরেজি অনুবাদের জন্য তিনি সাহিত্যে নোবেল পুরস্কার লাভ করেন।রবীন্দ্রনাথ ঠাকুর কলকাতার এক ধনাঢ্য ও সংস্কৃতিবান ব্রাহ্ম পিরালী ব্রাহ্মণ পরিবারে জন্মগ্রহণ করেন। বাল্যকালে প্রথাগত বিদ্যালয়-শিক্ষা তিনি গ্রহণ করেননি; গৃহশিক্ষক রেখে বাড়িতেই তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। আট বছর বয়সে তিনি কবিতা লেখা শুরু করেন। ১৮৭৪ সালে “”তত্ত্ববোধিনী পত্রিকা””-এ তাঁর “””” কবিতাটি প্রকাশিত হয়। এটিই ছিল তাঁর প্রথম প্রকাশিত রচনা। ১৮৭৮ সালে মাত্র সতেরো বছর বয়সে রবীন্দ্রনাথ প্রথমবার ইংল্যান্ডে যান। ১৮৮৩ সালে মৃণালিনী দেবীর সঙ্গে তাঁর বিবাহ হয়। ১৮৯০ সাল থেকে রবীন্দ্রনাথ পূর্ববঙ্গের শিলাইদহের জমিদারি এস্টেটে বসবাস শুরু করেন। ১৯০১ সালে তিনি পশ্চিমবঙ্গের শান্তিনিকেতনে ব্রহ্মচর্যাশ্রম প্রতিষ্ঠা করেন এবং সেখানেই পাকাপাকিভাবে বসবাস শুরু করেন। ১৯০২ সালে তাঁর পত্নীবিয়োগ হয়। ১৯০৫ সালে তিনি বঙ্গভঙ্গ-বিরোধী আন্দোলনে জড়িয়ে পড়েন। ১৯১৫ সালে ব্রিটিশ সরকার তাঁকে নাইট উপাধিতে ভূষিত করেন। কিন্তু ১৯১৯ সালে জালিয়ানওয়ালাবাগ হত্যাকাণ্ডের প্রতিবাদে তিনি সেই উপাধি ত্যাগ করেন। ১৯২১ সালে গ্রামোন্নয়নের জন্য তিনি শ্রীনিকেতন নামে একটি সংস্থা প্রতিষ্ঠা করেন। ১৯২৩ সালে আনুষ্ঠানিকভাবে বিশ্বভারতী প্রতিষ্ঠিত হয়। দীর্ঘজীবনে তিনি বহুবার বিদেশ ভ্রমণ করেন এবং সমগ্র বিশ্বে বিশ্বভ্রাতৃত্বের বাণী প্রচার করেন। ১৯৪১ সালে দীর্ঘ রোগভোগের পর কলকাতার পৈত্রিক বাসভবনেই তাঁর মৃত্যু হয়।রবীন্দ্রনাথের কাব্যসাহিত্যের বৈশিষ্ট্য ভাবগভীরতা, গীতিধর্মিতা চিত্ররূপময়তা, অধ্যাত্মচেতনা, ঐতিহ্যপ্রীতি, প্রকৃতিপ্রেম, মানবপ্রেম, স্বদেশপ্রেম, বিশ্বপ্রেম, রোম্যান্টিক সৌন্দর্যচেতনা, ভাব, ভাষা, ছন্দ ও আঙ্গিকের বৈচিত্র্য, বাস্তবচেতনা ও প্রগতিচেতনা। রবীন্দ্রনাথের গদ্যভাষাও কাব্যিক। ভারতের ধ্রুপদি ও লৌকিক সংস্কৃতি এবং পাশ্চাত্য বিজ্ঞানচেতনা ও শিল্পদর্শন তাঁর রচনায় গভীর প্রভাব বিস্তার করেছিল। কথাসাহিত্য ও প্রবন্ধের মাধ্যমে তিনি সমাজ, রাজনীতি ও রাষ্ট্রনীতি সম্পর্কে নিজ মতামত প্রকাশ করেছিলেন। সমাজকল্যাণের উপায় হিসেবে তিনি গ্রামোন্নয়ন ও গ্রামের দরিদ্র মানুষ কে শিক্ষিত করে তোলার পক্ষে মতপ্রকাশ করেন। এর পাশাপাশি সামাজিক ভেদাভেদ, অস্পৃশ্যতা, ধর্মীয় গোঁড়ামি ও ধর্মান্ধতার বিরুদ্ধেও তিনি তীব্র প্রতিবাদ জানিয়েছিলেন। রবীন্দ্রনাথের দর্শনচেতনায় ঈশ্বরের মূল হিসেবে মানব সংসারকেই নির্দিষ্ট করা হয়েছে; রবীন্দ্রনাথ দেববিগ্রহের পরিবর্তে কর্মী অর্থাৎ মানুষ ঈশ্বরের পূজার কথা বলেছিলেন। সংগীত ও নৃত্যকে তিনি শিক্ষার অপরিহার্য অঙ্গ মনে করতেন। রবীন্দ্রনাথের গান তাঁর অন্যতম শ্রেষ্ঠ কীর্তি। তাঁর রচিত “”আমার সোনার বাংলা”” ও “”জনগণমন-অধিনায়ক জয় হে”” গানদুটি যথাক্রমে গণপ্রজাতন্ত্রী বাংলাদেশ ও ভারতীয় প্রজাতন্ত্রের জাতীয় সংগীত।

জীবন.

প্রথম জীবন (১৮৬১–১৯০১).

শৈশব ও কৈশোর (১৮৬১ – ১৮৭৮).
রবীন্দ্রনাথ ঠাকুর কলকাতার জোড়াসাঁকো ঠাকুরবাড়িতে জন্মগ্রহণ করেছিলেন। তাঁর পিতা ছিলেন ব্রাহ্ম ধর্মগুরু দেবেন্দ্রনাথ ঠাকুর (১৮১৭–১৯০৫) এবং মাতা ছিলেন সারদাসুন্দরী দেবী (১৮২৬–১৮৭৫)। রবীন্দ্রনাথ ছিলেন পিতামাতার চতুর্দশ সন্তান। জোড়াসাঁকোর ঠাকুর পরিবার ছিল ব্রাহ্ম আদিধর্ম মতবাদের প্রবক্তা। রবীন্দ্রনাথের পূর্ব পুরুষেরা খুলনা জেলার রূপসা উপজেলা পিঠাভোগে বাস করতেন। ১৮৭৫ সালে মাত্র চোদ্দ বছর বয়সে রবীন্দ্রনাথের মাতৃবিয়োগ ঘটে। পিতা দেবেন্দ্রনাথ দেশভ্রমণের নেশায় বছরের অধিকাংশ সময় কলকাতার বাইরে অতিবাহিত করতেন। তাই ধনাঢ্য পরিবারের সন্তান হয়েও রবীন্দ্রনাথের ছেলেবেলা কেটেছিল ভৃত্যদের অনুশাসনে। শৈশবে রবীন্দ্রনাথ কলকাতার ওরিয়েন্টাল সেমিনারি, নর্ম্যাল স্কুল, বেঙ্গল অ্যাকাডেমি এবং সেন্ট জেভিয়ার্স কলেজিয়েট স্কুলে কিছুদিন করে পড়াশোনা করেছিলেন। কিন্তু বিদ্যালয়-শিক্ষায় অনাগ্রহী হওয়ায় বাড়িতেই গৃহশিক্ষক রেখে তাঁর শিক্ষার ব্যবস্থা করা হয়েছিল। ছেলেবেলায় জোড়াসাঁকোর বাড়িতে অথবা বোলপুর ও পানিহাটির বাগানবাড়িতে প্রাকৃতিক পরিবেশের মধ্যে ঘুরে বেড়াতে বেশি স্বচ্ছন্দবোধ করতেন রবীন্দ্রনাথ।১৮৭৩ সালে এগারো বছর বয়সে রবীন্দ্রনাথের উপনয়ন অনুষ্ঠিত হয়েছিল। এরপর তিনি কয়েক মাসের জন্য পিতার সঙ্গে দেশভ্রমণে বের হন। প্রথমে তাঁরা আসেন শান্তিনিকেতনে। এরপর পাঞ্জাবের অমৃতসরে কিছুকাল কাটিয়ে শিখদের উপাসনা পদ্ধতি পরিদর্শন করেন। শেষে পুত্রকে নিয়ে দেবেন্দ্রনাথ যান পাঞ্জাবেরই (অধুনা ভারতের হিমাচল প্রদেশ রাজ্যে অবস্থিত) ডালহৌসি শৈলশহরের নিকট বক্রোটায়। এখানকার বক্রোটা বাংলোয় বসে রবীন্দ্রনাথ পিতার কাছ থেকে সংস্কৃত ব্যাকরণ, ইংরেজি, জ্যোতির্বিজ্ঞান, সাধারণ বিজ্ঞান ও ইতিহাসের নিয়মিত পাঠ নিতে শুরু করেন। দেবেন্দ্রনাথ তাঁকে বিশিষ্ট ব্যক্তিবর্গের জীবনী, কালিদাস রচিত ধ্রুপদি সংস্কৃত কাব্য ও নাটক এবং উপনিষদ্‌ পাঠেও উৎসাহিত করতেন। ১৮৭৭ সালে “”ভারতী”” পত্রিকায় তরুণ রবীন্দ্রনাথের কয়েকটি গুরুত্বপূর্ণ রচনা প্রকাশিত হয়। এগুলি হল মাইকেল মধুসূদনের “”””, “”ভানুসিংহ ঠাকুরের পদাবলী”” এবং “””” ও “””” নামে দুটি গল্প। এর মধ্যে “”ভানুসিংহ ঠাকুরের পদাবলী”” বিশেষভাবে উল্লেখযোগ্য। এই কবিতাগুলি রাধা-কৃষ্ণ বিষয়ক পদাবলির অনুকরণে “”ভানুসিংহ”” ভণিতায় রচিত। রবীন্দ্রনাথের “”ভিখারিণী”” গল্পটি (১৮৭৭) বাংলা সাহিত্যের প্রথম ছোটগল্প। ১৮৭৮ সালে প্রকাশিত হয় রবীন্দ্রনাথের প্রথম কাব্যগ্রন্থ তথা প্রথম মুদ্রিত গ্রন্থ “”কবিকাহিনী””। এছাড়া এই পর্বে তিনি রচনা করেছিলেন “””” (১৮৮২) কাব্যগ্রন্থটি। রবীন্দ্রনাথের বিখ্যাত কবিতা “””” এই কাব্যগ্রন্থের অন্তর্গত।

যৌবন (১৮৭৮-১৯০১).
১৮৭৮ সালে ব্যারিস্টারি পড়ার উদ্দেশ্যে ইংল্যান্ডে যান রবীন্দ্রনাথ। প্রথমে তিনি ব্রাইটনের একটি পাবলিক স্কুলে ভর্তি হয়েছিলেন। ১৮৭৯ সালে ইউনিভার্সিটি কলেজ লন্ডনে আইনবিদ্যা নিয়ে পড়াশোনা শুরু করেন। কিন্তু সাহিত্যচর্চার আকর্ষণে সেই পড়াশোনা তিনি সমাপ্ত করতে পারেননি। ইংল্যান্ডে থাকাকালীন শেকসপিয়র ও অন্যান্য ইংরেজ সাহিত্যিকদের রচনার সঙ্গে রবীন্দ্রনাথের পরিচয় ঘটে। এই সময় তিনি বিশেষ মনোযোগ সহকারে পাঠ করেন “”রিলিজিও মেদিচি””, “”কোরিওলেনাস”” এবং “”অ্যান্টনি অ্যান্ড ক্লিওপেট্রা””। এই সময় তাঁর ইংল্যান্ডবাসের অভিজ্ঞতার কথা “”ভারতী”” পত্রিকায় পত্রাকারে পাঠাতেন রবীন্দ্রনাথ। উক্ত পত্রিকায় এই লেখাগুলি জ্যেষ্ঠভ্রাতা দ্বিজেন্দ্রনাথ ঠাকুরের সমালোচনাসহ প্রকাশিত হত “””” নামে। ১৮৮১ সালে সেই পত্রাবলি “””” নামে গ্রন্থাকারে ছাপা হয়। এটিই ছিল রবীন্দ্রনাথের প্রথম গদ্যগ্রন্থ তথা প্রথম চলিত ভাষায় লেখা গ্রন্থ। অবশেষে ১৮৮০ সালে প্রায় দেড় বছর ইংল্যান্ডে কাটিয়ে কোনো ডিগ্রি না নিয়ে এবং ব্যারিস্টারি পড়া শুরু না করেই তিনি দেশে ফিরে আসেন।১৮৮৩ সালের ৯ ডিসেম্বর (২৪ অগ্রহায়ণ, ১২৯০ বঙ্গাব্দ) ঠাকুরবাড়ির অধস্তন কর্মচারী বেণীমাধব রায়চৌধুরীর কন্যা ভবতারিণীর সঙ্গে রবীন্দ্রনাথের বিবাহ সম্পন্ন হয়। বিবাহিত জীবনে ভবতারিণীর নামকরণ হয়েছিল মৃণালিনী দেবী (১৮৭৩–১৯০২ )। রবীন্দ্রনাথ ও মৃণালিনীর সন্তান ছিলেন পাঁচ জন: মাধুরীলতা (১৮৮৬–১৯১৮), রথীন্দ্রনাথ (১৮৮৮–১৯৬১), রেণুকা (১৮৯১–১৯০৩), মীরা (১৮৯৪–১৯৬৯) এবং শমীন্দ্রনাথ (১৮৯৬–১৯০৭)। এঁদের মধ্যে অতি অল্প বয়সেই রেণুকা ও শমীন্দ্রনাথের মৃত্যু ঘটে।১৮৯১ সাল থেকে পিতার আদেশে নদিয়া (নদিয়ার উক্ত অংশটি অধুনা বাংলাদেশের কুষ্টিয়া জেলা), পাবনা ও রাজশাহী জেলা এবং উড়িষ্যার জমিদারিগুলির তদারকি শুরু করেন রবীন্দ্রনাথ। কুষ্টিয়ার শিলাইদহের কুঠিবাড়িতে রবীন্দ্রনাথ দীর্ঘ সময় অতিবাহিত করেছিলেন। জমিদার রবীন্দ্রনাথ শিলাইদহে “”পদ্মা”” নামে একটি বিলাসবহুল পারিবারিক বজরায় চড়ে প্রজাবর্গের কাছে খাজনা আদায় ও আশীর্বাদ প্রার্থনা করতে যেতেন। গ্রামবাসীরাও তাঁর সম্মানে ভোজসভার আয়োজন করত।১৮৯০ সালে রবীন্দ্রনাথের অপর বিখ্যাত কাব্যগ্রন্থ “””” প্রকাশিত হয়। কুড়ি থেকে ত্রিশ বছর বয়সের মধ্যে তাঁর আরও কয়েকটি উল্লেখযোগ্য কাব্যগ্রন্থ ও গীতিসংকলন প্রকাশিত হয়েছিল। এগুলি হলো “”””, “”””, “”রবিচ্ছায়া””, “””” ইত্যাদি। ১৮৯১ থেকে ১৮৯৫ সাল পর্যন্ত নিজের সম্পাদিত “”সাধনা”” পত্রিকায় রবীন্দ্রনাথের বেশ কিছু উৎকৃষ্ট রচনা প্রকাশিত হয়। তাঁর সাহিত্যজীবনের এই পর্যায়টি তাই “”সাধনা পর্যায়”” নামে পরিচিত। রবীন্দ্রনাথের “”গল্পগুচ্ছ”” গ্রন্থের প্রথম চুরাশিটি গল্পের অর্ধেকই এই পর্যায়ের রচনা। এই ছোটগল্পগুলিতে তিনি বাংলার গ্রামীণ জনজীবনের এক আবেগময় ও শ্লেষাত্মক চিত্র এঁকেছিলেন।

  • Preprocess the csv files with the following code, using regular expressions to get rid of punctuation. Remember that we need to decode to utf-8 first, since we have unicode input files.
 
from glob import glob
import re
words = []
for f in glob('bangla/wiki/*.csv'):
    words += re.sub(r'[\r\n—?,;।!‘"’\.:\(\)\[\]…0-9]', ' ', open(f, 'rb').read().decode('utf8').strip()).split(' ')
words = list(filter(lambda x: not x in ['', '-'], words))
print(len(words))
# 13964346
words[:25]
#['রবীন্দ্রনাথ',
# 'ঠাকুর',
# 'রবীন্দ্রনাথ',
# 'ঠাকুর',
# '৭ই',
# 'মে',
# '১৮৬১',
# '৭ই',
# 'আগস্ট',
# '১৯৪১',
# '২৫',
# 'বৈশাখ',
# '১২৬৮',
# '২২',
# 'শ্রাবণ',
# '১৩৪৮',
# 'বঙ্গাব্দ',
# 'ছিলেন',
# 'অগ্রণী',
# 'বাঙালি',
# 'কবি',
# 'ঔপন্যাসিক',
# 'সংগীতস্রষ্টা',
# 'নাট্যকার',
# 'চিত্রকর']

  • Create indices for unique words in the dataset.
vocabulary_size = 25000
def build_dataset(words):
  count = [['UNK', -1]]
  count.extend(collections.Counter(words).most_common(vocabulary_size - 1))
  dictionary = dict()
  for word, _ in count:
    dictionary[word] = len(dictionary)
  data = list()
  unk_count = 0
  for word in words:
    if word in dictionary:
      index = dictionary[word]
    else:
      index = 0  # dictionary['UNK']
      unk_count = unk_count + 1
    data.append(index)
  count[0][1] = unk_count
  reverse_dictionary = dict(zip(dictionary.values(), dictionary.keys())) 
  return data, count, dictionary, reverse_dictionary

data, count, dictionary, reverse_dictionary = build_dataset(words)
print('Most common words (+UNK)', count[:5])
# Most common words (+UNK) [['UNK', 1961151], ('এবং', 196916), ('ও', 180042), ('হয়', 160533), ('করে', 131206)]
print('Sample data', data[:10])
#Sample data [1733, 1868, 1733, 1868, 5769, 287, 6855, 5769, 400, 2570]
del words  # Hint to reduce memory.
  • Generate batches to train the word2vec skip-gram model.
  • The target label should be at the center of the buffer each time. That is, given a focus word, our goal will be to learn the most probable context words.
  • The input and the target vector will depend on num_skips and skip_window.
 
data_index = 0
def generate_batch(batch_size, num_skips, skip_window):
  global data_index
  assert batch_size % num_skips == 0
  assert num_skips <= 2 * skip_window
  batch = np.ndarray(shape=(batch_size), dtype=np.int32)
  labels = np.ndarray(shape=(batch_size, 1), dtype=np.int32)
  span = 2 * skip_window + 1 # [ skip_window target skip_window ]
  buffer = collections.deque(maxlen=span)
  for _ in range(span):
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  for i in range(batch_size // num_skips):
    target = skip_window  # target label at the center of the buffer
    targets_to_avoid = [ skip_window ]
    for j in range(num_skips):
      while target in targets_to_avoid:
        target = random.randint(0, span - 1)
      targets_to_avoid.append(target)
      batch[i * num_skips + j] = buffer[skip_window]
      labels[i * num_skips + j, 0] = buffer[target]
    buffer.append(data[data_index])
    data_index = (data_index + 1) % len(data)
  return batch, labels

print('data:', [reverse_dictionary[di] for di in data[:8]])
# data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে', '১৮৬১', '৭ই']
for num_skips, skip_window in [(2, 1), (4, 2)]:
    data_index = 0
    batch, labels = generate_batch(batch_size=8, num_skips=num_skips, skip_window=skip_window)
    print('\nwith num_skips = %d and skip_window = %d:' % 
          (num_skips, skip_window))
    print('    batch:', [reverse_dictionary[bi] for bi in batch])
    print('    labels:', [reverse_dictionary[li] for li in labels.reshape(8)])
    # data: ['রবীন্দ্রনাথ', 'ঠাকুর', 'রবীন্দ্রনাথ', 'ঠাকুর', '৭ই', 'মে',  '১৮৬১', '৭ই']
    # with num_skips = 2 and skip_window = 1:
    # batch: ['ঠাকুর', 'ঠাকুর', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর',  '৭ই', '৭ই']
    # labels: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', '৭ই', 'রবীন্দ্রনাথ', 'ঠাকুর', 'মে']
    # with num_skips = 4 and skip_window = 2:
    # batch: ['রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'রবীন্দ্রনাথ', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর', 'ঠাকুর']
    # labels: ['রবীন্দ্রনাথ', '৭ই', 'ঠাকুর', 'ঠাকুর', 'মে', 'ঠাকুর', 'রবীন্দ্রনাথ', '৭ই']
  • Pick a random validation set to sample nearest neighbors.
  • Limit the validation samples to the words that have a low numeric ID, which by construction are also the most frequent.
  • Look up embeddings for inputs and compute the softmax loss, using a sample of the negative labels each time (this is known as negative sampling, which is used to make the computation efficient, since the number of labels is often too large).
  • The optimizer will optimize the softmax_weights and the embeddings.
    This is because the embeddings are defined as a variable quantity and the optimizer’s `minimize` method will by default modify all variable quantities that contribute to the tensor it is passed.
  • Compute the similarity between minibatch examples and all embeddings.
 
batch_size = 128
embedding_size = 128 # Dimension of the embedding vector.
skip_window = 1 # How many words to consider left and right.
num_skips = 2 # How many times to reuse an input to generate a label.
valid_size = 16 # Random set of words to evaluate similarity on.
valid_window = 100 # Only pick dev samples in the head of the distribution.
valid_examples = np.array(random.sample(range(valid_window), valid_size))
num_sampled = 64 # Number of negative examples to sample.

graph = tf.Graph()
with graph.as_default(), tf.device('/cpu:0'):
  # Input data.
  train_dataset = tf.placeholder(tf.int32, shape=[batch_size])
  train_labels = tf.placeholder(tf.int32, shape=[batch_size, 1])
  valid_dataset = tf.constant(valid_examples, dtype=tf.int32)
  # Variables.
  embeddings = tf.Variable(
      tf.random_uniform([vocabulary_size, embedding_size], -1.0, 1.0))
  softmax_weights = tf.Variable(
      tf.truncated_normal([vocabulary_size, embedding_size],
                          stddev=1.0 / math.sqrt(embedding_size)))
  softmax_biases = tf.Variable(tf.zeros([vocabulary_size]))
  # Model: look up embeddings for the inputs and compute the sampled softmax loss.
  embed = tf.nn.embedding_lookup(embeddings, train_dataset)
  loss = tf.reduce_mean(
      tf.nn.sampled_softmax_loss(weights=softmax_weights, biases=softmax_biases,
                                 inputs=embed, labels=train_labels,
                                 num_sampled=num_sampled, num_classes=vocabulary_size))
  # Optimizer.
  optimizer = tf.train.AdagradOptimizer(1.0).minimize(loss)
  # Compute the similarity between minibatch examples and all embeddings,
  # using the cosine distance:
  norm = tf.sqrt(tf.reduce_sum(tf.square(embeddings), 1, keepdims=True))
  normalized_embeddings = embeddings / norm
  valid_embeddings = tf.nn.embedding_lookup(normalized_embeddings, valid_dataset)
  similarity = tf.matmul(valid_embeddings, tf.transpose(normalized_embeddings))
 
  • Train the word2vec model with the batches constructed, for 100k steps.
 
num_steps = 100001

with tf.Session(graph=graph) as session:
  tf.global_variables_initializer().run()
  print('Initialized')
  average_loss = 0
  for step in range(num_steps):
    batch_data, batch_labels = generate_batch(
      batch_size, num_skips, skip_window)
    feed_dict = {train_dataset : batch_data, \
                 train_labels : batch_labels}
    _, l = session.run([optimizer, loss], feed_dict=feed_dict)
    average_loss += l
    if step % 2000 == 0:
      if step > 0:
        average_loss = average_loss / 2000
      # The average loss is an estimate of the loss over the last 
      # 2000 batches.
      print('Average loss at step %d: %f' % (step, average_loss))
      average_loss = 0
      # note that this is expensive (~20% slowdown if computed every 
      # 500 steps)
    if step % 10000 == 0:
      sim = similarity.eval()
      for i in range(valid_size):
        valid_word = reverse_dictionary[valid_examples[i]]
        top_k = 8 # number of nearest neighbors
        nearest = (-sim[i, :]).argsort()[1:top_k+1]
        log = 'Nearest to %s:' % valid_word
        for k in range(top_k):
          close_word = reverse_dictionary[nearest[k]]
          log = '%s %s,' % (log, close_word)
        print(log)
  final_embeddings = normalized_embeddings.eval()
  • The following shows how the loss function decreases with the increase in training steps.
  • During the training process, the words that become semantically near come closer in the embedding space.
  • Use a t-SNE plot to map the following words from the 128-dimensional embedding space to a 2-dimensional manifold and visualize them.
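Note that the tsne object and the plot helper used in the next code block are not defined in the post; the following is a minimal sketch of what they might look like (the t-SNE hyperparameters are assumptions):

from sklearn.manifold import TSNE

# project the 128-dimensional embeddings onto a 2D manifold
tsne = TSNE(perplexity=30, n_components=2, init='pca', n_iter=5000)

# scatter the 2D embeddings and annotate each point with its word
def plot(embeddings, labels):
  pylab.figure(figsize=(15,15))
  for i, label in enumerate(labels):
    x, y = embeddings[i,:]
    pylab.scatter(x, y)
    pylab.annotate(label, xy=(x, y), xytext=(5, 2), textcoords='offset points')
  pylab.show()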
 
words = ['রাজা', 'রাণী', 'ভারত','বাংলাদেশ','দিল্লী','কলকাতা','ঢাকা',
         'পুরুষ','নারী','দুঃখ','লেখক','কবি','কবিতা','দেশ',
         'বিদেশ','লাভ','মানুষ', 'এবং', 'ও', 'গান', 'সঙ্গীত', 'বাংলা', 
         'ইংরেজি', 'ভাষা', 'কাজ', 'অনেক', 'জেলার', 'বাংলাদেশের', 
         'এক', 'দুই', 'তিন', 'চার', 'পাঁচ', 'দশ', '১', '৫', '২০', 
         'নবম', 'ভাষার', '১২', 'হিসাবে', 'যদি', 'পান', 'শহরের', 'দল', 
         'যদিও', 'বলেন', 'রান', 'করেছে', 'করে', 'এই', 'করেন', 'তিনি', 
         'একটি', 'থেকে', 'করা', 'সালে', 'এর', 'যেমন', 'সব',  'তার', 
         'খেলা',  'অংশ', 'উপর', 'পরে', 'ফলে',  'ভূমিকা', 'গঠন',  
         'তা', 'দেন', 'জীবন', 'যেখানে', 'খান', 'এতে',  'ঘটে', 'আগে', 
         'ধরনের', 'নেন', 'করতেন', 'তাকে', 'আর', 'যার', 'দেখা', 
         'বছরের', 'উপজেলা', 'থাকেন', 'রাজনৈতিক', 'মূলত', 'এমন', 
         'কিলোমিটার', 'পরিচালনা', '২০১১', 'তারা', 'তিনি', 'যিনি', 'আমি',  
         'তুমি', 'আপনি', 'লেখিকা', 'সুখ', 'বেদনা', 'মাস', 'নীল', 'লাল', 
         'সবুজ', 'সাদা', 'আছে', 'নেই', 'ছুটি', 'ঠাকুর',
         'দান', 'মণি', 'করুণা', 'মাইল', 'হিন্দু', 'মুসলমান','কথা', 'বলা',     
         'সেখানে', 'তখন', 'বাইরে', 'ভিতরে', 'ভগবান' ]
indices = []
for word in words:
    #print(word, dictionary[word])
    indices.append(dictionary[word])
two_d_embeddings = tsne.fit_transform(final_embeddings[indices, :])
plot(two_d_embeddings, words)
  • The following figure shows how words that are similar in meaning are mapped to embedding vectors that are close to each other.
  • Also, note the arithmetic property of the word embeddings: e.g., the words ‘রাজা’ and ‘রাণী’ are separated by approximately the same distance and direction as the words ‘লেখক’ and ‘লেখিকা’, reflecting the fact that the semantic relationship in terms of gender is the same for both pairs.
  • The following animation shows how, as training proceeds, the learnt embedding preserves the semantic similarity in the 2D manifold more and more.

Generating song-like texts with LSTM from Tagore’s Bangla songs

Text generation with Character LSTM

  • Let’s import the required libraries first.
from tensorflow.keras.callbacks import LambdaCallback
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.optimizers import RMSprop, Adam
import io, re
  • Read the input file, containing a few selected songs of Tagore in Bangla.
raw_text = open('rabindrasangeet.txt','rb').read().decode('utf8')
print(raw_text[0:1000])

পূজা

অগ্নিবীণা বাজাও তুমি

অগ্নিবীণা বাজাও তুমি কেমন ক’রে !
আকাশ কাঁপে তারার আলোর গানের ঘোরে ।।
তেমনি ক’রে আপন হাতে ছুঁলে আমার বেদনাতে,
নূতন সৃষ্টি জাগল বুঝি জীবন-‘পরে ।।
বাজে ব’লেই বাজাও তুমি সেই গরবে,
ওগো প্রভু, আমার প্রাণে সকল সবে ।
বিষম তোমার বহ্নিঘাতে বারে বারে আমার রাতে
জ্বালিয়ে দিলে নূতন তারা ব্যথায় ভ’রে ।।

অচেনাকে ভয় কী
অচেনাকে ভয় কী আমার ওরে?
অচেনাকেই চিনে চিনে উঠবে জীবন ভরে ।।
জানি জানি আমার চেনা কোনো কালেই ফুরাবে না,
চিহ্নহারা পথে আমায় টানবে অচিন ডোরে ।।
ছিল আমার মা অচেনা, নিল আমায় কোলে ।
সকল প্রেমই অচেনা গো, তাই তো হৃদয় দোলে ।
অচেনা এই ভুবন-মাঝে কত সুরেই হৃদয় বাজে-
অচেনা এই জীবন আমার, বেড়াই তারি ঘোরে ।।অন্তর মম
অন্তর মম বিকশিত করো অন্তরতর হে-
নির্মল করো, উজ্জ্বল করো, সুন্দর করো হে ।।
জাগ্রত করো, উদ্যত করো, নির্ভয় করো হে ।।
মঙ্গল করো, নিরলস নিঃসংশয় করো হে ।।
যুক্ত করো হে সবার সঙ্গে, মুক্ত করো হে বন্ধ ।
সঞ্চার করো সকল কর্মে শান্ত তোমার ছন্দ ।
চরণপদ্মে মম চিত নিস্পন্দিত করো হে ।
নন্দিত করো, নন্দিত করো, নন্দিত করো হে ।।

অন্তরে জাগিছ অন্তর্যামী
অন্তরে জাগিছ অন্তর্যামী ।
  • Here we shall be using a many-to-many RNN as shown in the next figure.
  • Pre-process the text and create character indices to be used as the input in the model.
 
processed_text = raw_text.lower()
print('corpus length:', len(processed_text))
# corpus length: 207117
chars = sorted(list(set(processed_text)))
print('total chars:', len(chars))
# total chars: 89
char_indices = dict((c, i) for i, c in enumerate(chars))
indices_char = dict((i, c) for i, c in enumerate(chars))
  • Cut the text into semi-redundant sequences of maxlen characters, skipping windows that would start at a Bengali sign / mark character.
 
def is_conjunction(c):
  # check whether the character is a Bengali sign / dependent-mark character
  # (the upper bound 0x983 of the first range below is an assumption)
  h = ord(c) # print(hex(ord(c)))
  return (h >= 0x980 and h <= 0x983) or (h >= 0x9bc and h <= 0x9f2)

maxlen = 40
step = 2
sentences = []
next_chars = []
i = 0
while i < len(processed_text) - maxlen:
  if is_conjunction(processed_text[i]):
    i += 1
    continue
  sentences.append(processed_text[i: i + maxlen])
  next_chars.append(processed_text[i + maxlen])
  i += step

print('nb sequences:', len(sentences))
# nb sequences: 89334
  • Create one-hot-encodings.
 
x = np.zeros((len(sentences), maxlen, len(chars)), dtype=bool)
y = np.zeros((len(sentences), len(chars)), dtype=bool)
for i, sentence in enumerate(sentences):
  for t, char in enumerate(sentence):
    x[i, t, char_indices[char]] = 1
  y[i, char_indices[next_chars[i]]] = 1
  • Build a model, a single LSTM.
 
model = Sequential()
model.add(LSTM(256, input_shape=(maxlen, len(chars))))
model.add(Dense(128, activation='relu'))
model.add(Dense(len(chars), activation='softmax'))
optimizer = Adam(lr=0.01) #RMSprop(lr=0.01)
model.compile(loss='categorical_crossentropy', optimizer=optimizer)
  • The following figure shows what the model architecture looks like:
  • Print the model summary.
 
model.summary()

Model: "sequential" 
_________________________________________________________________ 
Layer (type)                 Output Shape              Param #    
================================================================= 
lstm (LSTM)                  (None, 256)               354304     
_________________________________________________________________ 
dense (Dense)                (None, 128)               32896      
_________________________________________________________________ 
dense_1 (Dense)              (None, 89)                11481      
================================================================= 
Total params: 398,681 
Trainable params: 398,681 
Non-trainable params: 0 
_________________________________________________________________
  • Use the following helper function to sample an index from a probability array.
 
def sample(preds, temperature=1.0):
  preds = np.asarray(preds).astype('float64')
  preds = np.log(preds) / temperature
  exp_preds = np.exp(preds)
  preds = exp_preds / np.sum(exp_preds)
  probas = np.random.multinomial(1, preds, 1)
  return np.argmax(probas)
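The on_epoch_end callback registered below is not shown in the post; here is a minimal sketch along the lines of the standard Keras character-level text-generation example (the random seed selection, the generation length of 100 characters, and the temperature values are assumptions):

import random

def on_epoch_end(epoch, _):
  # at the end of each epoch, generate text from a random seed at a few temperatures
  print('----- Generating text after epoch: %d' % epoch)
  start_index = random.randint(0, len(processed_text) - maxlen - 1)
  for temperature in [0.2, 0.5, 1.0, 1.2]:
    sentence = processed_text[start_index: start_index + maxlen]
    generated = sentence
    for _ in range(100):
      # one-hot encode the current window and predict the next character
      x_pred = np.zeros((1, maxlen, len(chars)))
      for t, char in enumerate(sentence):
        x_pred[0, t, char_indices[char]] = 1.
      preds = model.predict(x_pred, verbose=0)[0]
      next_char = indices_char[sample(preds, temperature)]
      generated += next_char
      sentence = sentence[1:] + next_char
    print('temperature: %.1f, generated: %s' % (temperature, generated))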
  • Fit the model and register a callback to print the text generated by the model at the end of each epoch.
 
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(x, y, batch_size=128, epochs=60, callbacks=[print_callback])
  • The following animation shows how the model generates song-like texts with given seed texts, for different values of the temperature parameter.

Text Generation with Word LSTM

  • Pre-process the input text, split by punctuation characters and create word indices to be used as the input in the model.
processed_text = raw_text.lower()
from string import punctuation
r = re.compile(r'[\s{}]+'.format(re.escape(punctuation)))
words = r.split(processed_text)
print(len(words))
# 39481
words[:16]
# ['পূজা',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'অগ্নিবীণা',
# 'বাজাও',
# 'তুমি',
# 'কেমন',
# 'ক’রে',
# 'আকাশ',
# 'কাঁপে',
# 'তারার',
# 'আলোর',
# 'গানের',
# 'ঘোরে',
# '।।']

unique_words = np.unique(words)
unique_word_index = dict((c, i) for i, c in enumerate(unique_words))
index_unique_word = dict((i, c) for i, c in enumerate(unique_words))
  • Create a word-window of length 5 to predict the next word.
WORD_LENGTH = 5
prev_words = []
next_words = []
for i in range(len(words) - WORD_LENGTH):
    prev_words.append(words[i:i + WORD_LENGTH])
    next_words.append(words[i + WORD_LENGTH])
print(prev_words[1])
# ['অগ্নিবীণা', 'বাজাও', 'তুমি', 'অগ্নিবীণা', 'বাজাও']
print(next_words[1])
# তুমি
print(len(unique_words))
# 7847
  • Create one-hot encodings for the input and output words, as done for the character RNN (a minimal sketch is given below), and fit the model on the pre-processed data.
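The one-hot encoding and the word-level model are not shown in the post; the following is a minimal sketch analogous to the character LSTM above (the architecture and hyperparameters are assumptions):

X = np.zeros((len(prev_words), WORD_LENGTH, len(unique_words)), dtype=bool)
Y = np.zeros((len(next_words), len(unique_words)), dtype=bool)
for i, prev in enumerate(prev_words):
    for j, word in enumerate(prev):
        X[i, j, unique_word_index[word]] = 1
    Y[i, unique_word_index[next_words[i]]] = 1

# a single-LSTM model over word windows, mirroring the character model above
model = Sequential()
model.add(LSTM(128, input_shape=(WORD_LENGTH, len(unique_words))))
model.add(Dense(len(unique_words), activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer=Adam(lr=0.01))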
print_callback = LambdaCallback(on_epoch_end=on_epoch_end)
model.fit(X, Y,
          batch_size=128,
          epochs=60,
          callbacks=[print_callback])
  • The following animation shows the song-like text generated by the word-LSTM at the end of an epoch.

Bangla Sentiment Analysis using LSTM with Daily Astrological Prediction Dataset

  • Let’s first create a sentiment analysis dataset by crawling the daily astrological predictions (রাশিফল) page of the online edition of আনন্দবাজার পত্রিকা (e.g., for the year 2013), a leading Bangla newspaper, and then manually labeling the sentiment of each of the predictions corresponding to each moon-sign.
  • Read the csv dataset; the first few lines look like the following.
 
df = pd.read_csv('horo_2013_labeled.csv')
pd.set_option('display.max_colwidth', 135) 
df.head(20)
  • Transform each text into a sequence of integers.
 
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer(num_words=2000, split=' ')
tokenizer.fit_on_texts(df['আপনার আজকের দিনটি'].values)
X = tokenizer.texts_to_sequences(df['আপনার আজকের দিনটি'].values)
X = pad_sequences(X)
X
#array([[   0,    0,    0, ...,   26,  375,    3],        
#       [   0,    0,    0, ...,   54,    8,    1],        
#       [   0,    0,    0, ...,  108,   42,   43],        
#       ...,        
#       [   0,    0,    0, ..., 1336,  302,   82],        
#       [   0,    0,    0, ..., 1337,  489,  218],        
#       [   0,    0,    0, ...,    2,  316,   87]])
  • Here we shall use a many-to-one RNN for sentiment analysis as shown below.
  • Build an LSTM model that takes a sentence as input and outputs the sentiment label.
from tensorflow.keras.layers import Embedding, SpatialDropout1D

model = Sequential()
model.add(Embedding(2000, 128,input_length = X.shape[1]))
model.add(SpatialDropout1D(0.3))
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(2,activation='softmax'))
model.compile(loss = 'categorical_crossentropy', optimizer='adam',metrics = ['accuracy'])
print(model.summary())

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_10 (Embedding)     (None, 12, 128)           256000    
_________________________________________________________________
spatial_dropout1d_10 (Spatia (None, 12, 128)           0         
_________________________________________________________________
lstm_10 (LSTM)               (None, 128)               131584    
_________________________________________________________________
dense_10 (Dense)             (None, 2)                 258       
=================================================================
Total params: 387,842
Trainable params: 387,842
Non-trainable params: 0
_________________________________________________________________
None
  • Divide the dataset into train and validation (test) datasets and train the LSTM model on the training dataset.
 
Y = pd.get_dummies(df['sentiment']).values
X_train, X_test, Y_train, Y_test, _, indices = train_test_split(X,Y, np.arange(len(X)), test_size = 0.33, random_state = 5)
model.fit(X_train, Y_train, epochs = 5, batch_size=32, verbose = 2)

#Epoch 1/5  - 3s - loss: 0.6748 - acc: 0.5522 
#Epoch 2/5  - 1s - loss: 0.5358 - acc: 0.7925 
#Epoch 3/5  - 1s - loss: 0.2368 - acc: 0.9418 
#Epoch 4/5  - 1s - loss: 0.1011 - acc: 0.9761 
#Epoch 5/5  - 1s - loss: 0.0578 - acc: 0.9836 
  • Predict the sentiment labels of the (held out) test dataset.
 
result = model.predict(X[indices],batch_size=1,verbose = 2)
df1 = df.iloc[indices]
df1['neg_prob'] = result[:,0]
df1['pos_prob'] = result[:,1]
df1['pred'] = np.array(['negative', 'positive'])[np.argmax(result, axis=1)]
df1.head()
  • Finally, compute the accuracy of the model separately for the positive and the negative ground-truth sentiments of the daily astrological predictions.
 
df2 = df1[df1.sentiment == 'positive']
print('positive accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#positive accuracy:0.9177215189873418
df2 = df1[df1.sentiment == 'negative']
print('negative accuracy:' + str(np.mean(df2.sentiment == df2.pred)))
#negative accuracy:0.9352941176470588

Building a very simple Bangla Chatbot with RASA NLU

  • The following figure shows how to design a very simple Bangla chatbot to order food from restaurants using RASA NLU.
  • We need to design the intents, entities and slots to extract the entities properly, and then design stories to define how the chatbot will respond to user inputs (core / dialog).
  • The following figure shows how the nlu, domain and stories files are written for the simple chatbot (a hypothetical fragment is sketched after the training code below).
  • A deep learning model is trained under the hood for intent classification. The next code block shows how the model can be trained.
import rasa
model_path = rasa.train('domain.yml', 'config.yml', ['data/'], 'models/')
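Since the figure with the nlu / stories files is not reproduced here, a hypothetical fragment in the RASA 1.x markdown format may help illustrate the idea (the intent name, Bangla examples, entity and response names below are all made-up placeholders, not the actual files):

data/nlu.md:

## intent:order_food
- আমি [বিরিয়ানি](dish) অর্ডার করতে চাই
- একটা [পিৎজা](dish) দিতে পারবেন?

data/stories.md:

## order food story
* order_food{"dish": "বিরিয়ানি"}
  - utter_confirm_order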
  • The following gif demonstrates how the chatbot responds to user inputs.

References