# Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

• First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
• Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
• Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
• The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

• Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
• The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
• Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
• The learning rate η is gradually decreased with iteration.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:


class SVMPegasos(LinearClassifier):
"""Implementation of SVM with SGD with PEGASOS Algorithm"""

def __init__(self, n_iter=10, lambda1=1):
self.n_iter = n_iter
self.lambda1 = lambda1

def fit(self, X, Y):
Y = list(Y)
self.find_classes(Y)
# convert all output values to +1 or -1
Yn = [sign(y, self.positive_class) for y in Y]
X = X.toarray()
m, n_features = X.shape[0], X.shape[1]
self.w = numpy.zeros( n_features )
for i in range(self.n_iter):
eta = 1. / (self.lambda1*(i+1))
j = numpy.random.choice(m, 1)[0]
x, y = X[j], Yn[j]
score = self.w.dot(x)
if y*score < 1:
self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
else:
self.w = (1 - eta*self.lambda1)*self.w



### Some Notes

• The optional projection step has been left out (the line in square brackets in the paper).
•  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
•  The number η is the step length in gradient descent.
• The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
• Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
• Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
• Classification accuracy obtained on the test split was 95%.




## 2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.

### Dataset 1

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
• Classification accuracy obtained on the test split was 95.8%.




### Dataset 2

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
• Classification accuracy obtained on the test split was 100%.




## 3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.





### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
• Classification accuracy obtained on the test split was 97.5%.




## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

• Create sklearn Pipeline to
• Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
• Use sklearn SelectKBest to choose top 5000 features
• Train / test the model
• Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
• Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
• Also Record the time taken to train the classifier.
• Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.


def classify_with_svm():
# split into training and test parts
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
random_state=0)
classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
tokenizer = lambda x: x)),
('fs', SelectKBest(k=5000)),
('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
t0 = time.time()
classifier.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time:', t1-t0, 'seconds.')
Yguess = classifier.predict(Xtest)
print('accuracy:', accuracy_score(Ytest, Yguess))



### Some Notes

• The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
• This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
• The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
• For Perceptron the entire training data was passed 10 times to the model while training.
• For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.

## Results

As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.

# Learning Distributed Word  Representations with Neural Network: an implementation in Octave

In this article, the problem of learning word representations with neural network from scratch is going to be described. This problem appeared as an assignment in the Coursera course Neural Networks for Machine Learning, taught by  Prof.  Geoffrey Hinton from the University of Toronto in 2012.  This problem also appeared as an assignment in this course from the same university.  The problem description is taken from the assignment pdf.

## Problem Statement

In this article we will design a neural net language model. The model will learn to
predict the next word given the previous three words. The network looks like the following:

• The dataset provided consists of 4-grams (A 4-gram is a sequence of 4 adjacent words in a sentence). These 4-grams were extracted from a large collection of text.
• The 4-grams are chosen so that all the words involved come from a small
vocabulary of 250 words. Note that for the purposes of this assignment special characters such as commas, full-stops, parentheses etc. are also considered words.
• Few of the 250 words in the vocabulary are shown as the output from the matlab / octave code below.

data.vocab
ans =
{
[1,1] = all
[1,2] = set
[1,3] = just
[1,4] = show
[1,5] = being
[1,6] = money
[1,7] = over
[1,8] = both
[1,9] = years
[1,10] = four
[1,11] = through
[1,12] = during
[1,13] = go
[1,14] = still
[1,15] = children
[1,16] = before
[1,17] = police
[1,18] = office
[1,19] = million
[1,20] = also
.
.
[1,246] = so
[1,247] = time
[1,248] = five
[1,249] = the
[1,250] = left
}

• The training set consists of 372,550 4-grams. The validation and test sets have 46,568 4-grams each.
• Let’s first look at the raw sentences file, first few lines of the file is shown below. It contains the raw sentences from which these 4-grams were extracted. It can be seen that the kind of sentences we are dealing with here are fairly simple ones.

The raw sentences file: first few lines

No , he says now .
And what did he do ?
The money ‘s there .
That was less than a year ago .
But he made only the first .
There ‘s still time for them to do it .
But he should nt have .
They have to come down to the people .
I do nt know where that is .
No , I would nt .
Who Will It Be ?
And no , I was not the one .
You could do a Where are they now ?
There ‘s no place like it that I know of .
Be here now , and so on .
It ‘s not you or him , it ‘s both of you .
So it ‘s not going to get in my way .
When it ‘s time to go , it ‘s time to go .
No one ‘s going to do any of it for us .
Well , I want more .
Will they make it ?
Who to take into school or not take into school ?
But it ‘s about to get one just the same .
We all have it .

• The training data extracted from this raw text is a matrix of 372550 X 4. This means there are 372550 training cases and 4 words (corresponding to each 4-gram) per training case.
• Each entry is an integer that is the index of a word in the vocabulary. So each row represents a sequence of 4 words. The following octave / matlab code shows how the training dataset looks like.


[train_x, train_t, valid_x, valid_t, test_x, test_t, vocab] = load_data(100);

% 3-gram features for a training data-tuple
train_x(:,13,14)
%ans =
%46
%58
%32
data.vocab{train_x(:,13,14)}
%ans = now
%ans = where
%ans = do

% target for the same data tuple from training dataset
train_t(:,13,14)
%ans = 91
data.vocab{train_t(:,13,14)}
%ans = we


• The validation and test data are also similar. They contain 46,568 4-grams each.
• Before starting the training, all three need to be separated into inputs and targets and the training set needs to be split into mini-batches.
• The data needs to get loaded and then separated into inputs and target. After that,  mini-batches of size 100 for the training set are created.
• First we need to train the model for one epoch (one pass through the training set using forward propagation). Once implemented the cross-entropy loss will start decreasing.
• At this point, we can try changing the hyper-parameters (number of epochs, number of hidden units, learning rates, momentum, etc) to see what effect that has on the training and validation cross entropy.
• The training method will output a ‘model’ (weight matrices, biases for each layer in the network).

## Description of the Network

• As shown above, the network consists of an input layer, embedding layer, hidden layer and output layer.
• The input layer consists of three word indices. The same ‘word_embedding_weights’ are used to map each index to a distributed feature representation. These mapped features constitute the embedding layer. More details can be found here.
• This layer is connected to the hidden layer, which in turn is connected to the output layer.
• The output layer is a softmax over the 250 words.
• The training consists of two steps:  (1) forward propagation: computes (predicts) the output probabilities of the words in the vocabulary as the next word given a 3-gram as input. (2) back-propagation: propagates the error in prediction from the output layer to the input layer through the hidden layers.

## Forward Propagation

• The forward propagation is pretty straight-forward and can be implemented as shown in the following code:

function [embedding_layer_state, hidden_layer_state, output_layer_state] = ...
fprop(input_batch, word_embedding_weights, embed_to_hid_weights,...
hid_to_output_weights, hid_bias, output_bias)
% This method forward propagates through a neural network.
% Inputs:
% input_batch: The input data as a matrix of size numwords X batchsize where,
% numwords is the number of words, batchsize is the number of data points.
% So, if input_batch(i, j) = k then the ith word in data point j is word
% index k of the vocabulary.
%
% word_embedding_weights: Word embedding as a matrix of size
% vocab_size X numhid1, where vocab_size is the size of the vocabulary
% numhid1 is the dimensionality of the embedding space.
%
% embed_to_hid_weights: Weights between the word embedding layer and hidden
% layer as a matrix of soze numhid1*numwords X numhid2, numhid2 is the
% number of hidden units.
%
% hid_to_output_weights: Weights between the hidden layer and output softmax
% unit as a matrix of size numhid2 X vocab_size
%
% hid_bias: Bias of the hidden layer as a matrix of size numhid2 X 1.
%
% output_bias: Bias of the output layer as a matrix of size vocab_size X 1.
%
% Outputs:
% embedding_layer_state: State of units in the embedding layer as a matrix of
% size numhid1*numwords X batchsize
%
% hidden_layer_state: State of units in the hidden layer as a matrix of size
% numhid2 X batchsize
%
% output_layer_state: State of units in the output layer as a matrix of size
% vocab_size X batchsize
%

[numwords, batchsize] = size(input_batch);
[vocab_size, numhid1] = size(word_embedding_weights);
numhid2 = size(embed_to_hid_weights, 2);

%% COMPUTE STATE OF WORD EMBEDDING LAYER.
% Look up the inputs word indices in the word_embedding_weights matrix.
embedding_layer_state = reshape(...
word_embedding_weights(reshape(input_batch, 1, []),:)',...
numhid1 * numwords, []);

%% COMPUTE STATE OF HIDDEN LAYER.
% Compute inputs to hidden units.
inputs_to_hidden_units = embed_to_hid_weights' * embedding_layer_state + ...
repmat(hid_bias, 1, batchsize);

% Apply logistic activation function.
hidden_layer_state = 1 ./ (1 + exp(-inputs_to_hidden_units)); %zeros(numhid2, batchsize);

%% COMPUTE STATE OF OUTPUT LAYER.
% Compute inputs to softmax.
inputs_to_softmax = hid_to_output_weights' * hidden_layer_state + repmat(output_bias, 1, batchsize); %zeros(vocab_size, batchsize);

% Subtract maximum.
% Remember that adding or subtracting the same constant from each input to a
% softmax unit does not affect the outputs. Here we are subtracting maximum to
% make all inputs &amp;amp;amp;amp;amp;amp;amp;amp;amp;amp;lt;= 0. This prevents overflows when computing their
% exponents.
inputs_to_softmax = inputs_to_softmax...
- repmat(max(inputs_to_softmax), vocab_size, 1);

% Compute exp.
output_layer_state = exp(inputs_to_softmax);

% Normalize to get probability distribution.
output_layer_state = output_layer_state ./ repmat(...
sum(output_layer_state, 1), vocab_size, 1);



## Back-Propagation

•  The back-propagation is much more involved. The math for the back-propagation is shown below for a simple 2-layer network, taken from this lecture note.

• As the model trains it prints out some numbers that tell how well the training is going.
• The model shows the average per-case cross entropy (CE) obtained on the training set. The average CE is computed every 100 mini-batches. The average CE over the entire training set is reported at the end of every epoch.
• After every 1000 mini-batches of training, the model is run on the validation set. Recall, that the validation set consists of data that is not used for training. It is used to see how well the model does on unseen data. The cross entropy on validation set is reported.
• The validation error is expected to decrease with increasing epochs till the model starts getting over-fitted with the training data. Hence, the training is stopped immediately when the validation error starts increasing to prevent over-fitting.
• At the end of training, the model is run both on the validation set and on the test set and the cross entropy on both is reported.

# Some Applications

## 1. Predict next word

• Once the model has been trained, it can be used to produce some predictions for the next word given a set of 3 previous words.
• The next example shows when the model is given a 3-gram ‘life’, ‘in’, ‘new’ as input and asked to predict the next word, it predicts the word ‘york’ to be most likely word with the highest (~0.94) probability and the words such as ‘year’, ‘life’ and ‘world’ with low probabilities.
• It also shows how the forward propagation is used to compute the prediction: the distribution for the next word given the 3-gram. First the words are projected into the embedding space, flattened and then the weight-matrices are multiplied sequentially followed by application of the softmax function to compute the likelihood of each word being a next word following the 3-gram.

## 2. Generate stylized pseudo-random text

Here are the steps to generate a piece of pseudo-random  text:

1. Given 3 words to start from, initialize the text with those 3 words.
2. Next, the model is asked to predict k most probable words as a candidate word following the last 3 words.
3. Choose one of the most probable words predicted randomly and insert it at the end of the text.
4. Repeat steps 2-3 to generate more words otherwise stop.

Here is the code that by default generates top 3 predictions for each 3-gram sliding window and chooses one of predicted words tandomly:


function gen_rand_text(words, model, k=3)

probs = [];
i = 4;
while (i < 20 || word != '.')
[word, prob] = predict_next_word(words{i-3}, words{i-2}, words{i-1}, model, k);                   words = {words{:}, word};
probs = [probs; prob];
i = i + 1;
end
fprintf(1, "%s ", words{:}) ;
fprintf(1, '\n');
fprintf(1, "%.2f ", round(probs.*100)./100) ;
fprintf(1, '\n');

end



Starting with the words  'i was going‘, here are some texts that were generated using the model:

Starting with the words  ‘life in new‘, here is a piece of text that was generated using the model:

## 3. Find nearest words

•  The word embedding weight matrix can be used to represent a word in the embedding space and then the distances from every other word in the vocabulary are computed in this word representation space. Then the closest words are returned.
• As can be seen from the following animation examples, the semantically closer words are chosen mostly as the nearest words given a word. Also, higher the number of epochs, better the ordering of the words in terms of semantic similarity.
• For example, the closest semantically similar word (i.e. with least distance) for the word ‘between’ is the word ‘among‘, whereas the nearest words for ‘day’ are ‘year’ and ‘week’. Also, the word ‘and’ is nearer to the word ‘but’ than the word ‘or’.

## 4. Visualization in 2-dimension with t-SNE

•  In all the above examples, the dimension of the word embedding space was 50. Using t-SNE plot (t-distributed stochastic nearest neighbor embedding by Laurens van der Maaten) the words can be projected into a 2 dimensional space and visualized, by keeping the (semantically) nearer words in the distributed representation space nearer in the projected space.
• As can be seen from the following figures, the semantically close words (highlighted with ellipses) are placed near to each other in the visualization, since in the distributed representation space they were close to each other.
• Also, the next animation visualizes how the neighborhood of each word changes with training epochs (the model is trained up to 10 epochs).

## 5. Solving Word-Analogy Problem

•  with the distributed representation: In this type of problems 2 words (w1, w2) from the vocabulary are given where the first is relate to the second one with some semantic relation.  Now, a third word (w3, from the vocabulary) is given and a fourth word that has similar semantic relation with the third word is to be found from the vocabulary.
• The following figure shows the word analogy problem and a possible solution using an exhaustive search in the embedding space for a word that has the distance (with the third word) that is closest to the distance in between the first and second word in the representation space.

• The next code shows results of a few word-analogy example problems and the solutions found using the distributed representation space. As can be seen, despite the fact that the dataset was quite small and there were only 250 words in the vocabulary, the algorithm worked quite well to find the answers for the examples shown.

analogy('year', 'years', 'day', model); % singular-plural relation
%year:years::day:days
%dist_E('year','years')=1.119368, dist_E('day', 'days')= 1.169186

analogy('on', 'off', 'new', model) % antonyms relation
%on:off::new:old
%dist_E('on','off')=2.013958, dist_E('new','old')=2.265665

analogy('use', 'used', 'do', model) % present-past relation
%use:used::do:did
%dist_E('use','used')=2.556175, dist_E('do','did')=2.456098

analogy('he', 'his', 'they', model) % pronoun-relations
%he:his::they:their
%dist_E('he','his')=3.824808, dist_E('they','their')=3.825453

analogy('today', 'yesterday', 'now', model)
%today:yesterday::now:then
%dist_E('today','yesterday')=1.045192, dist_E('now','then')=1.220935


## Model Selection

• Now the model is trained 4 times by changing the values of the hyper-parameters d (dimension of the representation space) and h (the number of nodes in the hidden layer), by trying all possible combinations d=8, d=32 and h=64, h=256.
• The following figures show the cross-entropy errors on the training and validation sets for the models.As can be seen from the following figures,  the models with hidden layer size 64 are trained till 3 epochs, whereas the models with hidden layer size 256 are trained for 4 epochs (since higher numbers of parameters to train).
• The least validation error (also least training error) is obtained for the model with d=32 and h=256, so this is the best model.

# Some Applications of Markov Chain in Python

In this article a few simple applications of Markov chain are going to be discussed as a solution to a few text processing problems. These problems appeared as assignments in a few courses, the descriptions are taken straightaway from the courses themselves.

# 1. Markov Model of Natural Language

This problem appeared as an assignment in the Princeton course cos126 . This assignment was developed by Prof. Bob Sedgewick and Kevin Wayne, based on the classic idea of Claude Shannon.

## Problem Statement

Use a Markov chain to create a statistical model of a piece of English text. Simulate the Markov chain to generate stylized pseudo-random text.

Perspective. In the 1948 landmark paper A Mathematical Theory of Communication, Claude Shannon founded the field of information theory and revolutionized the telecommunications industry, laying the groundwork for today’s Information Age. In this paper, Shannon proposed using a Markov chain to create a statistical model of the sequences of letters in a piece of English text. Markov chains are now widely used in speech recognition, handwriting recognition, information retrieval, data compression, and spam filtering. They also have many scientific computing applications including the genemark algorithm for gene prediction, the Metropolis algorithm for measuring thermodynamical properties, and Google’s PageRank algorithm for Web search. For this assignment, we consider a whimsical variant: generating stylized pseudo-random text.

Markov model of natural language. Shannon approximated the statistical structure of a piece of text using a simple mathematical model known as a Markov model. A Markov model of order 0 predicts that each letter in the alphabet occurs with a fixed probability. We can fit a Markov model of order 0 to a specific piece of text by counting the number of occurrences of each letter in that text, and using these frequencies as probabilities. For example, if the input text is "gagggagaggcgagaaa", the Markov model of order 0 predicts that each letter is 'a' with probability 7/17, 'c' with probability 1/17, and 'g' with probability 9/17 because these are the fraction of times each letter occurs. The following sequence of letters is a typical example generated from this model:

g a g g c g a g a a g a g a a g a a a g a g a g a g a a a g a g a a g ...


A Markov model of order 0 assumes that each letter is chosen independently. This independence does not coincide with statistical properties of English text because there a high correlation among successive letters in a word or sentence. For example, 'w' is more likely to be followed with 'e' than with 'u', while 'q' is more likely to be followed with 'u' than with 'e'.

We obtain a more refined model by allowing the probability of choosing each successive letter to depend on the preceding letter or letters. A Markov model of order k predicts that each letter occurs with a fixed probability, but that probability can depend on the previous k consecutive letters. Let a k-gram mean any k consecutive letters. Then for example, if the text has 100 occurrences of "th", with 60 occurrences of "the", 25 occurrences of "thi", 10 occurrences of "tha", and 5 occurrences of "tho", the Markov model of order 2 predicts that the next letter following the 2-gram "th" is 'e' with probability 3/5, 'i' with probability 1/4, 'a' with probability 1/10, and 'o' with probability 1/20.

A brute-force solution. Claude Shannon proposed a brute-force scheme to generate text according to a Markov model of order 1:

“ To construct [a Markov model of order 1], for example, one opens a book at random and selects a letter at random on the page. This letter is recorded. The book is then opened to another page and one reads until this letter is encountered. The succeeding letter is then recorded. Turning to another page this second letter is searched for and the succeeding letter recorded, etc. It would be interesting if further approximations could be constructed, but the labor involved becomes enormous at the next stage. ”

Our task is to write a python program to automate this laborious task in a more efficient way — Shannon’s brute-force approach is prohibitively slow when the size of the input text is large.

Markov model data type.
Create an immutable data type MarkovModel to represent a Markov model of order k from a given text string. The data type must implement the following API:

• Constructor. To implement the data type, create a symbol table, whose keys will be Stringk-grams. You may assume that the input text is a sequence of characters over the ASCII alphabet so that all char values are between 0 and 127. The value type of your symbol table needs to be a data structure that can represent the frequency of each possible next character. The frequencies should be tallied as if the text were circular (i.e., as if it repeated the first k characters at the end).
• Order. Return the order k of the Markov Model.
• Frequency. There are two frequency methods.
• freq(kgram) returns the number of times the k-gram was found in the original text.
• freq(kgram, c) returns the number of times the k-gram was followed by the character c in the original text.
• Randomly generate a character. Return a character. It must be a character that followed the k-gram in the original text. The character should be chosen randomly, but the results of calling rand(kgram) several times should mirror the frequencies of characters that followed the k-gram in the original text.
• Generate pseudo-random text. Return a String of length T that is a randomly generated stream of characters whose first k characters are the argument kgram. Starting with the argument kgram, repeatedly call rand() to generate the next character. Successive k-grams should be formed by using the most recent k characters in the newly generated text. Use a StringBuilder object to build the stream of characters (otherwise, as we saw when discussing performance, your code will take order of N2 time to generate N characters, which is too slow).

To avoid dead ends, treat the input text as a circular string: the last character is considered to precede the first character.

For example, if k = 2 and the text is the 17-character string "gagggagaggcgagaaa", then the salient features of the Markov model are captured in the table below:

               frequency of   probability that
next char       next char is
kgram   freq    a   c   g        a    c    g
----------------------------------------------
aa      2      1   0   1       1/2   0   1/2
ag      5      3   0   2       3/5   0   2/5
cg      1      1   0   0        1    0    0
ga      5      1   0   4       1/5   0   4/5
gc      1      0   0   1        0    0    1
gg      3      1   1   1       1/3  1/3  1/3
----------------------------------------------
17      7   1   9

Taken from an example from the same assignment.

Note that the frequency of "ag" is 5 (and not 4) because we are treating the string as circular.

Markov chain is a stochastic process where the state change depends on only the current state. For text generation, the current state is a k-gram. The next character is selected at random, using the probabilities from the Markov model. For example, if the current state is "ga" in the Markov model of order 2 discussed above, then the next character is 'a' with probability 1/5 and 'g' with probability 4/5. The next state in the Markov chain is obtained by appending the new character to the end of the k-gram and discarding the first character. A trajectory through the Markov chain is a sequence of such states. Below is a possible trajectory consisting of 9 transitions.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
probability for a:       1/5      3/5      1/3       0        1       1/5      3/51/5      1/2
probability for c:        0        0       1/3       0        0        0        0        0        0
probability for g:       4/5      2/5      1/3       1        0       4/5      2/5      4/5      1/2


Taken from an example from the same assignment.

Treating the input text as a circular string ensures that the Markov chain never gets stuck in a state with no next characters.

To generate random text from a Markov model of order k, set the initial state to k characters from the input text. Then, simulate a trajectory through the Markov chain by performing T − k transitions, appending the random character selected at each step. For example, if k = 2 and T = 11, the following is a possible trajectory leading to the output gaggcgagaag.

trajectory:          ga  -->  ag  -->  gg  -->  gc  -->  cg  -->  ga  -->  ag  -->  ga  -->  aa  -->  ag
output:              ga        g        g        c        g        a        g        a        a        g

Text generation client. Implement a client program TextGenerator that takes two command-line integers k and T, reads the input text from standard input and builds a Markov model of order k from the input text; then, starting with the k-gram consisting of the first k letters of the input text, prints out T characters generated by simulating a trajectory through the corresponding Markov chain. We may assume that the text has length at least k, and also that T ≥ k.

## A Python Implementation

import numpy as np
from collections import defaultdict

class MarkovModel:

def __init__(self, text, k):
'''
create a Markov model of order k from given text
Assume that text has length at least k.
'''
self.k = k
self.tran = defaultdict(float)
self.alph = list(set(list(text)))
self.kgrams = defaultdict(int)
n = len(text)
text += text[:k]
for i in range(n):
self.tran[text[i:i+k],text[i+k]] += 1.
self.kgrams[text[i:i+k]] += 1

def order(self):                   # order k of Markov model
return self.k

def freq(self, kgram):             # number of occurrences of kgram in text
assert len(kgram) == self.k    # (check if kgram is of length k)
return self.kgrams[kgram]

def freq2(self, kgram, c):   	# number of times that character c follows kgram
assert len(kgram) == self.k     # (check if kgram is of length k)
return self.tran[kgram,c]

def rand(self, kgram):             # random character following given kgram
assert len(kgram) == self.k    # (check if kgram is of length k.
Z = sum([self.tran[kgram, alph] for alph in self.alph])
return np.random.choice(self.alph, 1, p=np.array([self.tran[kgram, alph] for alph in self.alph])/Z)

def gen(self, kgram, T):           # generate a String of length T characters
assert len(kgram) == self.k    # by simulating a trajectory through the corresponding
str = ''                       # Markov chain. The first k characters of the newly
for _ in range(T):             # generated String should be the argument kgram.
#print kgram, c           # check if kgram is of length k.
c =  self.rand(kgram)[0]  # Assume that T is at least k.
kgram = kgram[1:] + c
str += c
return str

## Some Results

m = MarkovModel('gagggagaggcgagaaa', 2)

generates the following MarkovChain where each state represents a 2-gram.

#### Input: news item (taken from the assignment)

Microsoft said Tuesday the company would comply with a preliminary ruling by Federal District Court Judge Ronald H. Whyte that Microsoft is no longer able to use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developers Kit for Java.

“We remain confident that once all the facts are presented in the larger case, the court will find Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Counsel for Microsoft Corporation. “We are disappointed with this decision, but we will immediately comply with the Court’s order.”

Microsoft has been in the forefront of helping developers use the Java programming language to write cutting-edge applications. The company has committed significant resources so that Java developers have the option of taking advantage of Windows features when writing software using the Java language. Providing the best tools and programming options will continue to be Microsoft’s goal.

“We will continue to listen to our customers and provide them the tools they need to write great software using the Java language,” added Tod Nielsen, General Manager for Microsoft’s Developer Relations Group/Platform Marketing.

Markov Model learnt

#### Generated output: random news item, using input as an order 7 model

Microsoft is no longer able to use the Java language,” added Tod Nielsen, General Counsel for Microsoft’s Developers use the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software Developer Relations Group/Platform Marketing.Microsoft to be Microsoft Corporation. “We are disappointed with the Court’s order.”

Microsoft is no longer able to use the Java language. Providing the best tools and provide them the tools and programming options will continue to listen to our customers and provide them the tools and programming option of taking advantage of Windows features when writing software using the Java Compatibility Logo on its packaging and websites for Internet Explorer and Software using the Java programming option of taking advantage of Windows features when writing software Developer Relations Group/Platform Marketing.Microsoft to be in full compliance with its contract with Sun,” stated Tom Burt, Associate General Manager for Microsoft’s goal.

“We will continue to listen to our customers and programming language. Providing the Java language,” added Tod Nielsen, General Manager for Microsoft is no longer able to use the Java language.

## Noisy Text Correction

Imagine we receive a message where some of the characters have been corrupted by noise. We represent unknown characters by the ~ symbol (we assume we don’t use ~ in our messages). Add a method replaceUnknown that decodes a noisy message by replacing each ~ with the most likely character given our order k Markov model, and conditional on the surrounding text:

def replaceUnknown(corrupted)  # replace unknown characters with most probable characters


Assume unknown characters are at least k characters apart and also appear at least k characters away from the start and end of the message.  This maximum-likelihood approach doesn’t always get it perfect, but it fixes most of the missing characters correctly.

Here are some details on what it means to find the most likely replacement for each ~. For each unknown character, you should consider all possible replacement characters. We want the replacement character that makes sense not only at the unknown position (given the previous characters) but also when the replacement is used in the context of the k subsequent known characters. For example we expect the unknown character in "was ~he wo" to be 't' and not simply the most likely character in the context of "was ". You can compute the probability of each hypothesis by multiplying the probabilities of generating each of k+1 characters in sequence: the missing one, and the k next ones. The following figure illustrates how we want to consider k+1 windows to maximize the log-likelihood:

Using the algorithm described above, here are the results obtained for the following example:

Original : it was the best of times, it was the worst of times.
Noisy :      it w~s th~ bes~ of tim~s, i~ was ~he wo~st of~times.
Corrected (k=4): it was the best of times, it was the worst of times.
Corrected (k=2): it was the best of times, in was the wo st of times.

# 2. Detecting authorship

This problem appeared as an assignment in the Cornell course cs1114 .

## The Problem Statement

In this assignment, we shall be implementing an authorship detector which, when
given a large sample size of text to train on, can then guess the author of an unknown
text.

• The algorithm to be implemented works based on the following idea: An author’s
writing style can be defined quantitatively by looking at the words he uses. Specifically, we want to keep track of his word flow – that is, which words he tends to use after other words.
• To make things significantly simpler, we’re going to assume that the author always
follows a given word with the same distribution of words. Of course, this isn’t true,
since the words you choose when writing obviously depend on context. Nevertheless, this simplifying assumption should hold over an extended amount of text, where context becomes less relevant.
• In order to implement this model of an author’s writing style, we will use a Markov
chain. A Markov chain is a set of states with the Markov property – that is, the
probabilities of each state are independent from the probabilities of every other state. This behavior correctly models our assumption of word independence.
• A Markov chain can be represented as a directed graph. Each node is a state (words,
in our case), and a directed edge going from state Si to Sj represents the probability we will go to Sj when we’re at Si. We will implement this directed graph as a transition matrix. Given a set of words W1, W2, …Wn, we can construct an n by n transition matrix A, where an edge from Wi to Wj of weight p means Aij = p.
• The edges, in this case, represent the probability that word j follows word i from the
given author. This means, of course, that the sum of the weights of all edges leaving
from each word must add up to 1.
• We can construct this graph from a large sample text corpus. Our next step would be finding the author of an unknown, short chunk of text. To do this, we simply compute the probability of this unknown text occurring, using the words in that order, in each of our Markov chains. The author would likely be the one with the highest probability.
• We shall implement the Markov chain model of writing style. We are given some sample texts to train our model on, as well as some challenges for you to
figure out.

## Constructing the transition matrix

Our first step is to construct the transition matrix representing our Markov chain.
First, we must read the text from a sample file. We shall want to create a sparse array using the scipy csr sparse matrix. Along with the transition matrix, we shall be creating a corresponding vector that contains 2 word frequencies (normalized by the total number of words in the document (including repeated words)).

## Calculating likelihood

Once we have our transition matrix, we can calculate the likelihood of an unknown
sample of text. We are given several pieces of literature by various authors,
as well as excerpts from each of the authors as test dataset. Our goal is to identify the authors of each excerpt.

To do so, we shall need to calculate the likelihood of the excerpt occurring in each author’s transition matrix. Recall that each edge in the directed graph that the transition
matrix represents is the probability that the author follows a word with another.
Since we shall be multiplying numerous possibly small probabilities together, our
calculated likelihood will likely be extremely small. Thus, you should compare log(likelihood) instead. Keep in mind the possibility that the author may have used a word he has never used before. Our calculated likelihood should not eliminate an author completely because of this. We shall be imposing a high penalty if a word is missing.

## Finding the author with the maximum likelihood

Now that we can compute likelihoods, the next step is to write a routine that takes a set of transition matrices and dictionaries, and a sequence of text, and returns the index of the transition matrix that results in the highest likelihood. You will write this in a function classify text, which takes transition matrices, dictionaries, histograms, and the name of the file containing the test text, and returns a single integer best index.  The following figure shows how to detect an author k (A_k) of the test text t_1..t_n  using the transition matrix P_k with MLE :

## Python Implementation

from np import log
def log0(x):
return 0 if x <= 0 else log(x)

def compute_text_likelihood(filename, T, dict_rev, histogram, index):

”’
Compute the (log) likelihood L of a given string (in ‘filename’) given
a word transition probability T, dictionary ‘dict’, and histogram
‘histogram’
”’

num_words = len(text)
text = [word for word in text if word in histogram] # keep only the words that are found in the training dataset
ll = log0(histogram[text[0]]) – log0(sum(histogram.values()))
for i in range(1, len(text)):
ll += log0(T[dict_rev[text[i-1]], dict_rev[text[i]]])
return ll + (num_words – num_matches)*penalty

def classify_text(tmatrices, dict_revs, histograms, filename):

”’
Return the index of the most likely author given the transition matrices, dictionaries, and a test file
”’

for i in range(len(tmatrices)):
ll = compute_text_likelihood(filename, tmatrices[i], dict_revs[i], histograms[i], i)

print i, ll

## Training Dataset

The list of authors whose writings are there in the training dataset:

0. Austen
1. Carroll
2. Hamilton
3. Jay
5. Shakespeare
6. Thoreau
7. Twain

## A few lines of excerpts from the training files (the literature by several authors, taken from Project Gutenberg), the word clouds and a few states from the corresponding Markov Models Constructed for a few authors

Author JANE AUSTEN (Texts taken from Emma, Mansfield Park, Pride and Prejudice)

Emma Woodhouse, handsome, clever, and rich, with a comfortable home and
happy disposition, seemed to unite some of the best blessings of
existence; and had lived nearly twenty-one years in the world with very
little to distress or vex her.

She was the youngest of the two daughters of a most affectionate,
indulgent father; and had, in consequence of her sister’s marriage,
been mistress of his house from a very early period. Her mother had
died too long ago for her to have more than an indistinct remembrance
of her caresses; and her place had been supplied by an excellent woman
as governess, who had fallen little short of a mother in affection.

Sixteen years had Miss Taylor been in Mr. Woodhouse’s family, less as a
governess than a friend, very fond of both daughters, but particularly
of Emma. Between _them_ it was more the intimacy of sisters. Even
before Miss Taylor had ceased to hold the nominal office of governess,
the mildness of her temper had hardly allowed her to impose any
restraint; and the shadow of authority being now long passed away, they
had been living together as friend and friend very mutually attached,
and Emma doing just what she liked; highly esteeming Miss Taylor’s
judgment, but directed chiefly by her own.

The real evils, indeed, of Emma’s situation were the power of having
rather too much her own way, and a disposition to think a little too
well of herself; these were the disadvantages which threatened alloy to
her many enjoyments. The danger, however, was at present so
unperceived, that they did not by any means rank as misfortunes with
her.

Author Lewis Carroll (Texts taken from Alice’s Adventures in Wonderland, Sylvie and Bruno)

Alice was beginning to get very tired of sitting by her sister on the
bank, and of having nothing to do: once or twice she had peeped into the
book her sister was reading, but it had no pictures or conversations in
it, ‘and what is the use of a book,’ thought Alice ‘without pictures or
conversation?’

So she was considering in her own mind (as well as she could, for the
hot day made her feel very sleepy and stupid), whether the pleasure
of making a daisy-chain would be worth the trouble of getting up and
picking the daisies, when suddenly a White Rabbit with pink eyes ran
close by her.

There was nothing so VERY remarkable in that; nor did Alice think it so
VERY much out of the way to hear the Rabbit say to itself, ‘Oh dear!
Oh dear! I shall be late!’ (when she thought it over afterwards, it
occurred to her that she ought to have wondered at this, but at the time
it all seemed quite natural); but when the Rabbit actually TOOK A WATCH
OUT OF ITS WAISTCOAT-POCKET, and looked at it, and then hurried on,
Alice started to her feet, for it flashed across her mind that she had
never before seen a rabbit with either a waistcoat-pocket, or a watch
to take out of it, and burning with curiosity, she ran across the field
after it, and fortunately was just in time to see it pop down a large
rabbit-hole under the hedge.

In another moment down went Alice after it, never once considering how
in the world she was to get out again.

The rabbit-hole went straight on like a tunnel for some way, and then
dipped suddenly down, so suddenly that Alice had not a moment to think
about stopping herself before she found herself falling down a very deep
well.

Author William Shakespeare (Texts taken from Henry IV Part 1, Romeo and Juliet, Twelfth Night)

KING.
So shaken as we are, so wan with care,
Find we a time for frighted peace to pant,
And breathe short-winded accents of new broils
To be commenced in strands afar remote.
No more the thirsty entrance of this soil
Shall daub her lips with her own children’s blood;
No more shall trenching war channel her fields,
Nor bruise her flowerets with the armed hoofs
Of hostile paces: those opposed eyes,
Which, like the meteors of a troubled heaven,
All of one nature, of one substance bred,
Did lately meet in the intestine shock
And furious close of civil butchery,
Shall now, in mutual well-beseeming ranks,
March all one way, and be no more opposed
Against acquaintance, kindred, and allies:
The edge of war, like an ill-sheathed knife,
No more shall cut his master. Therefore, friends,
As far as to the sepulchre of Christ–
Whose soldier now, under whose blessed cross
We are impressed and engaged to fight–
Forthwith a power of English shall we levy,
To chase these pagans in those holy fields
Over whose acres walk’d those blessed feet
Which fourteen hundred years ago were nail’d
For our advantage on the bitter cross.
But this our purpose now is twelvemonth old,
And bootless ’tis to tell you we will go:
Therefore we meet not now.–Then let me hear
Of you, my gentle cousin Westmoreland,
What yesternight our Council did decree
In forwarding this dear expedience.

WEST.
My liege, this haste was hot in question,
And many limits of the charge set down
But yesternight; when, all athwart, there came
A post from Wales loaden with heavy news;
Whose worst was, that the noble Mortimer,
Leading the men of Herefordshire to fight
Against th’ irregular and wild Glendower,
Was by the rude hands of that Welshman taken;
A thousand of his people butchered,
Upon whose dead corpse’ there was such misuse,
Such beastly, shameless transformation,
By those Welshwomen done, as may not be
Without much shame re-told or spoken of.

Author MARK TWAIN (Texts taken from A Connecticut Yankee in King Arthur’s Court, The Adventures of Huckleberry Finn, The Prince and the Pauper)

I am an American. I was born and reared in Hartford, in the State
of Connecticut–anyway, just over the river, in the country. So
I am a Yankee of the Yankees–and practical; yes, and nearly
barren of sentiment, I suppose–or poetry, in other words. My
father was a blacksmith, my uncle was a horse doctor, and I was
both, along at first. Then I went over to the great arms factory
and learned my real trade; learned all there was to it; learned
to make everything: guns, revolvers, cannon, boilers, engines, all
sorts of labor-saving machinery. Why, I could make anything
a body wanted–anything in the world, it didn’t make any difference
what; and if there wasn’t any quick new-fangled way to make a thing,
I could invent one–and do it as easy as rolling off a log. I became

Well, a man like that is a man that is full of fight–that goes
without saying. With a couple of thousand rough men under one,
one has plenty of that sort of amusement. I had, anyway. At last
I met my match, and I got my dose. It was during a misunderstanding
conducted with crowbars with a fellow we used to call Hercules.
He laid me out with a crusher alongside the head that made everything
crack, and seemed to spring every joint in my skull and made it
overlap its neighbor. Then the world went out in darkness, and
I didn’t feel anything more, and didn’t know anything at all
–at least for a while.

When I came to again, I was sitting under an oak tree, on the
grass, with a whole beautiful and broad country landscape all
to myself–nearly. Not entirely; for there was a fellow on a horse,
looking down at me–a fellow fresh out of a picture-book. He was
in old-time iron armor from head to heel, with a helmet on his
head the shape of a nail-keg with slits in it; and he had a shield,
and a sword, and a prodigious spear; and his horse had armor on,
too, and a steel horn projecting from his forehead, and gorgeous
red and green silk trappings that hung down all around him like
a bedquilt, nearly to the ground.

“Fair sir, will ye just?” said this fellow.

“Will I which?”

“Will ye try a passage of arms for land or lady or for–”

“What are you giving me?” I said. “Get along back to your circus,
or I’ll report you.”

Now what does this man do but fall back a couple of hundred yards
and then come rushing at me as hard as he could tear, with his
nail-keg bent down nearly to his horse’s neck and his long spear
pointed straight ahead. I saw he meant business, so I was up
the tree when he arrived.

## Classifying unknown texts from the Test Dataset

Each of the Markov models learnt from the training texts for each author are used to compute the log-likelihood of the unknown test text, the author with the maximum log-likelihood is chosen to be the likely author of the text.
Unknown Text1

Against the interest of her own individual comfort, Mrs. Dashwood had
determined that it would be better for Marianne to be any where, at
that time, than at Barton, where every thing within her view would be
bringing back the past in the strongest and most afflicting manner, by
constantly placing Willoughby before her, such as she had always seen
him there. She recommended it to her daughters, therefore, by all
means not to shorten their visit to Mrs. Jennings; the length of which,
though never exactly fixed, had been expected by all to comprise at
least five or six weeks. A variety of occupations, of objects, and of
company, which could not be procured at Barton, would be inevitable
there, and might yet, she hoped, cheat Marianne, at times, into some
interest beyond herself, and even into some amusement, much as the
ideas of both might now be spurned by her.

From all danger of seeing Willoughby again, her mother considered her
to be at least equally safe in town as in the country, since his
acquaintance must now be dropped by all who called themselves her
friends. Design could never bring them in each other’s way: negligence
could never leave them exposed to a surprise; and chance had less in
its favour in the crowd of London than even in the retirement of
Barton, where it might force him before her while paying that visit at
Allenham on his marriage, which Mrs. Dashwood, from foreseeing at first
as a probable event, had brought herself to expect as a certain one.

She had yet another reason for wishing her children to remain where
they were; a letter from her son-in-law had told her that he and his
wife were to be in town before the middle of February, and she judged
it right that they should sometimes see their brother.

Marianne had promised to be guided by her mother’s opinion, and she
submitted to it therefore without opposition, though it proved
perfectly different from what she wished and expected, though she felt
it to be entirely wrong, formed on mistaken grounds, and that by
requiring her longer continuance in London it deprived her of the only
possible alleviation of her wretchedness, the personal sympathy of her
mother, and doomed her to such society and such scenes as must prevent
her ever knowing a moment’s rest.

But it was a matter of great consolation to her, that what brought evil
to herself would bring good to her sister; and Elinor, on the other
hand, suspecting that it would not be in her power to avoid Edward
entirely, comforted herself by thinking, that though their longer stay
would therefore militate against her own happiness, it would be better
for Marianne than an immediate return into Devonshire.

Her carefulness in guarding her sister from ever hearing Willoughby’s
name mentioned, was not thrown away. Marianne, though without knowing
it herself, reaped all its advantage; for neither Mrs. Jennings, nor
Sir John, nor even Mrs. Palmer herself, ever spoke of him before her.
Elinor wished that the same forbearance could have extended towards
herself, but that was impossible, and she was obliged to listen day
after day to the indignation of them all.

log-likelihood values computed for the probable authors

Author  LL
0           -3126.5812874
1           -4127.9155186
2           -7364.15782346
3           -9381.06336055
4           -7493.78440066
5           -4837.98005673
6           -3515.44028659
7           -3455.85716104

As can be seen from above the maximum likelihood value corresponds to the author 0, i.e., Austen.  Hence, the most probable author of the unknown text is Austen.

Unknown Text2

Then he tossed the marble away pettishly, and stood cogitating. The
truth was, that a superstition of his had failed, here, which he and
all his comrades had always looked upon as infallible. If you buried a
marble with certain necessary incantations, and left it alone a
fortnight, and then opened the place with the incantation he had just
used, you would find that all the marbles you had ever lost had
gathered themselves together there, meantime, no matter how widely they
failed. Tom’s whole structure of faith was shaken to its foundations.
He had many a time heard of this thing succeeding but never of its
failing before. It did not occur to him that he had tried it several
times before, himself, but could never find the hiding-places
afterward. He puzzled over the matter some time, and finally decided
that some witch had interfered and broken the charm. He thought he
would satisfy himself on that point; so he searched around till he
found a small sandy spot with a little funnel-shaped depression in it.
He laid himself down and put his mouth close to this depression and
called–

“Doodle-bug, doodle-bug, tell me what I want to know! Doodle-bug,
doodle-bug, tell me what I want to know!”

The sand began to work, and presently a small black bug appeared for a
second and then darted under again in a fright.

“He dasn’t tell! So it WAS a witch that done it. I just knowed it.”

He well knew the futility of trying to contend against witches, so he
gave up discouraged. But it occurred to him that he might as well have
the marble he had just thrown away, and therefore he went and made a
patient search for it. But he could not find it. Now he went back to
his treasure-house and carefully placed himself just as he had been
standing when he tossed the marble away; then he took another marble
from his pocket and tossed it in the same way, saying:

He watched where it stopped, and went there and looked. But it must
have fallen short or gone too far; so he tried twice more. The last
repetition was successful. The two marbles lay within a foot of each
other.

log-likelihood values computed for the probable authors

Author  LL
0             -2779.02810424
1             -2738.09304225
2             -5978.83684489
3             -6551.16571407
4             -5780.39620942
5             -4166.34886511
6             -2309.25043697
7             -2033.00112729

As can be seen from above the maximum likelihood value corresponds to the author 7, i.e., Twain.  Hence, the most probable author of the unknown text is Twain. The following figure shows the relevant states corresponding to the Markov model for Twain trained from the training dataset.

# Measuring Semantic Relatedness using the Distance and the Shortest Common Ancestor and Outcast Detection with Wordnet Digraph in Python

The following problem appeared as an assignment in the Algorithm Course (COS 226) at Princeton University taught by Prof. Sedgewick.  The description of the problem is taken from the assignment itself. However, in the assignment, the implementation is supposed to be in java, in this article a python implementation will be described instead. Instead of using nltk, this implementation is going to be from scratch.

## The Problem

• WordNet is a semantic lexicon for the English language that computational linguists and cognitive scientists use extensively. For example, WordNet was a key component in IBM’s Jeopardy-playing Watson computer system.
• WordNet groups words into sets of synonyms called synsets. For example, { AND circuitAND gate } is a synset that represent a logical gate that fires only when all of its inputs fire.
• WordNet also describes semantic relationships between synsets. One such relationship is the is-a relationship, which connects a hyponym (more specific synset) to a hypernym (more general synset). For example, the synset gatelogic gate } is a hypernym of { AND circuitAND gate } because an AND gate is a kind of logic gate.
• The WordNet digraph. The first task is to build the WordNet digraph: each vertex v is an integer that represents a synset, and each directed edge v→w represents that w is a hypernym of v.
• The WordNet digraph is a rooted DAG: it is acyclic and has one vertex—the root— that is an ancestor of every other vertex.
• However, it is not necessarily a tree because a synset can have more than one hypernym. A small subgraph of the WordNet digraph appears below.

### The WordNet input file formats

The following two data files will be used to create the WordNet digraph. The files are in comma-separated values (CSV) format: each line contains a sequence of fields, separated by commas.

• List of synsets: The file synsets.txt contains all noun synsets in WordNet, one per line. Line i of the file (counting from 0) contains the information for synset i.
• The first field is the synset id, which is always the integer i;
• the second field is the synonym set (or synset); and
• the third field is its dictionary definition (or gloss), which is not relevant to this assignment.

For example, line 36 means that the synset { AND_circuitAND_gate } has an id number of 36 and its gloss is a circuit in a computer that fires only when all of its inputs fire. The individual nouns that constitute a synset are separated by spaces. If a noun contains more than one word, the underscore character connects the words (and not the space character).
• List of hypernyms: The file hypernyms.txt contains the hypernym relationships. Line i of the file (counting from 0) contains the hypernyms of synset i.
• The first field is the synset id, which is always the integer i;
• subsequent fields are the id numbers of the synset’s hypernyms.

For example, line 36 means that synset 36 (AND_circuit AND_Gate) has 42338 (gate logic_gate) as its only hypernym. Line 34 means that synset 34 (AIDS acquired_immune_deficiency_syndrome) has two hypernyms: 47569 (immunodeficiency) and 56099 (infectious_disease).

### The WordNet data type

Implement an immutable data type WordNet with the following API:

• The Wordnet Digraph contains 76066 nodes and 84087 edges, it’s very difficult to visualize the entire graph at once, hence small subgraphs will be displayed as and when required relevant to the context of the examples later.

• The sca() and the distance() between two nodes v and w are implemented using bfs (bread first search) starting from the two nodes separately and combining the distances computed.

Performance requirements

• The data type must use space linear in the input size (size of synsets and hypernyms files).
• The constructor must take time linearithmic (or better) in the input size.
• The method isNoun() must run in time logarithmic (or better) in the number of nouns.
• The methods distance() and sca() must make exactly one call to the length() and ancestor() methods in ShortestCommonAncestor, respectively.

### The Shortest Common Ancestor

• An ancestral path between two vertices v and w in a rooted DAG is a directed path from v to a common ancestor x, together with a directed path from w to the same ancestor x.
• shortest ancestral path is an ancestral path of minimum total length.
• We refer to the common ancestor in a shortest ancestral path as a shortest common ancestor.
• Note that a shortest common ancestor always exists because the root is an ancestor of every vertex. Note also that an ancestral path is a path, but not a directed path.

• The following animation shows how the shortest common ancestor node 1 for the nodes 3 and 10  for the following rooted DAG is found at distance 4 with bfs, along with the ancestral path 3-1-5-9-10.
• The following animation shows how the shortest common ancestor node 5 for the nodes 8 and 11  for the following rooted DAG is found at distance 3 with bfs, along with the ancestral path 8-5-9-11.
• The following animation shows how the shortest common ancestor node 0 for the nodes 2 and for the following rooted DAG is found at distance 4 with bfs, along with the ancestral path 6-3-1-0-2.

• We generalize the notion of shortest common ancestor to subsets of vertices. A shortest ancestral path of two subsets of vertices A and B is a shortest ancestral path over all pairs of vertices v and w, with v in A and w in B.
• The figure (digraph25.txt) below shows an example in which, for two subsets, red and blue, we have computed several (but not all) ancestral paths, including the shortest one.

• The following animation shows how the shortest common ancestor node 3 for the set of nodes {13, 23, 24} and {6, 16, 17}  for the following rooted DAG is found at associated length (distance) with bfs, along with the ancestral path 13-7-3-9-16.Shortest common ancestor data type

Implement an immutable data type ShortestCommonAncestor with the following API:

Basic performance requirements

The data type must use space proportional to E + V, where E and V are the number of edges and vertices in the digraph, respectively. All methods and the constructor must take time proportional to EV (or better).

### Measuring the semantic relatedness of two nouns

Semantic relatedness refers to the degree to which two concepts are related. Measuring semantic relatedness is a challenging problem. For example, let’s consider George W. Bushand John F. Kennedy (two U.S. presidents) to be more closely related than George W. Bush and chimpanzee (two primates). It might not be clear whether George W. Bush and Eric Arthur Blair are more related than two arbitrary people. However, both George W. Bush and Eric Arthur Blair (a.k.a. George Orwell) are famous communicators and, therefore, closely related.

Let’s define the semantic relatedness of two WordNet nouns x and y as follows:

• A = set of synsets in which x appears
• B = set of synsets in which y appears
• distance(x, y) = length of shortest ancestral path of subsets A and B
• sca(x, y) = a shortest common ancestor of subsets A and B

This is the notion of distance that we need to use to implement the distance() and sca() methods in the WordNet data type.

### Finding semantic relatedness for some example nouns with the shortest common ancestor and the distance method implemented

apple and potato (distance 5 in the Wordnet Digraph, as shown below)

As can be seen, the noun entity is the root of the Wordnet DAG.

beer and diaper (distance 13 in the Wordnet Digraph)

beer and milk (distance 4 in the Wordnet Digraph, with SCA as drink synset), as expected since they are more semantically closer to each other.

bread and butter (distance 3 in the Wordnet Digraph, as shown below)

cancer and AIDS (distance 6 in the Wordnet Digraph, with SCA as disease as shown below, bfs computed distances and the target distance between the nouns are also shown)

car and vehicle (distance 2 in the Wordnet Digraph, with SCA as vehicle as shown below)

cat and dog (distance 4 in the Wordnet Digraph, with SCA as carnivore as shown below)

cat and milk (distance 7 in the Wordnet Digraph, with SCA as substance as shown below, here cat is identified as Arabian tea)

Einstein and Newton (distance 2 in the Wordnet Digraph, with SCA as physicist as shown below)

Leibnitz and Newton (distance 2 in the Wordnet Digraph, with SCA as mathematician)

Gandhi and Mandela (distance 2 in the Wordnet Digraph, with SCA as national_leader synset)

laptop and internet (distance 11 in the Wordnet Digraph, with SCA as instrumentation synset)

school and office (distance 5 in the Wordnet Digraph, with SCA as construction synset as shown below)

bed and table (distance 3 in the Wordnet Digraph, with SCA as furniture synset as shown below)

Tagore and Einstein (distance 4 in the Wordnet Digraph, with SCA as intellectual synset as shown below)

Tagore and Gandhi (distance 8 in the Wordnet Digraph, with SCA as person synset as shown below)

Tagore and Shelley (distance 2 in the Wordnet Digraph, with SCA as author as shown below)

text and mining (distance 12 in the Wordnet Digraph, with SCA as abstraction synset as shown below)

milk and water (distance 3 in the Wordnet Digraph, with SCA as food, as shown below)

### Outcast detection

Given a list of WordNet nouns x1x2, …, xn, which noun is the least related to the others? To identify an outcast, compute the sum of the distances between each noun and every other one:

di   =   distance(xix1)   +   distance(xix2)   +   …   +   distance(xixn)

and return a noun xt for which dt is maximum. Note that distance(xixi) = 0, so it will not contribute to the sum.

Implement an immutable data type Outcast with the following API:

### Examples

As expected, potato is the outcast  in the list of the nouns shown below (a noun with maximum distance from the rest of the nouns, all of which except potato are fruits, but potato is not). It can be seen from the Wordnet Distance heatmap from the next plot, as well as the sum of distance plot from the plot following the next one.

Again, as expected, table is the outcast  in the list of the nouns shown below (a noun with maximum distance from the rest of the nouns, all of which except table are mammals, but table is not). It can be seen from the Wordnet Distance heatmap from the next plot, as well as the sum of distance plot from the plot following the next one.

Finally, as expected, bed is the outcast  in the list of the nouns shown below (a noun with maximum distance from the rest of the nouns, all of which except bed are drinks, but bed is not). It can be seen from the Wordnet Distance heatmap from the next plot, as well as the sum of distance plot from the plot following the next one.

# Online Learning: Sentiment Analysis on Amazon Product Review Dataset with Logistic Regression via Stochastic Gradient Ascent in Python

This problem appeared as an assignment in the coursera course Machine Learning – Classification, part of Machine Learning specialization by the University of Washington. The following description of the problem is taken directly from the assignment.

The goal of this assignment is to implement an online logistic regression classifier using stochastic gradient ascent. The following are the sub-tasks:

• Amazon product reviews dataset is used along with positive / negative labels as the training dataset.
• Bag of words features will be extracted from the training dataset, only a pre-selected set of important words will be used as features.
• The partial derivative of log likelihood (with L2 penalty) with respect to each single coefficient is computed.
• Stochastic gradient ascent is implemented from sractch.
• Convergence of stochastic gradient ascent is compared with that of batch gradient ascent.

## Load and process review dataset

For this assignment, a subset of Amazon product review dataset is going to be used. The subset was chosen to contain similar numbers of positive and negative reviews, as the original dataset consisted of mostly positive reviews.

Preprocessing: We shall work with a hand-curated list of important words extracted from the review data. We will also perform 2 simple data transformations:

1. Remove punctuation using string manipulation functionality.
2. Compute word counts (only for the important_words)

The data frame products now contains one column for each of the 193 important_words.

### Split data into training and validation sets

We will now split the data into a 90-10 split where 90% is in the training set and 10% is in the validation set. We use seed=1 so that everyone gets the same result.

Training set  : 47765 data points
Validation set: 5307 data points


An additional colum ‘intercept‘ (filled with 1’s) is needed to be inserted into the data frame to take account of the intercept term.

## Building on logistic regression

Now the link function for logistic regression can be defined as:

where the feature vector h(xi)h(xi) is given by the word counts of important_words in the review xixi. Now our goal is to maximize the log-likelihood function, which does not have a closed-form solution, hence techniques like gradient descent needs to be used.

The way the probability predictions are computed is not affected by using stochastic gradient ascent as a solver. Only the way in which the coefficients are learned is affected by using stochastic gradient ascent as a solver.

Note. We are not using regularization in this assignment, but, stochastic gradient can also be used for regularized logistic regression, there will be one addtional term for regularization in the partial derivative.

To verify the correctness of the gradient computation, we use a function for computing average log likelihood (to be used for its numerical stability).

To track the performance of stochastic gradient ascent, we provide a function for computing average log likelihood.

Note the added 1/N term which averages the log likelihood across all data points. The 1/N term makes it easier for us to compare stochastic gradient ascent with batch gradient ascent.

In other words, we update the coefficients using the average gradient over data points (instead of using a summation). By using the average gradient, we ensure that the magnitude of the gradient is approximately the same for all batch sizes. This way, we can more easily compare various batch sizes of stochastic gradient ascent (including a batch size of all the data points), and study the effect of batch size on the algorithm as well as the choice of step size.

Now let’s extend the algorithm for batch gradient ascent (that takes all the data at once) to stochastic (takes one data point at a time) and mini-batch gradient ascent (that takes data in small batches as input, computes the gradient and updates the coefficients). The following figure shows the gradient ascent algorithm, which needs to scaled dwon by appropriate batch size.

## Compare convergence behavior of stochastic gradient ascent

For the remainder of the assignment, let’s compare stochastic gradient ascent against batch gradient ascent. For this, we need a reference implementation of batch gradient ascent.

Let’s now run stochastic gradient ascent over the feature_matrix_train for 10 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = 1
• max_iter = 10
Iteration 0: Average log likelihood (of data points in batch [00000:00001]) = -0.01416346
Iteration 1: Average log likelihood (of data points in batch [00001:00002]) = -0.00505439
Iteration 2: Average log likelihood (of data points in batch [00002:00003]) = -0.00177457
Iteration 3: Average log likelihood (of data points in batch [00003:00004]) = -0.00311449
Iteration 4: Average log likelihood (of data points in batch [00004:00005]) = -0.06140707
Iteration 5: Average log likelihood (of data points in batch [00005:00006]) = -0.00000011
Iteration 6: Average log likelihood (of data points in batch [00006:00007]) = -0.02461738
Iteration 7: Average log likelihood (of data points in batch [00007:00008]) = -0.00876472
Iteration 8: Average log likelihood (of data points in batch [00008:00009]) = -0.00003921
Iteration 9: Average log likelihood (of data points in batch [00009:00010]) = -0.00000620

As expected, as each iteration passes, how does the average log likelihood in the batch fluctuates with stochastic gradient descent (i.e., with batch size 1).

Now let’s again run batch gradient ascent over the feature_matrix_train but this time for 200 iterations using:

• initial_coefficients = zeros
• step_size = 5e-1
• batch_size = # data points in the training dataset
• max_iter = 200
Iteration   0: Average log likelihood (of data points in batch [00000:47765]) = -0.68313840
Iteration   1: Average log likelihood (of data points in batch [00000:47765]) = -0.67402166
Iteration   2: Average log likelihood (of data points in batch [00000:47765]) = -0.66563558
Iteration   3: Average log likelihood (of data points in batch [00000:47765]) = -0.65788618
Iteration   4: Average log likelihood (of data points in batch [00000:47765]) = -0.65070149
Iteration   5: Average log likelihood (of data points in batch [00000:47765]) = -0.64402111
Iteration   6: Average log likelihood (of data points in batch [00000:47765]) = -0.63779295
Iteration   7: Average log likelihood (of data points in batch [00000:47765]) = -0.63197173
Iteration   8: Average log likelihood (of data points in batch [00000:47765]) = -0.62651787
Iteration   9: Average log likelihood (of data points in batch [00000:47765]) = -0.62139668
Iteration  10: Average log likelihood (of data points in batch [00000:47765]) = -0.61657763
Iteration  11: Average log likelihood (of data points in batch [00000:47765]) = -0.61203378
Iteration  12: Average log likelihood (of data points in batch [00000:47765]) = -0.60774127
Iteration  13: Average log likelihood (of data points in batch [00000:47765]) = -0.60367892
Iteration  14: Average log likelihood (of data points in batch [00000:47765]) = -0.59982787
Iteration  15: Average log likelihood (of data points in batch [00000:47765]) = -0.59617125
Iteration 100: Average log likelihood (of data points in batch [00000:47765]) = -0.49541833
Iteration 199: Average log likelihood (of data points in batch [00000:47765]) = -0.47143083


As expected, with (full) batch gradient ascent, as each iteration passes, the average log likelihood in the batch continuously increases.

## Make “passes” over the dataset

To make a fair comparison between stochastic gradient ascent and batch gradient ascent, we measure the average log likelihood as a function of the number of passes (defined as follows):

## Log likelihood plots for stochastic gradient ascent

With the terminology in mind, let us run stochastic gradient ascent for 10 passes. We will use

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078


Let’s plot the average log likelihood as a function of the number of passes.

## Smoothing the stochastic gradient ascent curve

The plotted line oscillates so much that it is hard to see whether the log likelihood is improving. In our plot, we apply a simple smoothing operation using the parameter smoothing_window. The smoothing is simply a moving average of log likelihood over the last smoothing_window “iterations” of stochastic gradient ascent.

To compare convergence rates for stochastic gradient ascent with batch gradient ascent, let’s plot the change in log-likelihood with the iterations.

We are comparing:

• stochastic gradient ascent: step_size = 0.1, batch_size=100
• batch gradient ascent: step_size = 0.5, batch_size= # data points

Write code to run stochastic gradient ascent for 200 passes using:

• step_size=1e-1
• batch_size=100
• initial_coefficients to all zeros.
Iteration     0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration     1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration     2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration     3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration     4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration     5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration     6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration     7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration     8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration     9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration    10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration    11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration    12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration    13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration    14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration    15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration   100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration   200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration   300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration   400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration   500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration   600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration   700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration   800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration   900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration  1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration  2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration  3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration  4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration  5000: Average log likelihood (of data points in batch [23000:23100]) = -0.41321275
Iteration  6000: Average log likelihood (of data points in batch [27600:27700]) = -0.42095621
Iteration  7000: Average log likelihood (of data points in batch [32200:32300]) = -0.47438456
Iteration  8000: Average log likelihood (of data points in batch [36800:36900]) = -0.40689130
Iteration  9000: Average log likelihood (of data points in batch [41400:41500]) = -0.44582019
Iteration 10000: Average log likelihood (of data points in batch [46000:46100]) = -0.39752726
Iteration 20000: Average log likelihood (of data points in batch [44300:44400]) = -0.50001293
Iteration 30000: Average log likelihood (of data points in batch [42600:42700]) = -0.44909961
Iteration 40000: Average log likelihood (of data points in batch [40900:41000]) = -0.41075257
Iteration 50000: Average log likelihood (of data points in batch [39200:39300]) = -0.47957450
Iteration 60000: Average log likelihood (of data points in batch [37500:37600]) = -0.42584682
Iteration 70000: Average log likelihood (of data points in batch [35800:35900]) = -0.37312738
Iteration 80000: Average log likelihood (of data points in batch [34100:34200]) = -0.41330111
Iteration 90000: Average log likelihood (of data points in batch [32400:32500]) = -0.47600432
Iteration 95399: Average log likelihood (of data points in batch [47600:47700]) = -0.47449630


We compare the convergence of stochastic gradient ascent and batch gradient ascent in the following cell. Note that we apply smoothing with smoothing_window=30.

As can be seen from the figure above, the batch gradient ascent needs at least 150 passes to achieve a similar log likelihood as stochastic gradient ascent.

## Explore the effects of step sizes (learning rate) on stochastic gradient ascent

To start, let’s explore a wide range of step sizes that are equally spaced in the log space and run stochastic gradient ascent with step_size set to 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1, and 1e2, using the following set of parameters:

• initial_coefficients=zeros
• batch_size=100
• max_iter initialized so as to run 10 passes over the data.
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69313572
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69313813
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69312970
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69313664
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69312912
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69312352
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69312565
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69312596
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69312480
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69311820
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69312342
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69309997
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69310417
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69311459
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69311289
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69310315
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69299666
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.69262795
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.69259227
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.69216304
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.69184517
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.69233727
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.69184444
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.69162156
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.69137017
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.69116453
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.68868229
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.68748389
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.68381866
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.68213944
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69303256
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69305660
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69297245
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69304176
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69296671
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69291077
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69293204
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69293505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69292337
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69285782
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69290957
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.69267562
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.69271772
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.69282167
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.69280442
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.69270726
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.69163566
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.68802729
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.68772431
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.68369142
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.68064510
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.68541475
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.68090549
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.67879020
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.67693059
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.67539881
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.65759465
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.64745516
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.62162582
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.61371736
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.69200364
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.69223670
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.69141056
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.69209296
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.69135181
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.69080412
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.69100987
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.69103436
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.69091067
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.69029154
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.69076626
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.68848541
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.68891938
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.68992883
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.68973094
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.68878712
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.67788829
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.64832833
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.64792583
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.62345602
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.60389801
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.63841711
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.61096879
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.59948158
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.59326446
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.59519901
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.54578301
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.51997970
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46497627
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.46731743
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.68197844
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.68360557
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.67672535
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.68262376
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.67601418
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.67149018
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.67302292
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67288246
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.67104021
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.66754591
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.66946221
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.65083970
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.65625382
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.66398221
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.66083602
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.65357831
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.59260801
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.50083166
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.50714802
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.49769606
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.45111548
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.53578732
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.48576831
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.48193699
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.43452058
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.49750696
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.46582637
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.43007567
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.38589807
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.41823078
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.60671913
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -0.61435096
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -0.57582992
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -0.60419455
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -0.56461895
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -0.55369469
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.56188804
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.56098460
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.53659402
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -0.54855532
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -0.54643770
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.45436554
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -0.51347209
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.53114501
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.51323915
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.50234155
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.44799354
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.38955785
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.41840095
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.42081935
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.35694652
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.44873903
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -0.42411231
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.47364810
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.36346510
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.47974711
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.45019123
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -0.39097701
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.34895703
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.40687203
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -0.75339506
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -5.19955914
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -1.35343001
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -3.63980553
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -1.05854033
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -1.11538249
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -0.86603585
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -0.67571232
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -0.83674532
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -1.07638709
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -1.20809203
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -0.90955296
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -1.58077817
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -0.77787311
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -0.62852240
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -0.70284036
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -0.62403867
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -0.30394690
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -0.56782701
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -0.48147752
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -0.43850709
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -0.43008741
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -1.11804684
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -0.53574169
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -0.31389670
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -0.61323575
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -0.76125665
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.02808956
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -0.46513784
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -0.47187763
Iteration    0: Average log likelihood (of data points in batch [00000:00100]) = -5.69805760
Iteration    1: Average log likelihood (of data points in batch [00100:00200]) = -61.13545979
Iteration    2: Average log likelihood (of data points in batch [00200:00300]) = -7.42566120
Iteration    3: Average log likelihood (of data points in batch [00300:00400]) = -27.73988196
Iteration    4: Average log likelihood (of data points in batch [00400:00500]) = -15.42138334
Iteration    5: Average log likelihood (of data points in batch [00500:00600]) = -11.20038218
Iteration    6: Average log likelihood (of data points in batch [00600:00700]) = -11.11461119
Iteration    7: Average log likelihood (of data points in batch [00700:00800]) = -8.49925505
Iteration    8: Average log likelihood (of data points in batch [00800:00900]) = -16.55690529
Iteration    9: Average log likelihood (of data points in batch [00900:01000]) = -16.98581355
Iteration   10: Average log likelihood (of data points in batch [01000:01100]) = -13.56402260
Iteration   11: Average log likelihood (of data points in batch [01100:01200]) = -6.25429003
Iteration   12: Average log likelihood (of data points in batch [01200:01300]) = -16.59818372
Iteration   13: Average log likelihood (of data points in batch [01300:01400]) = -8.45630034
Iteration   14: Average log likelihood (of data points in batch [01400:01500]) = -4.98292077
Iteration   15: Average log likelihood (of data points in batch [01500:01600]) = -7.47891216
Iteration  100: Average log likelihood (of data points in batch [10000:10100]) = -6.88355879
Iteration  200: Average log likelihood (of data points in batch [20000:20100]) = -3.34058488
Iteration  300: Average log likelihood (of data points in batch [30000:30100]) = -3.83117603
Iteration  400: Average log likelihood (of data points in batch [40000:40100]) = -6.29604810
Iteration  500: Average log likelihood (of data points in batch [02300:02400]) = -2.99625091
Iteration  600: Average log likelihood (of data points in batch [12300:12400]) = -2.31125647
Iteration  700: Average log likelihood (of data points in batch [22300:22400]) = -8.19978746
Iteration  800: Average log likelihood (of data points in batch [32300:32400]) = -3.76650208
Iteration  900: Average log likelihood (of data points in batch [42300:42400]) = -4.06151269
Iteration 1000: Average log likelihood (of data points in batch [04600:04700]) = -12.66107559
Iteration 2000: Average log likelihood (of data points in batch [09200:09300]) = -13.33445580
Iteration 3000: Average log likelihood (of data points in batch [13800:13900]) = -1.63544030
Iteration 4000: Average log likelihood (of data points in batch [18400:18500]) = -3.55951973
Iteration 4769: Average log likelihood (of data points in batch [47600:47700]) = -3.41717551


### Plotting the log likelihood as a function of passes for each step size

Now, let’s plot the change in log likelihood for each of the following values of step_size:

• step_size = 1e-4
• step_size = 1e-3
• step_size = 1e-2
• step_size = 1e-1
• step_size = 1e0
• step_size = 1e1
• step_size = 1e2

For consistency, we again apply smoothing_window=30.

Now, let us remove the step size step_size = 1e2 and plot the rest of the curves.

As can be seen from the above plots, the step size 1e2 gives the worst result and step size 1 and 0.1 gives the best results in terms of convergence.

# Sentiment Analysis on the Large Movie Review Dataset using Linear Model Classifier with Hinge-loss and L1 Penalty with Language Model Features and Stochastic Gradient Descent in Python

This problem appeared as a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI). The following description of the problem is taken directly from the project description.

In this assignment, an active research area in Natural Language Processing (NLP), sentiment analysis will be touched on. Given the exponential growth of online review data (Amazon, IMDB and etc), sentiment analysis becomes increasingly important. In this assignment, the task will be to build a sentiment classifier, i.e., evaluating a piece of text being either positive or negative.

The “Large Movie Review Dataset“(*) shall be used for this project. The dataset is compiled from a collection of 50,000 reviews from IMDB on the condition there are no more than 30 reviews each movie. Number of positive and negative reviews are equal. Negative reviews have scores lesser or equal 4 out of 10 while a positive review greater or equal 7 out of 10. Neutral reviews are not included on the other hand. Then, 50,000 reviews are divided evenly into the training and test set.

*Dataset is credited to Prof. Andrew Mass in the paper, Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

## Instruction

In this project, a Stochastic Gradient Descent Classifier will be trained. While gradient descent is powerful, it can be prohibitively expensive when the dataset is extremely large because every single data point needs to be processed.

However, it turns out when the data is large, rather than the entire dataset, SGD algorithm performs just as good with a small random subset of the original data. This is the central idea of Stochastic SGD and particarly handy for the text data since corpus are often humongous.

## Data Preprocessing

The first task is explore the training data and create one single training data file combining the positive and negative labeled texts. The column “polarity” for each movie-review text consists of sentiment labels, 1 for positive and 0 for negative. In addition, common English stopwords should be removed.

The following table shows the first few rows of the training dataset. The training dataset contains 25000 movie reviews.

## Unigram Data Representation

The very first step in solving any NLP problem is finding a way to represent the text data so that machines can understand. A common approach is using a document-term vector where each document is encoded as a discrete vector that counts occurrences of each word in the vocabulary it contains. For example, consider two one-sentence documents:

• d1: “I love Columbia Artificial Intelligence course”.
• d2: “Artificial Intelligence is awesome”.

The vocabulary V = {artificial, awesome, Columbia, course, I, intelligence, is, love} and two documents can be encoded as v1 and v2 as follows:

This data representation is also called a unigram model.

Now, the next task is to transform the text column in the training dataset into a term-document matrix using uni-gram model. A few rows and columns of this transformed dataset with unigram features (~75k features) are shown as shown below. As can be noticed, stemming is not used.

As can be seen from the above table, the unigram feature matrix is extremely sparse (it took 1.6G space to store the first 10k rows as csv and after compressing the zipped file size became only ~4.5MB, around 99.5% sparse) and the following density plot shows density of the average number of occurences of all the unigram features (after discrading the top 15 freatures with the highest average number of occurences) and density is concentrated < 0.5, which means for the first 10k text reviews almost all the unigram features have average value of occurence < 0.5.

Next, we need to train a Stochastic Descend Gradient (SGD) classifier whose loss=“hinge” and penalty=“l1” on this transformed training dataset.

On the other hand, a test dataset is provided which serves as the benchmark file for the performance of the trained classifier. Next task is to use the trained SGD to predict the sentiments of the text in the test dataset, after converting it to the corresponding unigram represenetation. the trained SGD classifier to predict this information.

The test dataset too has 25000 text reviews and sentiment for each of them needs to be predicted. Here are the prediction counts for positive and negative sentiments predicted with the unigram model.

## Bigram Representation

A more sophisticated data representation model is the bigram model where occurrences depend on a sequence of two words rather than an individual one. Taking the same example like before, v1 and v2 are now encoded as follows:

Instead of enumerating every individual words, bigram counts the number of instance a word following after another one. In both d1 and d2 “intelligence” follows “artificial” so v1(intelligence | artificial) = v2(intelligence | artificial) = 1. In contrast, “artificial” does not follow “awesome” so v1(artificial | awesome) = v2(artificial | awesome) = 0.

The same exercise from Unigram is to be repeated for the Bigram Model Data Representation and the corresponding test prediction file needs to be produced. A few rows and columns of this transformed dataset with bigram features (~175k total bigram features) are shown  below.

## Tf-idf

Sometimes, a very high word counting may not be meaningful. For example, a common word like “say” may appear 10 times more frequent than a less-common word such as “machine” but it does not mean “say” is 10 times more relevant to our sentiment classifier. To alleviate this issue, we can instead use term frequency tf[t] = 1 + log(f[t,d]) where f[t,d] is the count of term t in document d. The log function dampens the unwanted influence of common English words.

Inverse document frequency (idf) is a similar concept. To take an example, it is likely that all of our training documents belong to a same category which has specific jargons. For example, Computer Science documents often have words such as computers, CPU, programming and etc appearing over and over. While they are not common English words, because of the document domain, their occurrences are very high. To rectify, we can adjust using inverse term frequency idf[t] = log( N / df[t] ) where df[t] is the number of documents containing the term t and N is the total number of document in the dataset.

Therefore, instead of just word frequency, tf-idf for each term t can be used, tf-idf[t] = tf[t] ∗idf[t].

The same exercise needs tp be repeated as in the Unigram and Bigram data model but tf-idf needs to be applied this time to produce test prediction files.

A few rows and columns of this transformed dataset with tf-idf unigram features (~75k unigram tf-idf features) are shown below.

A few rows and columns of this transformed dataset with tf-idf bigram features (~175k bigram tifidf features) are shown below.

The next figure shows how the SGD for different models converges with epochs. As can be seen, till 100 epochs, the unigram models cost function (hinge loss) decrease with a much faster rate than the bigram models.