Sentiment Analysis on the Large Movie Review Dataset using a Linear Classifier with Hinge Loss, L1 Penalty, and Language Model Features, Trained with Stochastic Gradient Descent in Python

This problem appeared as a project in the edX course ColumbiaX: CSMM.101x Artificial Intelligence (AI). The following description of the problem is taken directly from the project description.

This assignment touches on sentiment analysis, an active research area in Natural Language Processing (NLP). Given the exponential growth of online review data (Amazon, IMDB, etc.), sentiment analysis has become increasingly important. The task here is to build a sentiment classifier, i.e., a model that evaluates whether a piece of text is positive or negative.

The “Large Movie Review Dataset”(*) is used for this project. The dataset is compiled from a collection of 50,000 IMDB reviews, with no more than 30 reviews per movie. The numbers of positive and negative reviews are equal. Negative reviews have scores of 4 or less out of 10, while positive reviews have scores of 7 or more out of 10; neutral reviews are not included. The 50,000 reviews are divided evenly into a training set and a test set.

*The dataset is credited to Prof. Andrew Maas and described in the paper: Andrew L. Maas, Raymond E. Daly, Peter T. Pham, Dan Huang, Andrew Y. Ng, and Christopher Potts. (2011). Learning Word Vectors for Sentiment Analysis. The 49th Annual Meeting of the Association for Computational Linguistics (ACL 2011).

Instruction

In this project, a Stochastic Gradient Descent (SGD) classifier will be trained. While batch gradient descent is powerful, it can be prohibitively expensive when the dataset is extremely large, because every single data point needs to be processed at every step.

However, it turns out that when the dataset is large, the algorithm performs just as well using a small random subset of the data at each step rather than the entire dataset. This is the central idea of Stochastic Gradient Descent, and it is particularly handy for text data, since corpora are often huge.

Data Preprocessing

The first task is to explore the training data and create a single training data file combining the positive and negative labeled texts. The column “polarity” holds the sentiment label for each movie-review text, 1 for positive and 0 for negative. In addition, common English stopwords should be removed.
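A minimal preprocessing sketch, under a few assumptions: the archive is extracted to an aclImdb/train directory with pos/ and neg/ sub-folders of .txt files (the dataset's usual layout), scikit-learn's built-in English stopword list is used, and the output file name imdb_tr.csv is just an illustrative choice.

```python
import glob
import os

import pandas as pd
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

# Hypothetical path -- adjust to wherever the aclImdb archive was extracted.
TRAIN_DIR = "aclImdb/train"

def load_reviews(folder, polarity):
    """Read every review text file in `folder` and attach the given polarity label."""
    rows = []
    for path in glob.glob(os.path.join(folder, "*.txt")):
        with open(path, encoding="utf-8") as f:
            rows.append({"text": f.read(), "polarity": polarity})
    return rows

def remove_stopwords(text):
    """Lower-case the review and drop common English stopwords."""
    return " ".join(w for w in text.lower().split() if w not in ENGLISH_STOP_WORDS)

# Combine positive (label 1) and negative (label 0) reviews into one DataFrame.
train_df = pd.DataFrame(
    load_reviews(os.path.join(TRAIN_DIR, "pos"), 1)
    + load_reviews(os.path.join(TRAIN_DIR, "neg"), 0)
)
train_df["text"] = train_df["text"].apply(remove_stopwords)
train_df.to_csv("imdb_tr.csv", index=False)   # illustrative output name
```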

The following table shows the first few rows of the training dataset. The training dataset contains 25000 movie reviews.

im3.png

Unigram Data Representation

The very first step in solving any NLP problem is finding a way to represent the text data so that machines can understand it. A common approach is the document-term vector, where each document is encoded as a discrete vector counting the occurrences of each vocabulary word it contains. For example, consider two one-sentence documents:

  • d1: “I love Columbia Artificial Intelligence course”.
  • d2: “Artificial Intelligence is awesome”.

The vocabulary is V = {artificial, awesome, Columbia, course, I, intelligence, is, love}, and the two documents can be encoded as vectors v1 and v2 as follows:

im1.png

This data representation is also called a unigram model.
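To make the toy example concrete, here is a minimal sketch (not part of the original assignment) using scikit-learn's CountVectorizer on the two documents. Note that the default tokenizer drops one-letter tokens such as “I”, so the learned vocabulary differs slightly from the hand-built V above.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love Columbia Artificial Intelligence course",
        "Artificial Intelligence is awesome"]

vectorizer = CountVectorizer()        # unigram counts by default
X = vectorizer.fit_transform(docs)    # sparse 2 x |V| count matrix

print(vectorizer.get_feature_names_out())  # the learned vocabulary
print(X.toarray())                         # rows correspond to v1 and v2
```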

Now, the next task is to transform the text column of the training dataset into a term-document matrix using the unigram model. A few rows and columns of this transformed dataset with unigram features (~75k features) are shown below. As can be noticed, stemming is not used.

im4.png

As can be seen from the above table, the unigram feature matrix is extremely sparse: it took 1.6 GB of space to store the first 10k rows as a CSV file, while the zipped file was only ~4.5 MB, i.e., the matrix is around 99.5% sparse. The following density plot shows the distribution of the average number of occurrences of the unigram features (after discarding the top 15 features with the highest average number of occurrences). The density is concentrated below 0.5, which means that for the first 10k text reviews almost all the unigram features have an average occurrence value < 0.5.

uni.png
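A minimal sketch of how such a unigram term-document matrix could be built, assuming the combined training data from the preprocessing step lives in a DataFrame train_df with a text column:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Assuming train_df is the combined training DataFrame built earlier,
# with the cleaned review text in a column named "text".
unigram_vectorizer = CountVectorizer()                          # unigram counts
X_train_uni = unigram_vectorizer.fit_transform(train_df["text"])
y_train = train_df["polarity"].values

# The result is a scipy sparse matrix: ~25,000 rows x ~75k unigram columns,
# which is why storing it densely (e.g. as CSV) is so wasteful.
print(X_train_uni.shape)
```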

Next, we need to train a Stochastic Gradient Descent (SGD) classifier with loss=“hinge” and penalty=“l1” on this transformed training dataset.
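Using scikit-learn, this could look as follows (a sketch; X_train_uni and y_train come from the previous snippet, and all hyperparameters other than the required loss and penalty are left at their defaults):

```python
from sklearn.linear_model import SGDClassifier

# Linear classifier trained with SGD, hinge loss and an L1 penalty.
sgd_uni = SGDClassifier(loss="hinge", penalty="l1")
sgd_uni.fit(X_train_uni, y_train)
```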

A test dataset is also provided, which serves as the benchmark for the performance of the trained classifier. The next task is to convert the test texts to the corresponding unigram representation and use the trained SGD classifier to predict their sentiments.
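A sketch of that step, assuming the test reviews have been loaded and stopword-filtered into a DataFrame test_df just like the training data, and that the output file name unigram.output.txt is only an illustrative choice:

```python
import numpy as np

# The *fitted* training vectorizer must be reused so the test matrix has the
# same ~75k columns as the training matrix.
X_test_uni = unigram_vectorizer.transform(test_df["text"])
pred_uni = sgd_uni.predict(X_test_uni)

# Prediction counts for negative (0) and positive (1) sentiment.
values, counts = np.unique(pred_uni, return_counts=True)
print(dict(zip(values, counts)))

# One predicted label per line (illustrative output file name).
np.savetxt("unigram.output.txt", pred_uni, fmt="%d")
```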

The test dataset also has 25,000 text reviews, and the sentiment of each of them needs to be predicted. Here are the prediction counts for positive and negative sentiments obtained with the unigram model.

puni.png

Bigram Representation

A more sophisticated data representation is the bigram model, where occurrences depend on a sequence of two words rather than an individual word. Taking the same example as before, v1 and v2 are now encoded as follows:

im2.png

Instead of enumerating individual words, the bigram model counts the number of times one word follows another. In both d1 and d2, “intelligence” follows “artificial”, so v1(intelligence | artificial) = v2(intelligence | artificial) = 1. In contrast, “artificial” does not follow “awesome”, so v1(artificial | awesome) = v2(artificial | awesome) = 0.

The same exercise as in the unigram case is to be repeated for the bigram data representation, and the corresponding test prediction file needs to be produced. A few rows and columns of this transformed dataset with bigram features (~175k total bigram features) are shown below, followed by the prediction counts for the bigram model.

im5.png

pbi.png
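A sketch of the bigram variant of the same pipeline; only the ngram_range argument changes (ngram_range=(2, 2) keeps pure bigrams, while (1, 2) would also keep unigrams). Variable names carry over from the earlier sketches.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier

# Same pipeline as before, now with two-word features.
bigram_vectorizer = CountVectorizer(ngram_range=(2, 2))
X_train_bi = bigram_vectorizer.fit_transform(train_df["text"])

sgd_bi = SGDClassifier(loss="hinge", penalty="l1")
sgd_bi.fit(X_train_bi, y_train)

pred_bi = sgd_bi.predict(bigram_vectorizer.transform(test_df["text"]))
```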

Tf-idf

Sometimes a very high raw word count may not be meaningful. For example, a common word like “say” may appear 10 times more frequently than a less common word such as “machine”, but that does not mean “say” is 10 times more relevant to our sentiment classifier. To alleviate this issue, we can instead use the term frequency tf[t] = 1 + log(f[t,d]), where f[t,d] is the count of term t in document d. The log function dampens the unwanted influence of common English words.

Inverse document frequency (idf) is a similar concept. For example, it is likely that all of our training documents belong to the same domain, which has its own jargon. Computer Science documents, for instance, often have words such as computer, CPU, and programming appearing over and over. While these are not common English words, their occurrences are very high because of the document domain. To rectify this, we can adjust the weights using the inverse document frequency idf[t] = log(N / df[t]), where df[t] is the number of documents containing the term t and N is the total number of documents in the dataset.

Therefore, instead of just the word frequency, the tf-idf score of each term t can be used: tf-idf[t] = tf[t] ∗ idf[t].
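A tiny worked example of these formulas, assuming natural logarithms and hypothetical counts (“say” occurring 10 times in a review and in 900 of 1,000 documents, “machine” once and in 50 documents):

```python
from math import log

N = 1000  # hypothetical total number of documents

def tf(count):          # tf[t] = 1 + log(f[t, d])
    return 1 + log(count)

def idf(doc_freq):      # idf[t] = log(N / df[t])
    return log(N / doc_freq)

tfidf_say = tf(10) * idf(900)      # ~3.30 * 0.105 ~= 0.35
tfidf_machine = tf(1) * idf(50)    # ~1.00 * 3.00  ~= 3.00
# Despite being 10x more frequent, "say" ends up with a much lower weight.
```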

The same exercise needs to be repeated as in the unigram and bigram data models, but with tf-idf applied this time, to produce the test prediction files.
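One way to do this is with scikit-learn's TfidfVectorizer (a sketch; sublinear_tf=True uses the 1 + log(tf) term frequency above, while smoothing and L2 normalization are left at the library defaults, so the weights differ slightly from the plain tf ∗ idf formula):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier

# Unigram tf-idf features; for the bigram tf-idf run, the same code with
# ngram_range=(2, 2) applies.
tfidf_uni = TfidfVectorizer(ngram_range=(1, 1), sublinear_tf=True)
X_train_tfidf = tfidf_uni.fit_transform(train_df["text"])

sgd_tfidf = SGDClassifier(loss="hinge", penalty="l1")
sgd_tfidf.fit(X_train_tfidf, y_train)

pred_tfidf = sgd_tfidf.predict(tfidf_uni.transform(test_df["text"]))
```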

A few rows and columns of this transformed dataset with tf-idf unigram features (~75k unigram tf-idf features) are shown below.

im6.png

punit.png

A few rows and columns of this transformed dataset with tf-idf bigram features (~175k bigram tf-idf features) are shown below.

im7.png

pbit.png

The next figure shows how the SGD classifier converges with epochs for the different models. As can be seen, up to 100 epochs, the unigram models' cost function (hinge loss) decreases at a much faster rate than that of the bigram models.

sgd.png
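One way such a convergence curve could be produced (a sketch, not the original plotting code): run the classifier one epoch at a time with partial_fit and record the training hinge loss after each pass. Variable names X_train_uni and y_train come from the earlier sketches.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import hinge_loss

clf = SGDClassifier(loss="hinge", penalty="l1")
classes = np.unique(y_train)

losses = []
for epoch in range(100):
    clf.partial_fit(X_train_uni, y_train, classes=classes)  # one pass over the data
    scores = clf.decision_function(X_train_uni)             # signed margins
    losses.append(hinge_loss(y_train, scores))              # training hinge loss
```

Plotting the recorded losses against the epoch index for each feature set gives a comparison like the figure above.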