Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (a binary classifier) is more commonly built by solving a quadratic programming problem in the dual space, it can also be built quickly by solving the primal optimization problem. In this article a Support Vector Machine implementation is described that solves the primal optimization problem with a sub-gradient solver using stochastic gradient descent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al. in their original paper.

  • First the vanilla version and then the kernelized version of the Pegasos algorithm will be described, along with applications on some datasets.
  • Next the hinge-loss function for the SVM will be replaced by the log-loss function for Logistic Regression, converting the primal SVM problem into regularized logistic regression.
  • Finally, document sentiment classification will be done by first training a Perceptron, an SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing them on an unseen part of the corpus.
  • The time to train the classifiers, along with the accuracy obtained on a held-out dataset, will be computed.

Most of the above problems appeared as an assignment in this course. The descriptions of the problems are sometimes taken from the assignment itself.

1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITM, these slides from CMU, and also shown in the next figure taken from the slides, the Soft-SVM primal Lagrangian can be represented as follows:

$$\min_{w,\, b}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \max\{0,\; 1 - y_i(\langle w, x_i\rangle + b)\}$$

or as the following if the explicit bias term is discarded:

$$\min_{w}\;\; \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m} \max\{0,\; 1 - y_i\langle w, x_i\rangle\}$$

where the 0-1 loss is approximated by the hinge-loss.

$$\ell_{\text{hinge}}(w; (x, y)) = \max\{0,\; 1 - y\langle w, x\rangle\}$$
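Computationally, the λ-regularized form of this objective used by Pegasos (introduced in the list below) is easy to evaluate directly; a minimal numpy sketch, with the function name and dense-array layout chosen for illustration:

import numpy

def primal_objective(w, X, Y, lambda1):
    # (lambda/2)*||w||^2 plus the average hinge-loss over the m examples
    margins = Y * X.dot(w)                    # y_i * <w, x_i> for every example
    hinge = numpy.maximum(0., 1. - margins)   # per-example hinge-loss
    return 0.5 * lambda1 * w.dot(w) + hinge.mean()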

  • Changing the regularization constant to λ, the objective can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
  • The next figure also describes the Pegasos algorithm, which performs SGD on the primal objective (Lagrangian) with carefully chosen step sizes.
  • Since the hinge-loss is not differentiable everywhere, a sub-gradient of the objective is used in place of the gradient for each single-point update step of SGD.
  • The learning rate η is gradually decreased over the iterations.

[Figure (from the Pegasos paper): the objective $\min_{w} \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m}\max\{0,\, 1 - y_i\langle w, x_i\rangle\}$ and the pseudocode of the Pegasos algorithm]

The following figures show a simplified version of the algorithm:

[Figures: simplified step-by-step pseudocode of the Pegasos algorithm]

The following Python code implements the algorithm:


class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD using the PEGASOS algorithm."""

    def __init__(self, n_iter=10, lambda1=1):
        self.n_iter = n_iter      # number of single-point SGD update steps
        self.lambda1 = lambda1    # regularization constant lambda

    def fit(self, X, Y):
        Y = list(Y)
        self.find_classes(Y)
        # convert all output values to +1 or -1
        Yn = [sign(y, self.positive_class) for y in Y]
        X = X.toarray()
        m, n_features = X.shape
        self.w = numpy.zeros(n_features)
        for i in range(self.n_iter):
            # decreasing step length eta = 1 / (lambda * t), with t = i + 1
            eta = 1. / (self.lambda1 * (i + 1))
            # pick a single training example uniformly at random
            j = numpy.random.choice(m)
            x, y = X[j], Yn[j]
            score = self.w.dot(x)
            if y * score < 1:
                # margin violated: sub-gradient step on hinge-loss + regularizer
                self.w = (1 - eta * self.lambda1) * self.w + eta * y * x
            else:
                # margin satisfied: only the regularization term shrinks w
                self.w = (1 - eta * self.lambda1) * self.w
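The class relies on two pieces of scaffolding from the assignment code that are not shown in this post: the LinearClassifier base class (with its find_classes method) and the sign helper. A minimal stand-in, sufficient to run the class on a scipy sparse feature matrix (the assignment's real versions may differ), could be:

import numpy

class LinearClassifier:
    # minimal base class: remember the two class labels, predict with the sign of w.x
    def find_classes(self, Y):
        classes = sorted(set(Y))
        self.positive_class, self.negative_class = classes[1], classes[0]

    def predict(self, X):
        scores = X.toarray().dot(self.w)
        return [self.positive_class if s >= 0 else self.negative_class
                for s in scores]

def sign(y, positive_class):
    # encode a label as +1 (the positive class) or -1 (anything else)
    return 1 if y == positive_class else -1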

Some Notes

  • The optional projection step has been left out (the line in square brackets in the paper).
  • As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
  • The number η is the step length in gradient descent.
  • The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
  • Since the step length is computed by dividing by t, it gradually becomes smaller and smaller. The purpose of this is to avoid "bouncing around" the optimum as the iterate gets close to it.
  • Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a (non-regularized) bias term too; the fastest is probably a binary search on a real interval after the PEGASOS algorithm returns an optimal w.
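A sketch of that idea, here implemented as a ternary search (a close cousin of the binary search the paper suggests), assuming numpy arrays X, Y and the fixed w returned by Pegasos; the interval bounds and tolerance are illustrative:

import numpy

def fit_bias(w, X, Y, lo=-10., hi=10., tol=1e-4):
    # average hinge-loss as a function of b, with w kept fixed
    def loss(b):
        return numpy.mean(numpy.maximum(0., 1. - Y * (X.dot(w) + b)))
    # the loss is convex in b, so a ternary search converges to a minimizer
    while hi - lo > tol:
        m1, m2 = lo + (hi - lo) / 3., hi - (hi - lo) / 3.
        if loss(m1) < loss(m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.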

Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

[Figure: the linearly separable dataset with overlapping noise]
  • The following animation and figure show how the decision boundary changes with single-point SGD update-steps and the final decision surface, respectively, for the PEGASOS implementation of the primal SVM classifier.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the training phase with the PEGASOS SVM implementation ran very fast, taking only 24 milliseconds.
  • Classification accuracy obtained on the test split was 95%.
[Animation: decision boundary evolving over the single-point SGD updates]

[Figure: final decision surface of the PEGASOS SVM on the overlap dataset]

2. Kernelized PEGASOS

The next figure, taken from the same paper, shows how the algorithm can be adapted to kernel-SVM.

[Figure (from the Pegasos paper): the kernelized Pegasos algorithm]
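A minimal sketch of the kernelized variant, following the figure above; the function names, the Gaussian-kernel bandwidth sigma and the dense-numpy layout are illustrative choices, not from the paper. Instead of maintaining w explicitly, the algorithm counts, in α, how often each training example triggered an update:

import numpy

def gaussian_kernel_matrix(X, sigma=1.0):
    # pairwise Gaussian (RBF) kernel: K[i, j] = exp(-||x_i - x_j||^2 / (2*sigma^2))
    sq = numpy.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2. * X.dot(X.T)
    return numpy.exp(-d2 / (2. * sigma**2))

def kernelized_pegasos(X, Y, lambda1=0.01, n_iter=1000, sigma=1.0):
    m = X.shape[0]
    alpha = numpy.zeros(m)    # alpha[j]: how many times example j forced an update
    K = gaussian_kernel_matrix(X, sigma)
    for t in range(1, n_iter + 1):
        i = numpy.random.choice(m)
        # decision value of x_i under the implicit weights (1/(lambda*t)) * sum_j alpha[j]*y_j*k(x_j, .)
        score = numpy.sum(alpha * Y * K[i]) / (lambda1 * t)
        if Y[i] * score < 1:  # margin violated: remember this example
            alpha[i] += 1
    return alpha

def kernel_predict(alpha, X, Y, x_new, sigma=1.0):
    # sign of the kernel expansion at a new point (positive scaling does not change the sign)
    k = numpy.exp(-numpy.sum((X - x_new)**2, axis=1) / (2. * sigma**2))
    return numpy.sign(numpy.sum(alpha * Y * k))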

Using the PEGASOS implementation to classify the following linearly non-separable datasets

Dataset 1: Flame

[Figure: the Flame dataset]

  • The following animation and figure show how the decision boundary changes with single-point SGD update-steps and the final decision surface, respectively, for the kernelized PEGASOS SVM classifier on this linearly non-separable dataset.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 240 rows only) and the training phase with the kernelized PEGASOS SVM implementation (a Gaussian kernel was used) ran very fast, taking only 2.09 seconds.
  • Classification accuracy obtained on the test split was 95.8%.
[Animation: decision boundary evolving over the SGD updates on the Flame dataset]

[Figure: final decision surface of the kernelized PEGASOS SVM on the Flame dataset]

Dataset 2: Jain

[Figure: the Jain dataset]
  • The following animation and figure show how the decision boundary changes with single-point SGD update-steps and the final decision surface, respectively, for the kernelized PEGASOS SVM classifier on this dataset.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 373 rows only) and the training phase with the kernelized PEGASOS SVM implementation (a Gaussian kernel was used) ran very fast, taking only 5.1 seconds.
  • Classification accuracy obtained on the test split was 100%.
[Animation: decision boundary evolving over the SGD updates on the Jain dataset]

[Figure: final decision surface of the kernelized PEGASOS SVM on the Jain dataset]

3. From SVM to Logistic Regression

The following figures show how, by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.

[Figures: deriving regularized logistic regression from the primal SVM objective by swapping the hinge-loss for the log-loss]

$$\ell_{\text{log}}(w; (x, y)) = \log\left(1 + e^{-y\langle w, x\rangle}\right)$$

$$\min_{w}\;\; \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \log\left(1 + e^{-y_i\langle w, x_i\rangle}\right)$$
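In code the change is tiny: only the update step inside SVMPegasos.fit changes, since the log-loss is differentiable everywhere and its gradient replaces the hinge sub-gradient. A minimal sketch of the modified loop body, reusing the variable names from the class above:

# inside the SGD loop of fit(): log-loss gradient step instead of the hinge sub-gradient
score = self.w.dot(x)
# d/dw log(1 + exp(-y*score)) = -y*x / (1 + exp(y*score)), hence the update:
self.w = (1 - eta*self.lambda1)*self.w + eta*y*x / (1. + numpy.exp(y*score))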

Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

  • The following animation and figure show how the decision boundary changes with single-point SGD update-steps and the final decision surface, respectively, for the PEGASOS implementation of the Logistic Regression classifier.
  • As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
  • The dataset was small (with 200 rows only) and the training phase with the PEGASOS LR implementation ran very fast, taking only 27 milliseconds.
  • Classification accuracy obtained on the test split was 97.5%.
[Animation: decision boundary evolving over the SGD updates with the LR classifier]

[Figure: final decision surface of the PEGASOS Logistic Regression on the overlap dataset]

4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let's fit a Perceptron model along with SVM / LR models trained with the PEGASOS algorithm on an 80% sample, predict the sentiments of the 20% held-out texts, and then compute the accuracy. The next figures show the first few rows of the corpus and the number of positive and negative sentiment examples.

[Figures: the first few rows of the corpus, and the counts of positive and negative examples]

The next figure shows the distribution of text lengths (in words) for the different sentiments.

[Figure: distribution of text lengths by sentiment]
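Such a plot can be produced in a few lines, assuming the corpus has been loaded into a pandas DataFrame df with a tokenized 'text' column and a 'sentiment' column (both column names are illustrative, not from the assignment):

import matplotlib.pyplot as plt

# document length in words, then one histogram per sentiment class
df['n_words'] = df['text'].apply(len)
df['n_words'].hist(by=df['sentiment'], bins=50)
plt.show()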

Steps

  • Create an sklearn Pipeline to
    • compute simple BOW features (CountVectorizer from sklearn): 51063 columns were obtained with BOW,
    • use sklearn SelectKBest to choose the top 5000 features,
    • train / test the model.
  • Train the classifier (Perceptron or SVM / LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on the held-out 20% of the dataset and compute the accuracy by comparing with the ground truths.
  • Also record the time taken to train the classifier.
  • Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following Python code shows how the text-classification pipeline is implemented with SVMPegasos.


def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
                                                    random_state=0)
    # the documents are pre-tokenized, so identity preprocessor / tokenizer are used
    classifier = Pipeline([
        ('vec', CountVectorizer(preprocessor=lambda x: x, tokenizer=lambda x: x)),
        ('fs', SelectKBest(k=5000)),
        ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))])
    t0 = time.time()
    classifier.fit(Xtrain, Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))
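The read_corpus helper comes from the assignment's scaffolding and is not shown here; a minimal stand-in, assuming each line of the file holds a sentiment label followed by the tokenized document (the actual file layout in the assignment may include extra fields), could be:

def read_corpus(filename):
    X, Y = [], []
    with open(filename, encoding='utf-8') as f:
        for line in f:
            tokens = line.split()
            Y.append(tokens[0])    # assumed: label first on each line
            X.append(tokens[1:])   # assumed: the rest is the tokenized text
    return X, Y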


Some Notes

  • The accuracy of the classifiers can be improved by applying more text-mining techniques, such as better pre-processing and language-model / tf-idf features.
  • This experiment was just to test / compare the SVM / LR PEGASOS implementations with the Perceptron.
  • The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., the regularization parameter λ for PEGASOS LR / SVM, the number of iterations to train, etc.).
  • For the Perceptron, the entire training data was passed 10 times through the model while training.
  • For the PEGASOS SGD implementations of regularized SVM and logistic regression, the number of iterations used was 10 × the size of the training dataset. Also, the λ value used was 0.01.
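For reference, a sketch of the Perceptron used in the comparison, written in the same style as SVMPegasos above (the class body is an assumed reconstruction, not the assignment's exact code):

class Perceptron(LinearClassifier):
    """Classic Perceptron: sweep the training data n_iter times, updating on mistakes."""

    def __init__(self, n_iter=10):
        self.n_iter = n_iter

    def fit(self, X, Y):
        Y = list(Y)
        self.find_classes(Y)
        Yn = [sign(y, self.positive_class) for y in Y]
        X = X.toarray()
        self.w = numpy.zeros(X.shape[1])
        for _ in range(self.n_iter):
            for x, y in zip(X, Yn):
                if y * self.w.dot(x) <= 0:   # misclassified (or on the boundary): update
                    self.w = self.w + y * x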

Results

As can be seen, even with very simple features, all the classifiers perform well: their prediction accuracies on the test set and the times taken to train the models are good and comparable to each other.

[Figures: test accuracy and training-time results for the Perceptron, PEGASOS SVM and PEGASOS LR classifiers]
