Although a support vector machine (SVM) model (a binary classifier) is more commonly built by solving a quadratic programming problem in the dual space, it can also be built quickly by solving the primal optimization problem. In this article a *Support Vector Machine* implementation is described that solves the *primal optimization problem* with a *sub-gradient solver* using *stochastic gradient descent*. The algorithm is called the *Pegasos* algorithm, as described by Shai Shalev-Shwartz et al. in their original paper.

- First the vanilla version and then the *kernelized* version of the *Pegasos* algorithm is described, along with some applications on a few datasets.
- Next the *hinge-loss* function for the SVM is replaced by the *log-loss* function for Logistic Regression, converting the *primal SVM* problem into *regularized logistic regression*.
- Finally, document sentiment classification is done by first training a Perceptron, an SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing them on an unseen part of the corpus.
- The time to train the classifiers along with the accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course. The descriptions of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITM and these slides from CMU, and as shown in the next figure taken from the slides, the *Soft-SVM Primal Lagrangian* can be represented as follows:

or as the following if the explicit bias term is discarded:

where the *0-1 loss* is approximated by the *hinge-loss*.
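To make the surrogate concrete, the hinge loss can be computed in a couple of lines; a small illustrative sketch (the weight and input values below are made up for illustration):

```python
import numpy as np

def hinge_loss(w, x, y):
    """Hinge loss max(0, 1 - y * (w . x)): a convex upper bound on the 0-1 loss."""
    return max(0.0, 1.0 - y * np.dot(w, x))

w = np.array([1.0, -0.5])
x = np.array([0.6, 0.4])
# this point is correctly classified (0-1 loss = 0), but it lies inside the
# margin, so the hinge loss still penalizes it
print(hinge_loss(w, x, y=1))  # 1 - 0.4 = 0.6
```

Unlike the 0-1 loss, the hinge loss keeps pushing correctly classified points out of the margin, which is what gives the SVM its large-margin behavior.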

- Changing the *regularization* constant to λ, it can be equivalently expressed using the *hinge-loss* as follows, as shown in the next figure, taken from the *Pegasos* paper.
- The next figure also describes the *Pegasos* algorithm, which performs *SGD* on the *primal objective (Lagrangian)* with carefully chosen *steps*.
- Since the *hinge-loss* is not *differentiable* everywhere, the *sub-gradient* of the objective is used in place of the gradient for a single update step with SGD.
- The learning rate η is gradually decreased over the iterations.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:

```python
class SVMPegasos(LinearClassifier):
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
        self.n_iter = n_iter
        self.lambda1 = lambda1

    def fit(self, X, Y):
        Y = list(Y)
        self.find_classes(Y)
        # convert all output values to +1 or -1
        Yn = [sign(y, self.positive_class) for y in Y]
        X = X.toarray()
        m, n_features = X.shape[0], X.shape[1]
        self.w = numpy.zeros(n_features)
        for i in range(self.n_iter):
            eta = 1. / (self.lambda1 * (i + 1))
            j = numpy.random.choice(m, 1)[0]
            x, y = X[j], Yn[j]
            score = self.w.dot(x)
            if y * score < 1:
                self.w = (1 - eta * self.lambda1) * self.w + eta * y * x
            else:
                self.w = (1 - eta * self.lambda1) * self.w
```
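Since the class above relies on a LinearClassifier base class and helper functions from the course framework, a fully self-contained sketch of the same update rule may be easier to experiment with (the function name and the synthetic dataset below are mine, not part of the original implementation):

```python
import numpy as np

def pegasos_fit(X, y, lambda1=0.1, n_iter=2000, seed=0):
    """Pegasos sub-gradient updates; y must be coded as +1 / -1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        eta = 1.0 / (lambda1 * t)       # decaying step length eta = 1/(lambda*t)
        j = rng.integers(X.shape[0])    # pick one random training point
        if y[j] * w.dot(X[j]) < 1:      # inside the margin or misclassified
            w = (1 - eta * lambda1) * w + eta * y[j] * X[j]
        else:                           # only shrink (regularization term)
            w = (1 - eta * lambda1) * w
    return w

# two well-separated Gaussian blobs as a toy linearly separable dataset
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
w = pegasos_fit(X, y)
acc = np.mean(np.sign(X.dot(w)) == y)
print('training accuracy:', acc)
```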

### Some Notes

- The optional projection step has been left out (the line in square brackets in the paper).
- As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
- The number η is the step length in gradient descent.
- The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, *Pegasos* uses a variable step length: η = 1 / (λ · t).
- Since we compute the step length by dividing by t, it gradually becomes smaller and smaller. The purpose of this is to avoid the "bounce around" problem as the iterate gets close to the optimum.
- Although the bias variable *b* in the objective function is discarded in this implementation, the paper proposes several ways to learn a (non-regularized) bias term too; the fastest is probably a binary search over a real interval after the PEGASOS algorithm returns an optimal *w*.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

- The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the PEGASOS implementation of the primal SVM classifier, respectively.
- As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
- The dataset was small (with only 200 rows) and the training phase with the PEGASOS SVM implementation ran very fast, taking only 24 milliseconds.
- Classification accuracy obtained on the test split was 95%.

## 2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to *kernel-SVM*.
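A self-contained sketch of the kernelized variant, following the α-counting formulation in the paper with a Gaussian kernel (all function names and the toy concentric-circles dataset below are mine):

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel."""
    return np.exp(-gamma * np.sum((a - b) ** 2))

def kernel_pegasos_fit(X, y, lambda1=0.1, n_iter=2000, gamma=1.0, seed=0):
    """Kernelized Pegasos: alpha[j] counts how often point j triggered an update."""
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # precompute the Gram matrix for the training set
    K = np.array([[rbf(a, b, gamma) for b in X] for a in X])
    alpha = np.zeros(m)
    for t in range(1, n_iter + 1):
        i = rng.integers(m)
        margin = y[i] * (1.0 / (lambda1 * t)) * np.sum(alpha * y * K[i])
        if margin < 1:
            alpha[i] += 1
    return alpha

def kernel_pegasos_predict(X_train, y_train, alpha, x, gamma=1.0):
    k = np.array([rbf(xj, x, gamma) for xj in X_train])
    return np.sign(np.sum(alpha * y_train * k))

# two concentric circles: a linearly non-separable toy dataset
rng = np.random.default_rng(0)
theta = rng.uniform(0, 2 * np.pi, 100)
X = np.vstack([np.c_[np.cos(theta[:50]), np.sin(theta[:50])],
               3 * np.c_[np.cos(theta[50:]), np.sin(theta[50:])]])
y = np.array([1] * 50 + [-1] * 50)
alpha = kernel_pegasos_fit(X, y)
preds = np.array([kernel_pegasos_predict(X, y, alpha, x) for x in X])
acc = np.mean(preds == y)
print('training accuracy:', acc)
```

Note that, as in the paper, the weight vector is never formed explicitly; it is represented implicitly through the counts α and the kernel evaluations.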

### Using the PEGASOS implementation to classify the following linearly non-separable datasets

### Dataset 1

- The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the *Kernelized PEGASOS* implementation for the SVM classifier, respectively, for a couple of different linearly non-separable datasets.
- As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
- The dataset was small (with only 240 rows) and the training phase with the Kernelized PEGASOS SVM implementation (a *Gaussian kernel* was used) ran very fast, taking only 2.09 seconds.
- Classification accuracy obtained on the test split was 95.8%.

### Dataset 2

- The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the *Kernelized PEGASOS* implementation for the SVM classifier, respectively, for a couple of different linearly non-separable datasets.
- As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
- The dataset was small (with only 373 rows) and the training phase with the *Kernelized PEGASOS SVM* implementation (a *Gaussian kernel* was used) ran very fast, taking only 5.1 seconds.
- Classification accuracy obtained on the test split was 100%.

## 3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from *hinge-loss* to *log-loss*) in the *PEGASOS* algorithm, a logistic regression model can be trained.
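A minimal sketch of this swap: the hinge sub-gradient step is replaced by the (smooth) log-loss gradient, while the Pegasos-style decaying step length is kept unchanged (function names and the toy dataset are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pegasos_logreg_fit(X, y, lambda1=0.1, n_iter=2000, seed=0):
    """Pegasos-style SGD on the regularized log-loss; y coded as +1 / -1."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iter + 1):
        eta = 1.0 / (lambda1 * t)
        j = rng.integers(X.shape[0])
        # d/dw log(1 + exp(-y w.x)) = -y * x * sigmoid(-y w.x)
        g = sigmoid(-y[j] * w.dot(X[j]))
        w = (1 - eta * lambda1) * w + eta * y[j] * g * X[j]
    return w

# same kind of toy dataset: two Gaussian blobs with some overlap
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (100, 2)), rng.normal(2, 1, (100, 2))])
y = np.array([-1] * 100 + [1] * 100)
w = pegasos_logreg_fit(X, y)
acc = np.mean(np.sign(X.dot(w)) == y)
print('training accuracy:', acc)
```

Unlike the hinge case, every point contributes a (possibly tiny) gradient on every update, since the log-loss never becomes exactly zero.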

### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

- The following animation and the figure show the final decision surface and how the decision surface (boundary) changes with single-point update-steps with SGD for the PEGASOS implementation for the *Logistic Regression* classifier, respectively.
- The dataset was small (with only 200 rows) and the training phase with the PEGASOS LR implementation ran very fast, taking only 27 milliseconds.
- Classification accuracy obtained on the test split was 97.5%.

## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let's fit a Perceptron model, along with SVM / LR models trained with the PEGASOS algorithm, on an 80% sample and classify the sentiments of the 20% held-out texts, then compute the accuracy. The next figure shows the first few rows of the corpus and the number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

- Create an *sklearn Pipeline* to:
    - Compute simple BOW features (*CountVectorizer* from *sklearn*): obtained 51063 columns with BOW.
    - Use *sklearn SelectKBest* to choose the top 5000 features.
    - Train / test the model.
- Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
- Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
- Also record the time taken to train the classifier.
- Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the *text-classification pipeline* is implemented with *SVMPegasos*.

```python
def classify_with_svm():
    # read all the documents
    X, Y = read_corpus('all_sentiment_shuffled.txt')
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
                                                    random_state=0)
    classifier = Pipeline([('vec', CountVectorizer(preprocessor=lambda x: x,
                                                   tokenizer=lambda x: x)),
                           ('fs', SelectKBest(k=5000)),
                           ('cls', SVMPegasos(n_iter=len(Xtrain)*10,
                                              lambda1=0.015))])
    t0 = time.time()
    classifier.fit(Xtrain, Ytrain)
    t1 = time.time()
    print('Training time:', t1-t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))
```

### Some Notes

- The accuracy of the classifier can be improved by applying more text-mining techniques, such as pre-processing and including language-model / tf-idf features.
- This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
- The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., the regularization parameter λ for PEGASOS LR/SVM, the number of training iterations, etc.).
- For Perceptron the entire training data was passed 10 times to the model while training.
- For the PEGASOS SGD implementations of regularized SVM and logistic regression, the number of iterations used was 10 × the size of the training dataset. Also, the λ value used was 0.01.

## Results

As can be seen, even with very simple features, all the classifiers perform well and are comparable to each other, both in prediction accuracy on the test set and in the time taken to train the model.