In this article, a couple of implementations of the support vector machine (SVM) binary classifier with quadratic programming libraries (in R and Python, respectively), along with their application to a few datasets, are going to be discussed. The following video lectures / tutorials / links have been very useful for the implementation:

- this one from MIT AI course
- this one from Berkeley L&DM course
- this one from UCF CV course
- this one from NPTEL/IITKGP ML course
- this one and this one from pythonprogramming.net
- this one on CVXOPT from MIT
- this repository from github
- this one from python CVXOPT
- this one from R quadprog

The next figure, taken from here, describes the basics of the soft-margin SVM (without kernels).

**SVM in a nutshell**

- Given a (training) dataset consisting of positive- and negative-class instances.
- The objective is to find a *maximum-margin* classifier, in terms of a *hyper-plane* (the vectors *w* and *b*), that separates the positive and negative instances in the training dataset.
- If the dataset is *noisy* (with some overlap between the positive and negative samples), there will be some error in classifying them with the hyper-plane.
- In the latter case the objective is to minimize the classification errors along with maximizing the margin, and the problem becomes a soft-margin SVM (as opposed to the hard-margin SVM, which has no slack variables).
- A *slack* variable per training point is introduced to include the classification errors (for the misclassified points in the training dataset) in the objective; this can also be thought of as adding regularization.
- The optimization problem is *quadratic* in nature, since it has a quadratic objective with linear constraints.
- It is easier to solve the optimization problem in the *dual* rather than the *primal* space, since there are fewer variables.
- Hence the optimization problem is often solved in the *dual* space by converting the *minimization* to a *maximization* problem (keeping in mind the *weak/strong duality theorem* and the *complementary slackness conditions*): first the *Lagrangian* is constructed, and then the *KKT conditions* for a *saddle* point are used.
- If the dataset is not *linearly separable*, the *kernel trick* is used to conceptually map the datapoints to some higher-dimensional space, only by computing the *(kernel) Gram matrix* / dot-products of the datapoints (the matrix needs to be positive semi-definite, as per *Mercer's theorem*).
- Some popular *kernel* functions are the *linear*, *polynomial* and *Gaussian* (*RBF*, corresponding to an *infinite-dimensional* space) kernels.
- The *dual optimization* problem is solved with standard *quadratic programming* packages, and the solution is found in terms of a few support vectors (which define the linear/non-linear decision boundary; the SVs correspond to the non-zero values of the dual variables / the primal Lagrange multipliers), hence the name SVM.
- Once the *dual* optimization problem is solved, the values of the primal variables are computed to construct the hyper-plane / decision surface.
- Finally, the *dual* and *primal* variables (the optimum values obtained from the solution) are used together to *predict* the class of a new (unseen) datapoint.
- The *hyper-parameters* (e.g., *C*) can be tuned to fit different models, and the most accurate one is chosen using a held-out (validation) dataset.

The following figure describes the *soft-margin SVM* in a more formal way.
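For reference, the soft-margin SVM primal problem and its dual, in the standard form (with slack variables $\xi_i$ and dual variables $\alpha_i$), are:

$$\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w\rVert^2 + C\sum_{i=1}^{n}\xi_i \quad \text{subject to}\quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i,\ \ \xi_i \ge 0,\ \ i = 1,\dots,n$$

$$\max_{\alpha}\ \sum_{i=1}^{n}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\,\alpha_j\,y_i\,y_j\,K(x_i, x_j) \quad \text{subject to}\quad \sum_{i=1}^{n}\alpha_i\,y_i = 0,\ \ 0 \le \alpha_i \le C$$

Letting $C \to \infty$ (equivalently, dropping the upper bound on $\alpha_i$) recovers the hard-margin problem.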

The following figures show how the *SVM dual quadratic programming* problem can be formulated using the R *quadprog* *QP solver* (following the QP formulation in the R package *quadprog*).
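Concretely, R's `solve.QP(Dmat, dvec, Amat, bvec, meq)` minimizes $\frac{1}{2}\alpha^\top D\alpha - d^\top\alpha$ subject to $A^\top\alpha \ge b_0$, with the first `meq` constraints treated as equalities. The SVM dual above (its maximization turned into minimizing the negated objective) therefore maps to:

$$D = (y\,y^\top)\circ K,\qquad d = \mathbf{1},\qquad y^\top\alpha = 0\ (\texttt{meq}=1),\qquad \alpha \ge 0,\qquad -\alpha \ge -C\ \text{(soft margin only)},$$

where $\circ$ denotes the element-wise (Hadamard) product.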

The following figures show how the *SVM dual quadratic programming* problem can be formulated using the Python *CVXOPT QP solver* (following the QP formulation in the python library *CVXOPT*).
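Similarly, CVXOPT's `solvers.qp(P, q, G, h, A, b)` minimizes $\frac{1}{2}x^\top P x + q^\top x$ subject to $Gx \le h$ and $Ax = b$, giving:

$$P = (y\,y^\top)\circ K,\qquad q = -\mathbf{1},\qquad A = y^\top,\ b = 0,\qquad G = \begin{bmatrix}-I\\ I\end{bmatrix},\qquad h = \begin{bmatrix}\mathbf{0}\\ C\,\mathbf{1}\end{bmatrix}$$

(for the hard margin, only the $-I$ block of $G$ and the zero block of $h$ are kept).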

The following R code snippet shows how a *kernelized* (*soft/hard-margin*) *SVM* model can be fitted by solving the dual *quadratic optimization* problem.

```r
library(quadprog)
library(Matrix)

linear.kernel <- function(x1, x2) {
  return (x1 %*% x2)
}

svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
  n.samples <- nrow(X)
  n.features <- ncol(X)
  # Gram matrix
  K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
  for (i in 1:n.samples) {
    for (j in 1:n.samples) {
      K[i,j] <- FUN(X[i,], X[j,])
    }
  }
  Dmat <- outer(y, y) * K
  Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to the nearest positive-definite matrix
  dvec <- rep(1, n.samples)
  if (!is.null(C)) { # soft-margin
    Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
    bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
  } else {           # hard-margin
    Amat <- rbind(y, diag(n.samples))
    bvec <- c(0, rep(0, n.samples))
  }
  res <- solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
  a <- res$solution # Lagrange multipliers
  # Support vectors have non-zero Lagrange multipliers
  # ...
}
```

The following python code snippet, adapted from here and from *Mathieu Blondel’s Blog*, shows how a *kernelized* (*soft/hard-margin*) *SVM* model can be fitted by solving the *dual quadratic optimization* problem.

```python
import numpy as np
import cvxopt

def fit(X, y, kernel, C):
    n_samples, n_features = X.shape
    # Compute the Gram matrix
    K = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            K[i,j] = kernel(X[i], X[j])
    # construct the P, q, A, b, G, h matrices for CVXOPT
    P = cvxopt.matrix(np.outer(y, y) * K)
    q = cvxopt.matrix(np.ones(n_samples) * -1)
    A = cvxopt.matrix(y, (1, n_samples))  # y is assumed to be a float array of +/-1 labels
    b = cvxopt.matrix(0.0)
    if C is None:  # hard-margin SVM
        G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h = cvxopt.matrix(np.zeros(n_samples))
    else:          # soft-margin SVM
        G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1),
                                     np.identity(n_samples))))
        h = cvxopt.matrix(np.hstack((np.zeros(n_samples),
                                     np.ones(n_samples) * C)))
    # solve the QP problem
    solution = cvxopt.solvers.qp(P, q, G, h, A, b)
    # Lagrange multipliers
    a = np.ravel(solution['x'])
    # Support vectors have non-zero Lagrange multipliers
    sv = a > 1e-5  # some small threshold
    # ...
```
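The elided part (recovering the bias term and making predictions from the support vectors) could look like the following minimal sketch; the helper name `bias_and_predict` and the argument `X_new` are made up for illustration, not part of the original snippet:

```python
import numpy as np

def bias_and_predict(X, y, a, sv, kernel, X_new):
    # keep only the support vectors (sv is the boolean mask computed above)
    a_sv, X_sv, y_sv = a[sv], X[sv], y[sv]
    # bias: average of y_k - sum_i a_i y_i K(x_i, x_k) over the support vectors
    b = np.mean([y_k - sum(a_i * y_i * kernel(x_i, x_k)
                           for a_i, y_i, x_i in zip(a_sv, y_sv, X_sv))
                 for x_k, y_k in zip(X_sv, y_sv)])
    # decision function: sign(sum_i a_i y_i K(x_i, x) + b)
    scores = [sum(a_i * y_i * kernel(x_i, x)
                  for a_i, y_i, x_i in zip(a_sv, y_sv, X_sv)) + b
              for x in X_new]
    return np.sign(scores)
```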

## Notes

- Since the *objective function* for the *QP* is *convex* if and only if the matrix *P* (in python CVXOPT) or *Dmat* (in R quadprog) is positive semi-definite, it needs to be ensured that the corresponding matrix for the SVM is psd too.
- The corresponding matrix is computed from the kernel Gram matrix (which is psd, or non-negative-definite, by Mercer's theorem) and the labels from the data. Due to numerical errors, a few eigenvalues of the matrix often turn out to be very small negative values.
- Although python *CVXOPT* will allow very small numerical errors in the P matrix with a warning message, R *quadprog* strictly requires that the Dmat matrix be positive definite, otherwise it will fail.
- Hence, with R *quadprog* the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly when it contains very small negative eigenvalues, which is quite common, since *D* comes from the data); a minimal sketch of one such fix appears after this list.
- A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to the non-zero Lagrange multipliers, by the complementary slackness condition).
- The cases corresponding to the hard- and soft-margin SVMs must be handled separately, otherwise they lead to an inconsistent system of constraints.
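For example, the following numpy sketch clips tiny negative eigenvalues to restore positive (semi-)definiteness. This is a simpler alternative to the `Matrix::nearPD` call used in the R snippet above; the function name `nearest_psd` and the tolerance are made up for illustration:

```python
import numpy as np

def nearest_psd(M, eps=1e-10):
    """Clip tiny negative eigenvalues so that M becomes positive (semi-)definite."""
    S = (M + M.T) / 2.0        # symmetrize first
    w, V = np.linalg.eigh(S)   # eigen-decomposition of the symmetric matrix
    return V @ np.diag(np.clip(w, eps, None)) @ V.T
```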

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, *80-20 validation* is used as follows:

- First *fit* (*train*) the model on a randomly selected 80% of the samples in the dataset.
- Then *predict* (*test*) on the *held-out* (remaining 20%) part of the dataset and compute the *accuracy*.
- Different values of the *hyper-parameter* *C* and different *kernels* are used.
- For the *polynomial* kernel, a polynomial of *degree 3* is used, and the *RBF* kernel is used with a *standard deviation of 5* (these hyper-parameters can be tuned too); a sketch of these choices follows this list.
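A minimal numpy sketch of the split and the kernel choices above; the exact kernel conventions (e.g., whether the polynomial kernel includes the +1 term) are assumptions, not taken verbatim from the post:

```python
import numpy as np

# common conventions for the kernels mentioned above (assumed forms)
def polynomial_kernel(x1, x2, degree=3):
    return (1 + np.dot(x1, x2)) ** degree

def rbf_kernel(x1, x2, sigma=5.0):
    return np.exp(-np.linalg.norm(x1 - x2) ** 2 / (2 * sigma ** 2))

def split_80_20(X, y, seed=0):
    # randomly select 80% of the samples for training, hold out the rest
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_train = int(0.8 * len(X))
    return X[idx[:n_train]], y[idx[:n_train]], X[idx[n_train:]], y[idx[n_train:]]
```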

### Results

As can be seen from the results below,

- The points with blue circles are the *support vectors*.
- When the C value is *high* (approaching the *hard-margin SVM*), the learnt model tends to *overfit* the training data.
- When the C value is *low* (a softer margin, allowing more slack), the learnt model tends to be more *generalizable* (a smaller C acts as a stronger regularizer).
- More support vectors are required to define the decision surface for the hard-margin SVM than for the soft-margin SVM on datasets that are not linearly separable.
- The *linear* (and sometimes the *polynomial*) kernel performs quite badly on the datasets that are not linearly separable.
- The decision boundaries are also shown.

#### With the Python (CVXOPT) implementation

#### With the R (quadprog) implementation
