In this article, a couple of implementations of the support vector machine (SVM) binary classifier with quadratic programming libraries (in R and Python, respectively) and their application to a few datasets are discussed. The following video lectures / tutorials / links have been very useful for the implementation:

- this one from MIT AI course
- this one from Berkeley L&DM course
- this one from UCF CV course
- this one from NPTEL/IITKGP ML course
- this one and this one from pythonprogramming.net
- this one on CVXOPT from MIT
- this repository from github
- this one from python CVXOPT
- this one from R quadprog

The next figure, taken from here, describes the basics of the soft-margin SVM (without kernels).

**SVM in a nutshell**

- Given a (training) dataset consisting of positive and negative class instances.
- The objective is to find a *maximum-margin* classifier, in terms of a *hyperplane* (the vectors *w* and *b*), that separates the positive and negative instances (in the training dataset).
- If the dataset is *noisy* (with some overlap between the positive and negative samples), there will be some error in classifying them with the hyperplane.
- In the latter case the objective is to minimize the classification errors along with maximizing the margin, and the problem becomes a soft-margin SVM (as opposed to the hard-margin SVM without slack variables).
- A *slack* variable per training point is introduced to include the classification errors (for the misclassified points in the training dataset) in the objective; this can also be thought of as adding regularization.
- The optimization problem is *quadratic* in nature, since it has a quadratic objective with linear constraints.
- It is easier to solve the optimization problem in the *dual* rather than the *primal* space, since there are fewer variables.
- Hence the optimization problem is often solved in the *dual* space by converting the *minimization* to a *maximization* problem (keeping in mind the *weak/strong duality theorem* and the *complementary slackness conditions*), by first constructing the *Lagrangian* and then using the *KKT conditions* for a *saddle* point.
- If the dataset is not *linearly separable*, the *kernel trick* is used to conceptually map the data points to some higher-dimensional space, only by computing the *(kernel) Gram matrix* / dot-products of the data points (the matrix needs to be positive semi-definite, as per *Mercer's theorem*).
- Some popular *kernel* functions are the *linear*, *polynomial*, and *Gaussian* (*RBF*, corresponding to an *infinite-dimensional* space) kernels.
- The *dual optimization* problem is solved (with standard *quadratic programming* packages) and the solution is found in terms of a few support vectors (defining the linear/non-linear decision boundary; the SVs correspond to the non-zero values of the dual variables / the primal Lagrange multipliers), hence the name SVM.
- Once the *dual* optimization problem is solved, the values of the primal variables are computed to construct the hyperplane / decision surface.
- Finally, the *dual* and *primal* variables (optimum values obtained from the solutions) are used together to *predict* the class of a new (unseen) data point.
- The *hyper-parameters* (e.g., *C*) can be tuned to fit different models, choosing the most accurate one on a held-out (validation) dataset.

The following figure describes the *soft-margin SVM* in a more formal way.
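
In case the figure does not render, the standard soft-margin formulation (consistent with the description above, with slack variables *ξᵢ* and dual variables *αᵢ*) can be written as:

```latex
% Soft-margin primal
\min_{w,\,b,\,\xi}\ \tfrac{1}{2}\|w\|^2 + C\sum_{i=1}^{n} \xi_i
\quad \text{s.t.} \quad y_i\,(w^\top x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0

% Dual (via the Lagrangian and the KKT conditions)
\max_{\alpha}\ \sum_{i=1}^{n} \alpha_i \;-\; \tfrac{1}{2} \sum_{i=1}^{n}\sum_{j=1}^{n} \alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j)
\quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_i y_i = 0, \qquad 0 \le \alpha_i \le C
```

The hard-margin dual is the same problem without the upper bound *C* on the *αᵢ*.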

The following figures show how the *SVM dual quadratic programming* problem can be formulated using the R *quadprog* *QP solver* (following the QP formulation in the R package *quadprog*).
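
In case those figures are unavailable, the mapping can be sketched as follows (using the notation above; *quadprog* solves the minimization form of the dual):

```latex
% quadprog standard form: \min_{b}\ \tfrac{1}{2}\, b^\top D\, b - d^\top b
% \text{s.t. } A^\top b \ge b_0, \text{ with the first } meq \text{ constraints as equalities}
\min_{\alpha}\ \tfrac{1}{2}\,\alpha^\top \big[(y y^\top) \circ K\big]\,\alpha \;-\; \mathbf{1}^\top \alpha
\quad \text{s.t.} \quad y^\top \alpha = 0 \;(\text{meq}=1), \qquad \alpha \ge \mathbf{0}, \qquad -\alpha \ge -C\,\mathbf{1} \;\text{(soft margin only)}
```

Here *D* = (y yᵀ) ∘ K is the element-wise product of the label outer product and the kernel Gram matrix, and *d* = **1**.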

The following figures show how the *SVM dual quadratic programming* problem can be formulated using the Python *CVXOPT QP solver* (following the QP formulation in the python library *CVXOPT*).
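
Again, in case the figures are missing, the CVXOPT mapping can be sketched as (using the notation above):

```latex
% CVXOPT standard form: \min_{x}\ \tfrac{1}{2}\, x^\top P\, x + q^\top x
% \text{s.t. } G x \preceq h, \quad A x = b
P = (y y^\top) \circ K, \qquad q = -\mathbf{1}, \qquad A = y^\top, \qquad b = 0

G = -I, \;\; h = \mathbf{0} \;\;\text{(hard margin)}; \qquad
G = \begin{bmatrix} -I \\ I \end{bmatrix}, \;\; h = \begin{bmatrix} \mathbf{0} \\ C\,\mathbf{1} \end{bmatrix} \;\;\text{(soft margin)}
```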

The following R code snippet shows how a *kernelized* (*soft/hard-margin*) *SVM* model can be fitted by solving the dual *quadratic optimization* problem.

```r
library(quadprog)
library(Matrix)

linear.kernel <- function(x1, x2) {
  return (x1 %*% x2)
}

svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
  n.samples <- nrow(X)
  n.features <- ncol(X)
  # Gram matrix
  K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
  for (i in 1:n.samples){
    for (j in 1:n.samples){
      K[i,j] <- FUN(X[i,], X[j,])
    }
  }
  Dmat <- outer(y, y) * K
  Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to the nearest positive-definite matrix
  dvec <- rep(1, n.samples)
  if (!is.null(C)) { # soft-margin
    Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
    bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
  } else {           # hard-margin
    Amat <- rbind(y, diag(n.samples))
    bvec <- c(0, rep(0, n.samples))
  }
  res <- solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
  a <- res$solution # Lagrange multipliers
  # Support vectors have non-zero Lagrange multipliers
  # ...
}
```

The following Python code snippet, adapted from here and from *Mathieu Blondel's blog*, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the *dual quadratic optimization* problem.

```python
import numpy as np
import cvxopt

def fit(X, y, kernel, C):
    n_samples, n_features = X.shape
    # Compute the Gram matrix
    K = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            K[i,j] = kernel(X[i], X[j])
    # construct the P, q, A, b, G, h matrices for CVXOPT
    P = cvxopt.matrix(np.outer(y, y) * K)
    q = cvxopt.matrix(np.ones(n_samples) * -1)
    A = cvxopt.matrix(y, (1, n_samples))
    b = cvxopt.matrix(0.0)
    if C is None:  # hard-margin SVM
        G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h = cvxopt.matrix(np.zeros(n_samples))
    else:          # soft-margin SVM
        G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1),
                                     np.identity(n_samples))))
        h = cvxopt.matrix(np.hstack((np.zeros(n_samples),
                                     np.ones(n_samples) * C)))
    # solve the QP problem
    solution = cvxopt.solvers.qp(P, q, G, h, A, b)
    # Lagrange multipliers
    a = np.ravel(solution['x'])
    # Support vectors have non-zero Lagrange multipliers
    sv = a > 1e-5  # some small threshold
    # ...
```
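
The snippet above stops at finding the support vectors. A minimal sketch of the remaining steps (recovering the intercept *b* and predicting, as described in the nutshell section) might look as follows; the function names `decision_function` and `predict` are my own, not from the original post, and `a`, `sv` are the QP outputs from the snippet above:

```python
import numpy as np

def decision_function(X_new, a, sv, X, y, kernel):
    """f(x) = sum over support vectors of a_i y_i K(x_i, x) + b.

    The intercept b is recovered (by complementary slackness) by averaging
    y_s - sum_j a_j y_j K(x_j, x_s) over the support vectors.
    """
    a_sv, X_sv, y_sv = a[sv], X[sv], y[sv]
    # intercept: average over all support vectors for numerical stability
    b = np.mean([y_s - sum(a_j * y_j * kernel(x_j, x_s)
                           for a_j, y_j, x_j in zip(a_sv, y_sv, X_sv))
                 for x_s, y_s in zip(X_sv, y_sv)])
    return np.array([sum(a_j * y_j * kernel(x_j, x)
                         for a_j, y_j, x_j in zip(a_sv, y_sv, X_sv)) + b
                     for x in X_new])

def predict(X_new, a, sv, X, y, kernel):
    # predicted class label in {-1, +1}
    return np.sign(decision_function(X_new, a, sv, X, y, kernel))
```

For example, for the two-point dataset X = [[1], [-1]], y = [1, -1] with a linear kernel, the dual solution is a = [0.5, 0.5], which gives w = [1], b = 0, so `predict` returns +1 for positive inputs and -1 for negative ones.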

## Notes

- Since the *objective function* for the *QP* is *convex* if and only if the matrix *P* (in Python CVXOPT) or *Dmat* (in R quadprog) is positive semi-definite, it needs to be ensured that the corresponding matrix for the SVM is PSD too.
- The corresponding matrix is computed from the kernel Gram matrix (which is PSD, i.e., non-negative-definite, by Mercer's theorem) and the labels from the data. Due to numerical errors, a few eigenvalues of the matrix often tend to be very small negative values.
- Although Python *CVXOPT* will allow very small numerical errors in the P matrix with a warning message, R *quadprog* strictly requires that the Dmat matrix be positive definite, otherwise it will fail.
- Hence, with R *quadprog* the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly when it contains very small negative eigenvalues, which is quite common, since *D* comes from the data).
- A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by the complementary slackness conditions).
- The cases corresponding to the hard- and soft-margin SVMs must be handled separately, otherwise they lead to an inconsistent system of constraints.
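
The PSD repair step (what `nearPD` does in the R snippet) can be sketched in Python as well, here via simple eigenvalue clipping; this is my own minimal sketch, not the algorithm `Matrix::nearPD` actually uses internally:

```python
import numpy as np

def nearest_psd(M, eps=0.0):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues (a rough analogue of R's Matrix::nearPD)."""
    S = (M + M.T) / 2.0                # symmetrize first
    w, V = np.linalg.eigh(S)           # eigendecomposition of the symmetric part
    w_clipped = np.clip(w, eps, None)  # drop the small negative eigenvalues
    return V @ np.diag(w_clipped) @ V.T
```

For example, the matrix [[1, 1.0001], [1.0001, 1]] has a tiny negative eigenvalue (about -1e-4), exactly the kind of numerical defect described above; after the projection all eigenvalues are non-negative and `solve.QP`-style solvers would accept it.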

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, *80-20 validation* is used to:

- First *fit* (*train*) the model on a randomly selected 80% of the samples.
- *Predict* (*test*) on the *held-out* (remaining 20%) portion of the dataset and compute the *accuracy*.
- Different values of the *hyper-parameter* *C* and different *kernels* are used.
- For the *polynomial* kernel a polynomial of *degree 3* is used, and the *RBF* kernel is used with a *standard deviation of 5*, although these hyper-parameters can be tuned too.
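
The 80-20 split and accuracy computation can be sketched as below; the helper names are mine, not from the original post:

```python
import numpy as np

def train_test_split_80_20(X, y, seed=0):
    """Randomly hold out 20% of the samples for validation."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))      # shuffle the sample indices
    cut = int(0.8 * len(X))            # 80% of the samples for training
    tr, te = idx[:cut], idx[cut:]
    return X[tr], y[tr], X[te], y[te]

def accuracy(y_true, y_pred):
    """Fraction of held-out labels predicted correctly."""
    return np.mean(y_true == y_pred)
```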

### Results

As can be seen from the results below,

- The points with blue circles are the *support vectors*.
- When the C value is *high* (close to the *hard-margin SVM*), the model learnt tends to *overfit* the training data.
- When the C value is *low*, more slack is allowed and the model learnt tends to be more *generalizable* (C acts as a regularization hyper-parameter).
- More support vectors are required to define the decision surface for the hard-margin SVM than for the soft-margin SVM on datasets that are not linearly separable.
- The *linear* (and sometimes the *polynomial*) kernel performs pretty badly on the datasets that are not linearly separable.
- The decision boundaries are also shown.

#### With the Python (CVXOPT) implementation

#### With the R (quadprog) implementation


How about implementing the primal SVM as a quadratic program?

I am wondering if it's possible or not.

The primal SVM is generally implemented with the SGD-based algorithm PEGASOS. It can be implemented with a quadratic program too, but I think it's easier to solve, and to use the kernel trick, in the dual space.
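
For reference, a minimal sketch of that PEGASOS-style primal solver (stochastic sub-gradient descent on the regularized hinge loss, without an intercept; this follows the standard algorithm, not code from this post) could look like:

```python
import numpy as np

def pegasos(X, y, lam=0.01, n_iters=1000, seed=0):
    """Minimize the primal objective lam/2 * ||w||^2 + mean hinge loss
    by stochastic sub-gradient descent. Labels y are in {-1, +1}."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for t in range(1, n_iters + 1):
        i = rng.integers(len(X))       # pick one random training sample
        eta = 1.0 / (lam * t)          # the standard 1/(lambda * t) step size
        if y[i] * np.dot(w, X[i]) < 1:
            # hinge loss active: sub-gradient is lam*w - y_i*x_i
            w = (1 - eta * lam) * w + eta * y[i] * X[i]
        else:
            # hinge loss inactive: only the regularizer contributes
            w = (1 - eta * lam) * w
    return w
```

On a linearly separable toy set such as X = [[2,0], [1,0], [-1,0], [-2,0]] with y = [1, 1, -1, -1], the learnt *w* separates the two classes by the sign of w·x.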