# Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine (a binary classifier) is more commonly built by solving a quadratic programming problem in the dual space, it can also be built quickly by solving the primal optimization problem. In this article a Support Vector Machine implementation is described that solves the primal optimization problem with a sub-gradient solver using stochastic gradient descent. The algorithm is called Pegasos, as described by Shai Shalev-Shwartz et al. in their original paper.

• First the vanilla version and then the kernelized version of the Pegasos algorithm are described, along with applications on some datasets.
• Next the hinge-loss function for the SVM is replaced by the log-loss function for Logistic Regression, converting the primal SVM problem into regularized logistic regression.
• Finally, document sentiment classification is done by first training a Perceptron, an SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing them on an unseen part of the corpus.
• The time to train the classifiers, along with the accuracy obtained on a held-out dataset, is computed.

Most of the above problems appeared as an assignment in this course. The descriptions of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITM, these slides from CMU, and also shown in the next figure taken from the slides, the Soft-SVM primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

• Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
• The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
• Since the hinge-loss is not differentiable everywhere, a sub-gradient of the objective is used in place of the gradient for each single-example SGD update step.
• The learning rate η is gradually decreased with iteration.
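Putting the bullets above in symbols (for the bias-free formulation, consistent with the implementation that follows), the regularized objective and the Pegasos update step are:

```latex
\min_{w}\; f(w) \;=\; \frac{\lambda}{2}\|w\|^2 \;+\; \frac{1}{m}\sum_{i=1}^{m}\max\{0,\; 1 - y_i \langle w, x_i\rangle\}
```

and, at step t with a randomly drawn example (x_i, y_i) and step size η_t = 1/(λ·t):

```latex
w_{t+1} \;=\; (1-\eta_t\lambda)\,w_t \;+\; \eta_t\, y_i x_i\,\mathbb{1}[\,y_i \langle w_t, x_i\rangle < 1\,]
```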

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:

```python
import numpy

class SVMPegasos(LinearClassifier):  # LinearClassifier, sign: course helper code
    """Implementation of SVM with SGD with PEGASOS Algorithm"""

    def __init__(self, n_iter=10, lambda1=1):
        self.n_iter = n_iter
        self.lambda1 = lambda1

    def fit(self, X, Y):
        Y = list(Y)
        self.find_classes(Y)
        # convert all output values to +1 or -1
        Yn = [sign(y, self.positive_class) for y in Y]
        X = X.toarray()
        m, n_features = X.shape[0], X.shape[1]
        self.w = numpy.zeros(n_features)
        for i in range(self.n_iter):
            eta = 1. / (self.lambda1*(i+1))   # variable step length: eta = 1/(lambda*t)
            j = numpy.random.choice(m, 1)[0]  # pick a random training example
            x, y = X[j], Yn[j]
            score = self.w.dot(x)
            if y*score < 1:   # margin violated: shrink w and take a gradient step
                self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
            else:             # only the regularization shrinkage
                self.w = (1 - eta*self.lambda1)*self.w
```

### Some Notes

• The optional projection step has been left out (the line in square brackets in the paper).
•  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
•  The number η is the step length in gradient descent.
• The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
• Since we compute the step length by dividing by t, it gradually becomes smaller and smaller. The purpose of this is to avoid the “bouncing around” problem as the iterate gets close to the optimum.
• Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a (non-regularized) bias term too; the fastest is probably a binary search on a real interval after the PEGASOS algorithm returns an optimum w.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and how the decision boundary changes with single-point update-steps of SGD, respectively, for the PEGASOS implementation of the primal SVM classifier.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the training phase with the PEGASOS SVM implementation ran very fast, taking only 24 milliseconds.
• Classification accuracy obtained on the test split was 95%.

## 2. Kernelized PEGASOS

The next figure, taken from the same paper, shows how the algorithm can be adapted to kernel-SVM.
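As a rough sketch of the idea (not the paper's exact pseudocode): the kernelized variant never materializes w, but instead maintains a count α_j of how many times example j triggered an update. The Gaussian kernel and all helper names below are my own choices for illustration:

```python
import numpy as np

def rbf_kernel(x1, x2, sigma=1.0):
    # Gaussian (RBF) kernel between two vectors
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def kernelized_pegasos(X, y, lambda1=0.1, n_iter=2000, kernel=rbf_kernel, seed=0):
    """Kernelized Pegasos sketch: alpha[j] counts how often example j
    violated the margin under the implicitly represented w."""
    rng = np.random.RandomState(seed)
    m = X.shape[0]
    alpha = np.zeros(m)
    # precompute the Gram matrix
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    for t in range(1, n_iter + 1):
        j = rng.randint(m)
        # decision value for example j under the current alpha
        score = (1.0 / (lambda1 * t)) * np.sum(alpha * y * K[:, j])
        if y[j] * score < 1:      # margin violated: count one more update
            alpha[j] += 1
    return alpha

def kernel_predict(alpha, X, y, x_new, lambda1, T, kernel=rbf_kernel):
    # sign of the kernel expansion defines the predicted class
    k = np.array([kernel(xi, x_new) for xi in X])
    return np.sign((1.0 / (lambda1 * T)) * np.sum(alpha * y * k))
```

The decision function is a kernel expansion over the training points, which is what allows the non-separable datasets in this section to be handled.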

### Dataset 1

• The following animation and the figure show the final decision surface and how the decision boundary changes with single-point update-steps of SGD for the kernelized PEGASOS implementation of the SVM classifier, respectively, for a couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 240 rows only) and the training phase with the kernelized PEGASOS SVM implementation (a Gaussian kernel was used) ran very fast, taking only 2.09 seconds.
• Classification accuracy obtained on the test split was 95.8%.

### Dataset 2

• The following animation and the figure show the final decision surface and how the decision boundary changes with single-point update-steps of SGD for the kernelized PEGASOS implementation of the SVM classifier, respectively, for a couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 373 rows only) and the training phase with the kernelized PEGASOS SVM implementation (a Gaussian kernel was used) ran very fast, taking only 5.1 seconds.
• Classification accuracy obtained on the test split was 100%.

## 3. From SVM to Logistic Regression

The following figures show how, by changing the loss function (from the hinge-loss to the log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.
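A minimal sketch of the resulting update (the only change from the hinge-loss version is that the sub-gradient step is replaced by the smooth logistic-loss gradient; the function names are mine):

```python
import numpy as np

def sigmoid(z):
    # numerically clipped logistic function
    return 1.0 / (1.0 + np.exp(-np.clip(z, -500, 500)))

def pegasos_logistic(X, y, lambda1=0.1, n_iter=5000, seed=0):
    """Pegasos-style SGD on the L2-regularized log-loss.

    The gradient of log(1 + exp(-y w.x)) w.r.t. w is -y * x * sigmoid(-y w.x),
    which replaces the hinge sub-gradient -y * x * 1[y w.x < 1]."""
    rng = np.random.RandomState(seed)
    m, n = X.shape
    w = np.zeros(n)
    for t in range(1, n_iter + 1):
        eta = 1.0 / (lambda1 * t)   # same decreasing step size as before
        j = rng.randint(m)
        x, yy = X[j], y[j]
        # regularization shrinkage + logistic-loss gradient step
        w = (1 - eta * lambda1) * w + eta * yy * x * sigmoid(-yy * w.dot(x))
    return w
```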


### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and how the decision boundary changes with single-point update-steps of SGD, respectively, for the PEGASOS implementation of the Logistic Regression classifier.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the training phase with the PEGASOS LR implementation ran very fast, taking only 27 milliseconds.
• Classification accuracy obtained on the test split was 97.5%.

## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model, along with SVM / LR models trained with the PEGASOS algorithm, on a random 80% sample, then classify the sentiments of the 20% held-out texts and compute the accuracy. The next figure shows the first few rows of the corpus and the number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

• Create an sklearn Pipeline to
  • compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW, and
  • use sklearn SelectKBest to choose the top 5000 features.
• Train / test the model:
  • Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
  • Predict sentiments on the held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
  • Also record the time taken to train the classifier.
• Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

```python
import time
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest
from sklearn.metrics import accuracy_score

def classify_with_svm():
    # split into training and test parts
    Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
                                                    random_state=0)
    classifier = Pipeline([('vec', CountVectorizer(preprocessor=lambda x: x,
                                                   tokenizer=lambda x: x)),
                           ('fs', SelectKBest(k=5000)),
                           ('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))])
    t0 = time.time()
    classifier.fit(Xtrain, Ytrain)
    t1 = time.time()
    print('Training time:', t1 - t0, 'seconds.')
    Yguess = classifier.predict(Xtest)
    print('accuracy:', accuracy_score(Ytest, Yguess))
```

### Some Notes

• The accuracy of the classifier can be improved by applying more text-mining techniques, such as better pre-processing and language-model / tf-idf features.
• This experiment was just to test/compare the SVM / LR PEGASOS implementations with the Perceptron.
• The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., the regularization parameter λ for PEGASOS LR/SVM, the number of training iterations, etc.).
• For the Perceptron, the entire training dataset was passed through the model 10 times during training.
• For the PEGASOS SGD implementations of regularized SVM and logistic regression, the number of iterations used was 10 × the size of the training dataset, and the λ value used was 0.01.

## Results

As can be seen, even with very simple features, the performance of all the classifiers, in terms of prediction accuracy on the test set and the time taken to train the models, is good and comparable across classifiers.

# Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, a couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and Python, respectively) and their application on a few datasets are discussed. The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

## SVM in a nutshell

• Given a (training) dataset consisting of positive and negative class instances.
• Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
• If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
• In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
• A slack variable per training point is introduced to include the classification errors (for the misclassified points in the training dataset) in the objective; this can also be thought of as adding regularization.
• The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
• It is easier to solve the optimization problem in the dual rather than the primal space, since there are fewer variables.
• Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
• If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensional space, only by computing the (kernel) Gram matrix / dot-products of the datapoints (the matrix needs to be positive semi-definite, as per Mercer’s theorem).
• Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
• The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-linear decision boundary; SVs correspond to the non-zero values of the dual variables / the primal Lagrange multipliers), hence the name SVM.
• Once the dual optimization problem is solved, the values of the primal variables are computed to construct the hyper-plane / decision surface.
• Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
• The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.
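The popular kernels just mentioned can be written down concretely; the sketch below (parameter values are illustrative) also includes a quick numerical check of Mercer's positive semi-definiteness condition on the resulting Gram matrix:

```python
import numpy as np

def linear_kernel(x1, x2):
    return np.dot(x1, x2)

def polynomial_kernel(x1, x2, degree=3, c=1.0):
    return (np.dot(x1, x2) + c) ** degree

def gaussian_kernel(x1, x2, sigma=5.0):
    # RBF kernel, corresponding to an infinite-dimensional feature space
    return np.exp(-np.sum((x1 - x2) ** 2) / (2 * sigma ** 2))

def compute_gram(X, kernel):
    # (kernel) Gram matrix of pairwise dot-products in feature space
    n = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])

def is_psd(K, tol=1e-6):
    # Mercer's condition in practice: all eigenvalues numerically non-negative
    return bool(np.all(np.linalg.eigvalsh(K) >= -tol))
```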

The following figure describes the soft-margin SVM in a more formal way.

The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).

The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).

The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

```r
library(quadprog)
library(Matrix)

linear.kernel <- function(x1, x2) {
  return (x1 %*% x2)
}

svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
  n.samples <- nrow(X)
  n.features <- ncol(X)
  # Gram matrix
  K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
  for (i in 1:n.samples){
    for (j in 1:n.samples){
      K[i,j] <- FUN(X[i,], X[j,])
    }
  }
  Dmat <- outer(y,y) * K
  Dmat <- as.matrix(nearPD(Dmat)$mat) # convert Dmat to the nearest pd matrix
  dvec <- rep(1, n.samples)
  if (!is.null(C)) { # soft-margin
    Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
    bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
  } else {           # hard-margin
    Amat <- rbind(y, diag(n.samples))
    bvec <- c(0, rep(0, n.samples))
  }
  res <- solve.QP(Dmat, dvec, t(Amat), bvec=bvec, meq=1)
  a <- res$solution # Lagrange multipliers
  # Support vectors have non-zero Lagrange multipliers
  # ...
}
```

The following python code snippet, adapted from here and from Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

```python
import numpy as np
import cvxopt

def fit(X, y, kernel, C):
    n_samples, n_features = X.shape
    # Compute the Gram matrix
    K = np.zeros((n_samples, n_samples))
    for i in range(n_samples):
        for j in range(n_samples):
            K[i,j] = kernel(X[i], X[j])
    # construct P, q, A, b, G, h matrices for CVXOPT
    P = cvxopt.matrix(np.outer(y,y) * K)
    q = cvxopt.matrix(np.ones(n_samples) * -1)
    A = cvxopt.matrix(y, (1, n_samples))
    b = cvxopt.matrix(0.0)
    if C is None:      # hard-margin SVM
        G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
        h = cvxopt.matrix(np.zeros(n_samples))
    else:              # soft-margin SVM
        G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1),
                                     np.identity(n_samples))))
        h = cvxopt.matrix(np.hstack((np.zeros(n_samples),
                                     np.ones(n_samples) * C)))
    # solve QP problem
    solution = cvxopt.solvers.qp(P, q, G, h, A, b)
    # Lagrange multipliers
    a = np.ravel(solution['x'])
    # Support vectors have non-zero Lagrange multipliers
    sv = a > 1e-5  # some small threshold
    # ...
```

## Notes

1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
3. Although python CVXOPT will allow very small numerical errors in the P matrix with a warning message, R quadprog strictly requires that the Dmat matrix be positive definite, otherwise it will fail.
4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
6. Cases corresponding to the hard- and soft-margin SVMs must be handled separately, otherwise they will lead to an inconsistent system of constraints.
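Continuing the snippets above, once the multipliers a are available the intercept can be recovered from the support vectors and used for prediction; this is a hedged sketch (the function name and threshold are mine):

```python
import numpy as np

def svm_predict_from_dual(a, X, y, Xtest, kernel, threshold=1e-5):
    """Compute the intercept b from the support vectors (via complementary
    slackness) and predict labels for Xtest from the dual solution a."""
    sv = a > threshold                        # support-vector mask
    a_sv, X_sv, y_sv = a[sv], X[sv], y[sv]
    # b: average over SVs of  y_k - sum_i a_i y_i K(x_i, x_k)
    b = np.mean([y_k - np.sum(a_sv * y_sv *
                              np.array([kernel(x_i, x_k) for x_i in X_sv]))
                 for x_k, y_k in zip(X_sv, y_sv)])
    scores = [np.sum(a_sv * y_sv *
                     np.array([kernel(x_i, x) for x_i in X_sv])) + b
              for x in Xtest]
    return np.sign(scores)
```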

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, the 80-20 validation scheme is used to

• First fit (train) the model on randomly selected 80% samples of the dataset.
• Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
• Different values of the hyper-parameter C and different kernels are used.
• For the polynomial kernel, a polynomial of degree 3 is used, and the RBF kernel is used with a standard deviation of 5, although these hyper-parameters can be tuned too.

### Results

As can be seen from the results below,

• The points with blue circles are the support vectors.
• When the C value is high (approaching the hard-margin SVM), the model learnt tends to overfit the training data.
• When the C value is low, the model learnt tends to be more generalizable (a smaller C allows more slack, so C effectively acts as a regularization knob).
• There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
• The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
• The decision boundaries are also shown.

# Deep Learning & Art: Neural Style Transfer – An Implementation with Tensorflow (using Transfer Learning with a Pre-trained VGG-19 Network) in Python

This problem appeared as an assignment in the online coursera course Convolutional Neural Networks by Prof Andrew Ng (deeplearning.ai). The description of the problem is taken straightaway from the assignment.

The Neural Style Transfer algorithm was created by Gatys et al. (2015); the paper can be found here.

In this assignment, we shall:

• Implement the neural style transfer algorithm
• Generate novel artistic images using our algorithm

Most of the algorithms we’ve studied optimize a cost function to get a set of parameter values. In Neural Style Transfer, we  shall optimize a cost function to get pixel values!

## Problem Statement

Neural Style Transfer (NST) is one of the most fun techniques in deep learning. As seen below, it merges two images, namely,

1. a “content” image (C) and
2. a “style” image (S),

to create a “generated” image (G). The generated image G combines the “content” of the image C with the “style” of image S.

In this example, we are going to generate an image of the Louvre museum in Paris (content image C), mixed with a painting by Claude Monet, a leader of the impressionist movement (style image S).

Let’s see how we can do this.

## Transfer Learning

Neural Style Transfer (NST) uses a previously trained convolutional network, and builds on top of that. The idea of using a network trained on a different task and applying it to a new task is called transfer learning.

Following the original NST paper, we shall use the VGG network. Specifically, we’ll use VGG-19, a 19-layer version of the VGG network. This model has already been trained on the very large ImageNet database, and thus has learned to recognize a variety of low-level features (at the earlier layers) and high-level features (at the deeper layers). The following figure (taken from google image search results) shows what a VGG-19 convolutional neural net looks like, without the last fully-connected (FC) layers.

We run the following code to load parameters from the pre-trained VGG-19 model serialized in a matlab file. This takes a few seconds.

```python
import pprint
pprint.pprint(model)
```

```
{'avgpool1': <tf.Tensor 'AvgPool_5:0' shape=(1, 150, 200, 64) dtype=float32>,
 'avgpool2': <tf.Tensor 'AvgPool_6:0' shape=(1, 75, 100, 128) dtype=float32>,
 'avgpool3': <tf.Tensor 'AvgPool_7:0' shape=(1, 38, 50, 256) dtype=float32>,
 'avgpool4': <tf.Tensor 'AvgPool_8:0' shape=(1, 19, 25, 512) dtype=float32>,
 'avgpool5': <tf.Tensor 'AvgPool_9:0' shape=(1, 10, 13, 512) dtype=float32>,
 'conv1_1': <tf.Tensor 'Relu_16:0' shape=(1, 300, 400, 64) dtype=float32>,
 'conv1_2': <tf.Tensor 'Relu_17:0' shape=(1, 300, 400, 64) dtype=float32>,
 'conv2_1': <tf.Tensor 'Relu_18:0' shape=(1, 150, 200, 128) dtype=float32>,
 'conv2_2': <tf.Tensor 'Relu_19:0' shape=(1, 150, 200, 128) dtype=float32>,
 'conv3_1': <tf.Tensor 'Relu_20:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_2': <tf.Tensor 'Relu_21:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_3': <tf.Tensor 'Relu_22:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv3_4': <tf.Tensor 'Relu_23:0' shape=(1, 75, 100, 256) dtype=float32>,
 'conv4_1': <tf.Tensor 'Relu_24:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_2': <tf.Tensor 'Relu_25:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_3': <tf.Tensor 'Relu_26:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv4_4': <tf.Tensor 'Relu_27:0' shape=(1, 38, 50, 512) dtype=float32>,
 'conv5_1': <tf.Tensor 'Relu_28:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_2': <tf.Tensor 'Relu_29:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_3': <tf.Tensor 'Relu_30:0' shape=(1, 19, 25, 512) dtype=float32>,
 'conv5_4': <tf.Tensor 'Relu_31:0' shape=(1, 19, 25, 512) dtype=float32>,
 'input': <tensorflow.python.ops.variables.Variable object at 0x7f7a5bf8f7f0>}
```

The next figure shows the content image (C) – the Louvre museum’s pyramid surrounded by old Paris buildings, against a sunny sky with a few clouds.

For the above content image, the activation outputs from the convolution layers are visualized in the next few figures.

## How to ensure that the generated image G matches the content of the image C?

As we know, the earlier (shallower) layers of a ConvNet tend to detect lower-level features such as edges and simple textures, and the later (deeper) layers tend to detect higher-level features such as more complex textures as well as object classes.

We would like the “generated” image G to have similar content as the input image C. Suppose we have chosen some layer’s activations to represent the content of an image. In practice, we shall get the most visually pleasing results if we choose a layer in the middle of the network – neither too shallow nor too deep.

First we need to compute the “content cost” using TensorFlow.

• The content cost takes a hidden layer activation of the neural network, and measures how different a(C) and a(G) are.
• When we minimize the content cost later, this will help make sure G has similar content as C.

```python
def compute_content_cost(a_C, a_G):
    """
    Computes the content cost

    Arguments:
    a_C -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image C
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing content of the image G

    Returns:
    J_content -- scalar that we need to compute using equation 1 above.
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape a_C and a_G
    a_C_unrolled = tf.reshape(tf.transpose(a_C), (m, n_H * n_W, n_C))
    a_G_unrolled = tf.reshape(tf.transpose(a_G), (m, n_H * n_W, n_C))

    # compute the cost with tensorflow
    J_content = tf.reduce_sum((a_C_unrolled - a_G_unrolled)**2 / (4. * n_H * n_W * n_C))

    return J_content
```

## Computing the style cost

For our running example, we will use the following style image (S). This painting was painted in the style of impressionism by Claude Monet.

```python
def gram_matrix(A):
    """
    Argument:
    A -- matrix of shape (n_C, n_H*n_W)

    Returns:
    GA -- Gram matrix of A, of shape (n_C, n_C)
    """
    GA = tf.matmul(A, tf.transpose(A))
    return GA
```

```python
def compute_layer_style_cost(a_S, a_G):
    """
    Arguments:
    a_S -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image S
    a_G -- tensor of dimension (1, n_H, n_W, n_C), hidden layer activations representing style of the image G

    Returns:
    J_style_layer -- tensor representing a scalar value, style cost defined above by equation (2)
    """
    # Retrieve dimensions from a_G
    m, n_H, n_W, n_C = a_G.get_shape().as_list()

    # Reshape the images to have them of shape (n_C, n_H*n_W)
    a_S = tf.reshape(tf.transpose(a_S), (n_C, n_H * n_W))
    a_G = tf.reshape(tf.transpose(a_G), (n_C, n_H * n_W))

    # Computing gram_matrices for both images S and G
    GS = gram_matrix(a_S)
    GG = gram_matrix(a_G)

    # Computing the loss
    J_style_layer = tf.reduce_sum((GS - GG)**2 / (4. * (n_H * n_W * n_C)**2))

    return J_style_layer
```

• The style of an image can be represented using the Gram matrix of a hidden layer’s activations. However, we get even better results by combining this representation from multiple different layers. This is in contrast to the content representation, where usually just a single hidden layer is sufficient.
• Minimizing the style cost will cause the image G to follow the style of the image S.
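In numpy terms (one consistent way to unroll the activations; the equal layer weights in the usage are an assumption), the per-layer and combined style costs can be sketched as:

```python
import numpy as np

def gram(A):
    # A has shape (n_C, n_H*n_W); Gram matrix is (n_C, n_C)
    return A.dot(A.T)

def layer_style_cost(a_S, a_G):
    # a_S, a_G: activations of shape (1, n_H, n_W, n_C)
    _, n_H, n_W, n_C = a_G.shape
    S = a_S.reshape(n_H * n_W, n_C).T      # unroll to (n_C, n_H*n_W)
    G = a_G.reshape(n_H * n_W, n_C).T
    return np.sum((gram(S) - gram(G)) ** 2) / (4.0 * (n_H * n_W * n_C) ** 2)

def total_style_cost(layers_S, layers_G, weights):
    # weighted sum of the style costs over the chosen style layers
    return sum(w * layer_style_cost(aS, aG)
               for aS, aG, w in zip(layers_S, layers_G, weights))
```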

## Defining the total cost to optimize

Finally, let’s create and implement a cost function that minimizes both the style and the content cost. The formula is:
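Consistent with the code that follows, the combined objective can be written as:

```latex
J(G) \;=\; \alpha\, J_{content}(C, G) \;+\; \beta\, J_{style}(S, G)
```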

```python
def total_cost(J_content, J_style, alpha = 10, beta = 40):
    """
    Computes the total cost function

    Arguments:
    J_content -- content cost coded above
    J_style -- style cost coded above
    alpha -- hyperparameter weighting the importance of the content cost
    beta -- hyperparameter weighting the importance of the style cost

    Returns:
    J -- total cost as defined by the formula above.
    """
    J = alpha * J_content + beta * J_style
    return J
```

• The total cost is a linear combination of the content cost J_content(C,G) and the style cost J_style(S,G).
• α and β are hyperparameters that control the relative weighting between content and style.

## Solving the optimization problem

Finally, let’s put everything together to implement Neural Style Transfer!

Here’s what the program will have to do:

• Create an Interactive Session
• Randomly initialize the image to be generated
• Build the TensorFlow graph:
  • Run the content image through the VGG19 model and compute the content cost.
  • Run the style image through the VGG19 model and compute the style cost.
  • Compute the total cost.
• Define the optimizer and the learning rate.
• Initialize the TensorFlow graph and run it for a large number of iterations, updating the generated image at every step.

Let’s first load, reshape, and normalize our “content” image (the Louvre museum picture) and “style” image (Claude Monet’s painting).

Now, we initialize the “generated” image as a noisy image created from the content_image. Initializing the pixels of the generated image to be mostly noise but still slightly correlated with the content image helps the content of the “generated” image match the content of the “content” image more rapidly. The following figure shows the noisy image:
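This initialization can be sketched as follows (the noise range and the noise_ratio default are assumptions, matching the usual assignment setup):

```python
import numpy as np

def generate_noise_image(content_image, noise_ratio=0.6, seed=0):
    """White noise weakly correlated with the content image."""
    rng = np.random.RandomState(seed)
    noise = rng.uniform(-20, 20, content_image.shape)
    # mostly noise, but still slightly correlated with the content
    return noise * noise_ratio + content_image * (1 - noise_ratio)
```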

Next, let’s load the pre-trained VGG-19 model.

To get the program to compute the content cost, we will now assign a_C and a_G to be the appropriate hidden layer activations. We will use layer conv4_2 to compute the content cost. We need to do the following:

• Assign the content image to be the input to the VGG model.
• Set a_C to be the tensor giving the hidden layer activation for layer “conv4_2”.
• Set a_G to be the tensor giving the hidden layer activation for the same layer.
• Compute the content cost using a_C and a_G.

Next, we need to compute the style cost and compute the total cost J by taking a linear combination of the two. Use alpha = 10 and beta = 40.

Then we are going to  set up the Adam optimizer in TensorFlow, using a learning rate of 2.0.

Finally, we need to initialize the variables of the tensorflow graph, assign the input image (the initial generated image) as the input of the VGG19 model, and run the model to minimize the total cost J for a large number of iterations.

## Results

The following figures / animations show the generated images (G) with different content (C) and style images (S) at different iterations in the optimization process.

Content

Style (Claude Monet’s The Poppy Field near Argenteuil)

Generated

Content

Style

Generated

Content

Style

Generated

Content

Style (Van Gogh’s The Starry Night)

Generated

Content

Style

Generated

Content (Victoria Memorial Hall)

Style (Van Gogh’s The Starry Night)

Generated

Content (Taj Mahal)

Style (Van Gogh’s Starry Night Over the Rhone)

Generated

Content

Style (Claude Monet’s Sunset in Venice)

Generated

Content (Visva Bharati)

Style (Abanindranath Tagore’s Rabindranath in the role of the blind singer)

Generated

Content (Howrah Bridge)

Style (Van Gogh’s The Starry Night)

Generated

Content (Leonardo Da Vinci’s Mona Lisa)

Style (Van Gogh’s The Starry Night)

Generated

Content (My sketch: Rabindranath Tagore)

Style (Abanindranath Tagore’s Rabindranath in the role of the blind singer)

Generated

Content (me)

Style (Van Gogh’s Irises)

Generated

Content

Style

Generated

Content

Style (Pablo Picasso’s Factory at Horta de Ebro)

Generated

The following animations show how the generated image changes with the change in VGG-19 convolution layer used for computing content cost.

Content

Style (Van Gogh’s The Starry Night)

Generated

convolution layer 3_2 used

convolution layer 4_2 used

convolution layer 5_2 used

# Some Optimization: Implementing the Orthogonal Matching Pursuit (OMP) and the Basis Pursuit (BP) Algorithms with Octave / Matlab

The following problems appeared in a project in the edX course 236862.1x: Introduction to the Fundamentals of Sparse Representations by Prof. Michael Elad from The Technion – Israel Institute of Technology. The descriptions of the problems are taken straightaway from the project.

## Pursuit Algorithms

In this article we demonstrate the Orthogonal Matching Pursuit (OMP) and Basis Pursuit (BP) algorithms by running them on a set of test signals and checking whether they provide the desired outcome for the P0 problem.

## Some Theory

Our goal is to solve the following problem:

which is also known as the P0 problem (since the objective function to be minimized is the L0 pseudo-norm of the vector x): a problem that seeks the sparsest solution to a linear system. The concept of seeking the sparsest representation for data is of central importance to data processing in a universal way (e.g., in machine learning, image processing, computer vision, remote sensing), although the problem is NP-hard in general.
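In symbols, the P0 problem is:

```latex
(P_0):\quad \min_{x}\;\|x\|_0 \quad \text{subject to}\quad Ax = b
```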

In order to solve it in practice, an approximation algorithm needs to be used, based on either greedy or relaxation-based approaches, as shown in the next figure.

In this article the implementation of the above two algorithms and some experimental results on approximately solving the P0 problem with a synthetic signal dataset will be described. The following figure describes how the experiments are to be done:

## Data Construction

In this part we create a dictionary and synthetic signals that will serve for us in the experiments.  Here are the ingredients involved:

• The matrix A is the “dictionary” (of size 50-by-100) under which the ideal signal is known to be sparse. We need to first construct this dictionary by drawing at random a matrix with i.i.d. Gaussian entries and normalizing each atom (column) to unit L2 norm.

• Then we need to generate a sparse vector x0 of cardinality s. Let’s draw at random the locations of the non-zero entries, then draw at random the value of each entry from a uniform distribution on the range [1, 3] and multiply each by a random sign.
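These two construction steps can be sketched as follows (function names are mine; seeds are fixed only for reproducibility):

```python
import numpy as np

def build_dictionary(n_rows=50, n_cols=100, seed=0):
    rng = np.random.RandomState(seed)
    A = rng.randn(n_rows, n_cols)            # i.i.d. Gaussian entries
    A /= np.linalg.norm(A, axis=0)           # normalize each atom to unit L2 norm
    return A

def generate_sparse_vector(n, s, seed=0):
    rng = np.random.RandomState(seed)
    x0 = np.zeros(n)
    support = rng.choice(n, s, replace=False)                    # random support locations
    x0[support] = rng.uniform(1, 3, s) * rng.choice([-1, 1], s)  # values in [1,3], random sign
    return x0
```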

The following figure shows the heatmap of the dictionary A randomly generated.

## Greedy Pursuit

The greedy algorithm OMP is described in the following figure:
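A hedged numpy sketch of OMP, following the figure's structure (helper names are mine):

```python
import numpy as np

def omp(A, b, s):
    """Orthogonal Matching Pursuit: greedily pick the atom most correlated
    with the current residual, then re-fit by least squares on the support."""
    residual = b.copy()
    support = []
    x = np.zeros(A.shape[1])
    for _ in range(s):
        # atom with the largest absolute correlation with the residual
        j = int(np.argmax(np.abs(A.T.dot(residual))))
        if j not in support:
            support.append(j)
        # least-squares fit of b on the atoms chosen so far
        coef, _, _, _ = np.linalg.lstsq(A[:, support], b, rcond=None)
        residual = b - A[:, support].dot(coef)
    x[support] = coef
    return x
```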

## Basis Pursuit

Even though the P0 problem is relaxed to P1, it is still not ready to be solved by an LP solver, since the objective function is still not linear. As explained in this thesis, the P1 problem can be converted to an LP by introducing a bunch of new variables and using the trick shown in the following figure.

The BP algorithm can then be implemented by feeding this LP to a standard solver.
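As a stand-in for an Octave/Matlab implementation, here is the same LP trick sketched in Python with scipy.optimize.linprog (the x = u - v splitting turns the L1 objective into a linear one; availability of scipy is assumed):

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(A, b):
    """Basis Pursuit via LP:  min ||x||_1  s.t.  Ax = b.

    Split x = u - v with u, v >= 0, so ||x||_1 = sum(u) + sum(v)
    and the problem becomes a standard-form linear program."""
    m, n = A.shape
    c = np.ones(2 * n)                 # objective: sum(u) + sum(v)
    A_eq = np.hstack([A, -A])          # A u - A v = b
    res = linprog(c, A_eq=A_eq, b_eq=b, bounds=[(0, None)] * (2 * n))
    u, v = res.x[:n], res.x[n:]
    return u - v
```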

## Analyzing the Results

In this part we compare the results obtained via the OMP and BP, by executing the following steps.

• Plot the average relative ℓ2 error, obtained by the OMP and BP versus the cardinality.
• Plot the average support recovery error, obtained by the OMP and BP versus the cardinality.
• Discuss the presented graphs and explain the results.

## Discussion

As can be seen from the above results,

• The Linear Programming-based relaxed implementation of Basis Pursuit has higher accuracy (in terms of both the L2 and the support-probability errors) than the greedy Orthogonal Matching Pursuit, particularly when the cardinality of the true solution increases beyond 6.

• Both OMP and BP (LP) perform equally well (with almost zero L2 error) up to cardinality 6, beyond which OMP clearly performs worse than BP (LP).

• The average L2 error for OMP increases up to 0.2 as the cardinality of the true solution grows to 15, whereas that of BP increases only very slightly.

• A similar pattern can be seen for the probability of error in the support.

## Mutual Coherence of A and theoretical guarantees for OMP and BP

where the mutual coherence of matrix A is defined as follows:

For our experiment, the mutual coherence μ(A) = 0.45697. Hence, as per the theoretical guarantee, OMP and BP are guaranteed to find the solution of P0 for s < 1.594164. In practice, though, we observe that up to s = 6 the solution found is quite accurate, since the theoretical bound shown above is only for the worst case.
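The quoted numbers can be reproduced with a short computation (the bound is the classical (1 + 1/μ)/2 guarantee):

```python
import numpy as np

def mutual_coherence(A):
    """mu(A): largest absolute normalized inner product between
    two distinct columns (atoms) of A."""
    A_norm = A / np.linalg.norm(A, axis=0)   # normalize the atoms
    G = np.abs(A_norm.T.dot(A_norm))         # absolute Gram matrix
    np.fill_diagonal(G, 0)                   # ignore self-correlations
    return G.max()

def recovery_bound(mu):
    # OMP/BP recover the sparsest solution whenever ||x0||_0 < (1 + 1/mu)/2
    return 0.5 * (1 + 1.0 / mu)
```

For μ(A) = 0.45697 this gives the bound ≈ 1.5942 quoted above.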