# Implementing PEGASOS: Primal Estimated sub-GrAdient SOlver for SVM, Logistic Regression and Application in Sentiment Classification (in Python)

Although a support vector machine model (binary classifier) is more commonly built by solving a quadratic programming problem in the dual space,  it can be built fast by solving the primal optimization problem also. In this article a Support Vector Machine implementation is going to be described by solving the primal optimization problem with sub-gradient solver using stochastic gradient decent. The algorithm is called the Pegasos algorithm, as described by Shai Shalev-Shwartz et al, in their original paper.

• First the vanilla version and then the kernelized version of the the Pegasos algorithm is going to be described along with some applications on some datasets.
• Next the hinge-loss function for the SVM is going to be replaced by the log-loss function for the Logistic Regression and the primal SVM problem is going to be converted to regularized logistic regression.
• Finally document sentiment classification will be done by first training a Perceptron, SVM (with Pegasos) and a Logistic Regression classifier on a corpus and then testing it on an unseen part of the corpus.
• The time to train the classifiers along with accuracy obtained on a held-out dataset will be computed.

Most of the above problems appeared as an assignment in this course.  The description of the problems are sometimes taken from the assignment itself.

## 1. SVM implementation by minimizing the primal objective with hinge-loss using SGD with PEGASOS

As explained in these lecture slides from MIT, this video from IITMthese slides from CMU and also shown in the next figure taken from the slides, the Soft-SVM Primal Lagrangian can be represented as follows:

or as the following if the explicit bias term is discarded:

where the 0-1 loss is approximated by the hinge-loss.

• Changing the regularization constant to λ, it can be equivalently expressed using the hinge-loss as follows, as shown in the next figure, taken from the Pegasos paper.
• The next figure also describes the Pegasos algorithm, which performs an SGD on the primal objective (Lagrangian) with carefully chosen steps.
• Since the hinge-loss is not continuous, the sub-gradient of the objective is considered instead for the gradient computation for a single update step with SGD.
• The learning rate η is gradually decreased with iteration.

The following figure shows a simplified version of the algorithm:

The following python code implements the algorithm:

```
class SVMPegasos(LinearClassifier):
"""Implementation of SVM with SGD with PEGASOS Algorithm"""

def __init__(self, n_iter=10, lambda1=1):
self.n_iter = n_iter
self.lambda1 = lambda1

def fit(self, X, Y):
Y = list(Y)
self.find_classes(Y)
# convert all output values to +1 or -1
Yn = [sign(y, self.positive_class) for y in Y]
X = X.toarray()
m, n_features = X.shape[0], X.shape[1]
self.w = numpy.zeros( n_features )
for i in range(self.n_iter):
eta = 1. / (self.lambda1*(i+1))
j = numpy.random.choice(m, 1)[0]
x, y = X[j], Yn[j]
score = self.w.dot(x)
if y*score < 1:
self.w = (1 - eta*self.lambda1)*self.w + eta*y*x
else:
self.w = (1 - eta*self.lambda1)*self.w

```

### Some Notes

• The optional projection step has been left out (the line in square brackets in the paper).
•  As usual, the outputs (in the list Y) are coded as +1 for positive examples and -1 for negative examples.
•  The number η is the step length in gradient descent.
• The gradient descent algorithm may have problems finding the minimum if the step length η is not set properly. To avoid this difficulty, Pegasos uses a variable step length: η = 1 / (λ · t).
• Since we compute the step length by dividing by t, it will gradually become smaller and smaller. The purpose of this is to avoid the “bounce around”  problem as it gets close to the optimum.
• Although the bias variable b in the objective function is discarded in this implementation, the paper proposes several ways to learn a bias term (non-regularized) too, the fastest implementation is probably with the binary search on a real interval after the PEGASOS algorithm returns an optimum w.

### Using the PEGASOS SVM implementation to classify the following linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for primal SVM classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS SVM implementation ran very fast taking only 24 milliseconds.
• Classification accuracy obtained on the test split was 95%.
```

```

## 2. Kernelized PEGASOS

The next figure, taken from the same paper shows how the algorithm can be adapted to kernel-SVM.

### Dataset 1

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 240 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used)  ran very fast, taking only 2.09 seconds.
• Classification accuracy obtained on the test split was 95.8%.
```

```

### Dataset 2

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the Kernelized PGEASOS implementation for primal SVM classifier, respectively, for couple of different linearly non-separable datasets.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 373 rows only) and the training phase with the Kernelized PEGASOS SVM implementation (Gaussian Kernel was used) training ran very fast taking only 5.1 seconds.
• Classification accuracy obtained on the test split was 100%.
```

```

## 3. From SVM to Logistic Regression

The following figures show how by changing the loss function (from hinge-loss to log-loss) in the PEGASOS algorithm, a logistic regression model can be trained.

```

```

### Using the PEGASOS Logistic Regression implementation to classify the same linearly separable dataset with some added noise (overlap)

• The following animation and the figure show the final decision surface and  how the decision surface (boundary) changes with single-point update-steps with SGD for the PGEASOS implementation for the Logistic Regression classifier, respectively.
• As usual, 80% random samples from the dataset were used for training and 20% for testing. A fixed seed was used for reproducible results.
• The dataset was small (with 200 rows only) and the trainign phase with the PEGASOS LR implementation ran very fast taking only 27 milliseconds.
• Classification accuracy obtained on the test split was 97.5%.
```

```

## 4. Sentiment Classification of texts with Linear Classifiers (SGD implementations): Perceptron, SVM and Logistic Regression with PEGASOS

Given the following dataset with 11914 texts, let’s fit a Perceptron model along with SVM / LR models with PEGASOS algorithm on 80% sample and predict to classify the sentiments of the 20% held-out text, followed by computing the accuracy.  The next figure shows first few rows of the corpus and number of positive and negative sentiment examples.

The next figures show the distribution of the word lengths for the texts for different sentiments.

## Steps

• Create sklearn Pipeline to
• Compute simple BOW features (CountVectorizer from sklearn): obtained 51063 columns with BOW.
• Use sklearn SelectKBest to choose top 5000 features
• Train / test the model
• Train the classifier (Perceptron or SVM/LR with PEGASOS) on 80% of the dataset.
• Predict sentiments on held-out 20% of the dataset and compute accuracy by comparing with the ground truths.
• Also Record the time taken to train the classifier.
• Find the top positive and negative words (corresponding to the highest and lowest valued coefficients, respectively) found by the classifier.

The following python code shows how the text-classification pipeline is implemented with SVMPegasos.

```
def classify_with_svm():
# split into training and test parts
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X, Y, train_size=0.8,
random_state=0)
classifier = Pipeline( [('vec', CountVectorizer(preprocessor = lambda x: x,
tokenizer = lambda x: x)),
('fs', SelectKBest(k=5000)),
('cls', SVMPegasos(n_iter=len(Xtrain)*10, lambda1=0.015))] )
t0 = time.time()
classifier.fit(Xtrain, Ytrain)
t1 = time.time()
print('Training time:', t1-t0, 'seconds.')
Yguess = classifier.predict(Xtest)
print('accuracy:', accuracy_score(Ytest, Yguess))

```

### Some Notes

• The accuracy of the classifier can be improved with applying more text-mining techniques such as pre-processing, including language model /  tf-idf features.
• This experiment was just to test/compare the SVM / LR PEGASOS with Perceptron.
• The accuracy of the PEGASOS models can also be improved by tuning the hyper-parameters (e.g., regularization parameter λ for PEGASOS LR/SVM, number of iterations to train etc.)
• For Perceptron the entire training data was passed 10 times to the model while training.
• For PEGASOS SGD implementations of regularized SVM and logistic regression, number of iterations used were 10*size_of_training_dataset. Also, the λ value used was 0.01.

## Results

As can be seen, even with very simple features, the performance of all the classifiers in terms of accuracy of prediction on the test set and the time taken to train the model are good and comparable to each other.

# Implementing a Soft-Margin Kernelized Support Vector Machine Binary Classifier with Quadratic Programming in R and Python

In this article, couple of implementations of the support vector machine binary classifier with quadratic programming libraries (in R and python respectively) and application on a few datasets are going to be discussed.  The following video lectures / tutorials / links have been very useful for the implementation:

The next figure taken from here describes the basics of Soft-Margin SVM (without kernels).

## SVM in a nutshell

• Given a (training) dataset consisting of positive and negative class instances.
• Objective is to find a maximum-margin classifier, in terms of a hyper-plane (the vectors w and b) that separates the positive and negative instances (in the training dataset).
• If the dataset is noisy (with some overlap in positive and negative samples), there will be some error in classifying them with the hyper-plane.
• In the latter case the objective will be to minimize the errors in classification along with maximizing the margin and the problem becomes a soft-margin SVM (as opposed to the hard margin SVM without slack variables).
• A slack variable per training point is introduced to include the classification errors (for the miss-classified points in the training dataset) in the objective, this can also be thought of adding regularization.
• The optimization problem is quadratic in nature, since it has quadratic objective with linear constraints.
• It is easier to solve the optimization problem in the dual rather than the primal space, since there are less number of variables.
• Hence the optimization problem is often solved in the dual space by converting the minimization to a maximization problem (keeping in mind the weak/strong duality theorem and the complementary slackness conditions), by first constructing the Lagrangian and then using the KKT conditions for a saddle point.
• If the dataset is not linearly separable, the kernel trick is used to conceptually map the datapoints to some higher-dimensions only by computing the (kernel) gram matrix /  dot-product of the datapoints (the matrix needs to positive semi-definite as per Mercer’s theorem).
• Some popular kernel functions are the linear, polynomial, Gaussian (RBF, corresponding to the infinite dimensional space) kernels.
• The dual optimization problem is solved (with standard quadratic programming packages) and the solution is found in terms of a few support vectors (defining the linear/non-liear decision boundary, SVs correspond to the non-zero values of the dual variable / the primal Lagrange multipler), that’s why the name SVM.
• Once the dual optimization problem is solved ,  the values of the primal variables are computed to construct the hyper-plane / decision surface.
• Finally the dual and primal variables (optimum values obtained from the solutions) are used in conjunction to predict the class of a new (unseen) datapoint.
• The hyper-parameters (e.g., C) can be tuned to fit different models and choose the most accurate one from a held-out (validation) dataset.

The following figure describes the soft-margin SVM in a more formal way.

The following figures show how the SVM dual quadratic programming problem can be formulated using the R quadprog QP solver (following the QP formulation in the R package quadprog).

The following figures show how the SVM dual quadratic programming problem can be formulated using the Python CVXOPT QP solver (following the QP formulation in the python library CVXOPT).

The following R code snippet shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

```library(quadprog)
library(Matrix)
linear.kernel <- function(x1, x2) {
return (x1%*%x2)
}
svm.fit <- function(X, y, FUN=linear.kernel, C=NULL) {
n.samples <- nrow(X)
n.features <- ncol(X)
# Gram matrix
K <- matrix(rep(0, n.samples*n.samples), nrow=n.samples)
for (i in 1:n.samples){
for (j in 1:n.samples){
K[i,j] <- FUN(X[i,], X[j,])
}
}
Dmat <- outer(y,y) * K
Dmat <- as.matrix(nearPD(Dmat)\$mat) # convert Dmat to nearest pd matrix
dvec <- rep(1, n.samples)
if (!is.null(C)) { # soft-margin
Amat <- rbind(y, diag(n.samples), -1*diag(n.samples))
bvec <- c(0, rep(0, n.samples), rep(-C, n.samples))
} else {           # hard-margin
Amat <- rbind(y, diag(n.samples))
bvec <- c(0, rep(0, n.samples))
}
res <- solve.QP(Dmat,dvec,t(Amat),bvec=bvec, meq=1)
a = res\$solution # Lagrange multipliers
# Support vectors have non-zero Lagrange multipliers
# ...
}
```

The following python code snippet adapted from here and from  Mathieu Blondel’s Blog, shows how a kernelized (soft/hard-margin) SVM model can be fitted by solving the dual quadratic optimization problem.

```import numpy as np
import cvxopt
def fit(X, y, kernel, C):
n_samples, n_features = X.shape
# Compute the Gram matrix
K = np.zeros((n_samples, n_samples))
for i in range(n_samples):
for j in range(n_samples):
K[i,j] = kernel(X[i], X[j])
# construct P, q, A, b, G, h matrices for CVXOPT
P = cvxopt.matrix(np.outer(y,y) * K)
q = cvxopt.matrix(np.ones(n_samples) * -1)
A = cvxopt.matrix(y, (1,n_samples))
b = cvxopt.matrix(0.0)
if C is None:      # hard-margin SVM
G = cvxopt.matrix(np.diag(np.ones(n_samples) * -1))
h = cvxopt.matrix(np.zeros(n_samples))
else:              # soft-margin SVM
G = cvxopt.matrix(np.vstack((np.diag(np.ones(n_samples) * -1), np.identity(n_samples))))
h = cvxopt.matrix(np.hstack((np.zeros(n_samples), np.ones(n_samples) * C)))
# solve QP problem
solution = cvxopt.solvers.qp(P, q, G, h, A, b)
# Lagrange multipliers
a = np.ravel(solution['x'])
# Support vectors have non zero lagrange multipliers
sv = a > 1e-5 # some small threshold
# ...
```

## Notes

1. Since the objective function for QP is convex if and only if the matrix P (in python CVXOPT) or Dmat (in R quadprog) is positive-semidefinite, it needs to be ensured that the corresponding matrix for SVM is psd too.
2. The corresponding matrix is computed from the Kernel gram matrix (which is psd or non-negative-definite by Mercer’s theorem) and the labels from the data. Due to numerical errors, often a few eigenvalues of the matrix tend to be very small negative values.
3. Although python CVXOPT will allow very small numerical errors in P matrix with a warning message, R quardprog will strictly require that the Dmat matrix is strictly positive definite, otherwise it will fail.
4. Hence, with R quadprog the D matrix first needs to be converted to a positive definite matrix using some algorithm (particularly in case when it contains very small negative eigenvalues, which is quite common, since D comes from the data).
5. A small threshold (e.g., 1e-5) is chosen to find the support vectors (corresponding to non-zero Lagrange multipliers, by complementary slackness condition).
6. Cases corresponding to the hard and soft-margin SVMs must be handled separately, otherwise it will lead to inconsistent system of solutions.

## Using the SVM implementations for classification on some datasets

### The datasets

For each dataset, the 80-20 Validation on the dataset is used to

• First fit (train) the model on randomly selected 80% samples of the dataset.
• Predict (test) on the held-out (remaining 20%) of the dataset and compute accuracy.
• Different values of the hyper-parameter C and different kernels are used.
• For the polynomial kernel, polynomial of degree 3 is used and the RBF kernel with the standard deviation of 5 is used, although these hyper-parameters can be tuned too.

### Results

As can be seen from the results below,

• The points with blue circles are the support vectors.
• When the C value is low (close to hard-margin SVM), the model learnt tends to overfit the training data.
• When the C value is high (close to soft-margin SVM), the model learnt tends to be more generalizable (C acts as a regularizer).
• There are more support vectors required to define the decision surface for the hard-margin SVM than the soft-margin SVM for datasets not linearly separable.
• The linear (and sometimes polynomial) kernel performs pretty badly on the datasets that are not linearly separable.
• The decision boundaries are also shown.

# Some Reinforcement Learning: The Greedy and Explore-Exploit Algorithms for the Multi-Armed Bandit Framework in Python

In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

## The Problem Statement and Some Theory

Given a set of  actions with some unknown reward distributions,  maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately.  This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically  and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery.  Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

• The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
• The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
• The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas
the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.

The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

```
import numpy as np
import sys

### Interface
class Environment(object):

def reset(self):
raise NotImplementedError('Inheriting classes must override reset.')

def actions(self):
raise NotImplementedError('Inheriting classes must override actions.')

def step(self):
raise NotImplementedError('Inheriting classes must override step')

class ActionSpace(object):

def __init__(self, actions):
self.actions = actions
self.n = len(actions)

### BanditEnv Environment

class BanditEnv(Environment):

def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"):
super(BanditEnv, self).__init__()

self.action_space = ActionSpace(range(num_actions))
self.distribution = distribution

np.random.seed(evaluation_seed)

self.reward_parameters = None
if distribution == "bernoulli":
self.reward_parameters = np.random.rand(num_actions)
elif distribution == "normal":
self.reward_parameters = (np.random.randn(num_actions),
np.random.rand(num_actions))
elif distribution == "heavy-tail":
self.reward_parameters = np.random.rand(num_actions)
else:
print("Please use a supported reward distribution") #, flush = True)
sys.exit(0)

if distribution != "normal":
self.optimal_arm = np.argmax(self.reward_parameters)
else:
self.optimal_arm = np.argmax(self.reward_parameters[0])

def reset(self):
self.is_reset = True
return None

def compute_gap(self, action):
if self.distribution != "normal":
gap = np.absolute(self.reward_parameters[self.optimal_arm] -
self.reward_parameters[action])
else:
gap = np.absolute(self.reward_parameters[0][self.optimal_arm] -
self.reward_parameters[0][action])
return gap

def step(self, action):
self.is_reset = False

valid_action = True
if (action is None or action = self.action_space.n):
print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True)
reward = float("-inf")
gap = float("inf")
valid_action = False

if self.distribution == "bernoulli":
if valid_action:
reward = np.random.binomial(1, self.reward_parameters[action])
gap = self.reward_parameters[self.optimal_arm] -
self.reward_parameters[action]
elif self.distribution == "normal":
if valid_action:
reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn()
gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]
elif self.distribution == "heavy-tail":
if valid_action:
reward = self.reward_parameters[action] + np.random.standard_cauchy()
gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap
else:
print("Please use a supported reward distribution")#, flush = True)
sys.exit(0)

return(None, reward, self.is_reset, '')

#Policy interface
class Policy:
#num_actions: (int) Number of arms [indexed by 0 ... num_actions-1]
def __init__(self, num_actions):
self.num_actions = num_actions

def act(self):
pass

def feedback(self, action, reward):
pass

```

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.

• The next figure shows the reward distributions for a 5-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..4.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
2. Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
3. Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
4. In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.

## Round-Robin

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter ε) and the Optimistic-Greedy (with different values of the hyper-parameter R) algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Epsilon-Greedy algorithm behaves exactly like Greedy when ε = 0 and behaves randomly when ε = 1. For both of these cases, the ε-greedy algorithm has linear regret.
2.  ε = 0.1 and ε = 0.15 find the optimal arm 6 eventually and they have sub-linear regrets.
3. Optimistic-Greedy algorithm behaves exactly like Greedy when R = 0 and behaves randomly when R = 10000. For both of these cases, the ε-greedy algorithm has linear regret.
4. R = 3 and R = 5 find the optimal arm 6 eventually and they have sub-linear regrets(w.r.t. timesteps).

ε = 0

ε = 0.05

ε = 0.1

ε = 0.15

ε = 1.0

## Optimistic Greedy

R = 0

R = 1

R = 3

R = 5

R = 10000

The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.

• The next figure shows the reward distributions for a 10-armed bandit framework.
• If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
• Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i,  so we have R_i ~ B(p_i), i=0..9.
• We first generate the parameters for the distribution corresponding to each arm randomly.
• As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
• Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values, we  implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

1. Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
2. Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
3. For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
4. Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
5. The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.