In this article the multi-armed bandit framework problem and a few algorithms to solve the problem is going to be discussed. This problem appeared as a lab assignment in the edX course DAT257x: Reinforcement Learning Explained by Microsoft. The problem description is taken from the assignment itself.

## The Problem Statement and Some Theory

Given a set of actions with some unknown reward distributions, maximize the cumulative reward by taking the actions sequentially, one action at each time step and obtaining a reward immediately. This is the traditional explore-exploit problem in reinforcement learning. In order to find the optimal action, one needs to explore all the actions but not too much. At the same time, one needs to exploit the best action found so-far by exploring.

The following figure defines the problem mathematically and shows the exploration-exploitation dilemma in a general setting of the reinforcement learning problems with sequential decision with incomplete information.

The following figure shows a motivating application of the multi-armed bandit problem in drug discovery. Given a set of experimental drugs (each of which can be considered an arm in the bandit framework) to be applied on a set of patients sequentially, with a reward 1 if a patient survives after the application of the drug and 0 if he dies, the goal is to save as many patients as we can.

The following figures show the naive and a few variants of the greedy algorithms for maximizing the cumulative rewards.

- The Naive Round-Robin algorithm basically chooses every action once to complete a round and repeats the rounds. Obviously it’s far from optimal since it explores too much and exploits little.
- The greedy algorithm tries to choose the arm that has maximum average reward, with the drawback that it may lock-on to a sub-optimal action forever.
- The epsilon greedy and optimistic greedy algorithms are variants of the greedy algorithm that try to recover from the drawback of the greedy algorithm. Epsilon-greedy chooses an action uniformly at random with probability epsilon, whereas

the optimistic greedy algorithm initialized the estimated reward for each action to a high value, in order to prevent locking to a sub-optimal action.

The following code from the github repository of the same course shows how the basic bandit Framework can be defined:

import numpy as np import sys ### Interface class Environment(object): def reset(self): raise NotImplementedError('Inheriting classes must override reset.') def actions(self): raise NotImplementedError('Inheriting classes must override actions.') def step(self): raise NotImplementedError('Inheriting classes must override step') class ActionSpace(object): def __init__(self, actions): self.actions = actions self.n = len(actions) ### BanditEnv Environment class BanditEnv(Environment): def __init__(self, num_actions = 10, distribution = "bernoulli", evaluation_seed="387"): super(BanditEnv, self).__init__() self.action_space = ActionSpace(range(num_actions)) self.distribution = distribution np.random.seed(evaluation_seed) self.reward_parameters = None if distribution == "bernoulli": self.reward_parameters = np.random.rand(num_actions) elif distribution == "normal": self.reward_parameters = (np.random.randn(num_actions), np.random.rand(num_actions)) elif distribution == "heavy-tail": self.reward_parameters = np.random.rand(num_actions) else: print("Please use a supported reward distribution") #, flush = True) sys.exit(0) if distribution != "normal": self.optimal_arm = np.argmax(self.reward_parameters) else: self.optimal_arm = np.argmax(self.reward_parameters[0]) def reset(self): self.is_reset = True return None def compute_gap(self, action): if self.distribution != "normal": gap = np.absolute(self.reward_parameters[self.optimal_arm] - self.reward_parameters[action]) else: gap = np.absolute(self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action]) return gap def step(self, action): self.is_reset = False valid_action = True if (action is None or action = self.action_space.n): print("Algorithm chose an invalid action; reset reward to -inf")#, flush = True) reward = float("-inf") gap = float("inf") valid_action = False if self.distribution == "bernoulli": if valid_action: reward = np.random.binomial(1, self.reward_parameters[action]) gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] elif self.distribution == "normal": if valid_action: reward = self.reward_parameters[0][action] + self.reward_parameters[1][action] * np.random.randn() gap = self.reward_parameters[0][self.optimal_arm] - self.reward_parameters[0][action] elif self.distribution == "heavy-tail": if valid_action: reward = self.reward_parameters[action] + np.random.standard_cauchy() gap = self.reward_parameters[self.optimal_arm] - self.reward_parameters[action] #HACK to compute expected gap else: print("Please use a supported reward distribution")#, flush = True) sys.exit(0) return(None, reward, self.is_reset, '') #Policy interface class Policy: #num_actions: (int) Number of arms [indexed by 0 ... num_actions-1] def __init__(self, num_actions): self.num_actions = num_actions def act(self): pass def feedback(self, action, reward): pass

Now in order to implement an algorithm we need to just extend (inherit from) the Policy base class (interface) and implement the functions act() and feedback() for that algorithm (policy).

In order to theoretically analyze the greedy algorithms and find algorithms that have better performance guarantees, let’s define regret as the gap in between the total expected reward with the action chosen by the optimal policy and the cumulative reward with a set of actions chosen by any algorithm (assuming that the reward distributions are known), as shown in the following figure. Hence, maximizing cumulative reward is equivalent to minimizing the regret.

Given that greedy exploits too much an epsilon greedy explores too much, it can be shown that all the greedy variants have regrets linear in the number of timesteps T.

Also, it was theoretically proven by Lai and Robbins,that the lower bound on the regret is logarithmic in the number of timesteps T.

- The next figure shows the reward distributions for a 5-armed bandit framework.
- If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
- Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i, so we have R_i ~ B(p_i), i=0..4.
- We first generate the parameters for the distribution corresponding to each arm randomly.
- As we can see, the arm 2 has the highest p_i, so if we choose arm 2, we have the highest probability to get a reward of 1 at any timestep.
- Obviously, the arm 2 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Now, let’s assume that we don’t know p_i values and we use the Greedy and the Naive Round-Robin algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

- Greedy algorithm locks into the arm/action 0 and can’t find the optimal action.
- Round robin algorithm chooses the actions uniformly and can’t find the optimal action.
- Both Greedy and Round-Robin has linear regrets (w.r.t. timesteps).
- In this case the greedy algorithm even with sub-optimal action still performs relatively better than the round-robin.

## Greedy

## Round-Robin

- The next figure shows the reward distributions for a 10-armed bandit framework.
- If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
- Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i, so we have R_i ~ B(p_i), i=0..9.
- We first generate the parameters for the distribution corresponding to each arm randomly.
- As we can see, the arm 6 has the highest p_i, so if we choose arm 6, we have the highest probability to get a reward of 1 at any timestep.
- Obviously, the arm 6 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values and we use the Epsilon-Greedy (with different values of the hyper-parameter * ε*) and the Optimistic-Greedy (with different values of the hyper-parameter

*) algorithms to maximize the cumulative rewards over 10000 timesteps.*

**R**As can be seen from the next animations and figures

- Epsilon-Greedy algorithm behaves exactly like Greedy when
=**ε****0**and behaves randomly when=**ε****1**. For both of these cases, thegreedy algorithm has linear regret.**ε-** -
**ε****= 0.1**and*ε*= 0.15 - Optimistic-Greedy algorithm behaves exactly like Greedy when
=**R****0**and behaves randomly when=**R****10000**. For both of these cases, thegreedy algorithm has linear regret.**ε-** **R****= 3**and*R*= 5

## Epsilon-Greedy

* ε* = 0

* ε* = 0.05

* ε* = 0.1

* ε* = 0.15

* ε* = 1.0

## Optimistic Greedy

**R = 0**

**R = 1**

**R = 3**

**R = 5**

**R = 10000**

The next figure shows two algorithms (UCB and Bayesian Thompson-Beta Posterior Sampling) that achieve logarithmic regret.

- The next figure shows the reward distributions for a 10-armed bandit framework.
- If we take an action A_i, with probability p_i we shall get a reward of 1 and with probability 1-p_i we shall get a reward of 0.
- Hence, each arm A_i has reward that is distributed as a Bernoulli random variable with some parameter p_i, so we have R_i ~ B(p_i), i=0..9.
- We first generate the parameters for the distribution corresponding to each arm randomly.
- As we can see, the arm 4 has the highest p_i, so if we choose arm 4, we have the highest probability to get a reward of 1 at any timestep.
- Obviously, the arm 4 is the arm/action that gives the most promising reward and any optimal policy should choose that arm at all timesteps.

Again, let’s assume that we don’t know p_i values, we implement and use the UCB1 and Thompson-Beta algorithms to maximize the cumulative rewards over 10000 timesteps.

As can be seen from the next animations and figures

- Both the algorithms find the optimal arm 4 pretty quickly without much of exploration.
- Both the algorithms achieve logarithmic regret when the (unknown) reward distribution is Bernoulli.
- For posterior sampling with Thompson Beta, each arm’s reward is sampled from the posterior β distribution from the same exponential family with the unknown Bernoulli distributed rewards, starting with the non-informative flat prior.
- Posterior R_i ~ β(a_i, b_i) where a_i is the number of times we obtained a reward 0 and b_i is the number of times we obtained a reward 1, when the arm A_i was drawn, as shown in the figure.
- The Thompson sampling gives linear regret when the (unknown) reward distribution is normal.