Multi-armed bandit

Multi-armed bandit#

From Full RL to Bandit Problems#

Remember the difference between reinforcement learning and unsupervised learning.
Reinforcement learning uses training information to evaluate the actions.
To really understand the RL we need to understand its evaluative aspect.

Why start with bandits?

Multi-armed bandits are a simplified version of the full reinforcement learning problem
They help us focus on one key aspect: the exploration-exploitation trade-off
Understanding bandits provides the foundation for more complex RL scenarios

How bandits relate to full RL:

Bandit: Single state, multiple actions, immediate rewards
Full RL: Multiple states, multiple actions, delayed rewards
Key insight: Every state in full RL can be viewed as a separate bandit problem

Activity

Think of a video game character at a crossroads. How is choosing which path to take similar to a bandit problem? How is it different from the full game?

We will consider the \(k\)-armed bandit problem.

What is the \(k\)-armed bandit problem?#

../../_images/multi-armed-bandit-slots.png

An agent has \(k\) different actions.
After taking an action, the agent receive a reward.
The rewards are chosen from a stationary probability distribution based on the action selected.
The objective of the agent is to maximize the expected total reward.

Activity

Explain why the probability distributions are stationary.
Why do we say expected total reward?

More formally, each action has a value.
We denote the action selected on a time step \(t\) as \(A_t\).
We denote the corresponding reward as \(R_t\).
We denote \(q_*(a)\) the expected reward of action \(a\):

\[ q_*(a) = \mathbb{E}\left[ R_t|A_t=a \right]\]

However, the estimated value of \(a\) at time \(t\) is denoted \(Q_t(a)\).

../../_images/multi-armed-bandit-slots-est-value.drawio.png

Activity

Why do we have an estimated value?

When you have \(Q_t(a)\) for all the actions:
- Choosing \(a\) maximizing the reward is called greedy action selection.
  - It’s called exploitation.
  - It does not lead to the best expected value.
- Choosing another action is called exploration.
  - It is necessary to update the estimations.
We will see how we can balance exploration and exploitation.

Action-value methods#

Now, we need to estimate the actions.
Then using these estimates to select the next action.
The methods to do that are called action-value methods.

Estimating values#

Activity

Consider the following multi-armed bandit problem.

../../_images/multi-armed-bandit-simplified.png

Let us simulate a few actions and results:

Based on these results:
- Decide which button (action) you would push.
- Justify why.

One way to estimate the expected total reward is averaging the rewards actually received:

\[ Q_t(a) = \frac{\sum^{t-1}_{i=1}R_i\times \mathbb{1}_{A_t=a}}{\sum^{t-1}_{i=1}\mathbb{1}_{A_t=a}}\]

Example 1

If we take back the previous activity and calculate the \(Q\)-values we obtain the following.

Selecting the action#

Once the values are estimated, we need to have a strategy to select the action.
Keeping in mind we want to explore and exploit.
A good way is to mix greedy action and random actions.
- We will select the greedy action \(1-\epsilon\) of the time.
- Then a random action \(\epsilon\) of th time.
It is the \(\epsilon\)-greedy method.

Important

It guarantees that it will converge, but does not give any guarantee on the performance.

Activity

Consider the following two actions with their values:
- \(a_1\), \(Q(a_1)=10\)
- \(a_2\), \(Q(a_2)=1\)
If you apply the \(\epsilon\)-greedy action selection with \(\epsilon=0.5\), what is the probability of choosing \(a_1\)?

Incremental implementation#

We need to execute each action a very large number of times.
So, it is important to investigate how the reward can be computed efficiently.
Consider a problem with a single action:
- Let \(R_i\) the reward received after the \(i\text{th}\) action.
- Let \(Q_n\) denote the estimate of this action after being selected \(n-1\) times.
\(Q_n\) can be written:

\[ Q_n = \frac{R_1 + \dots + R_{n-1}}{n-1}\]

Activity

How would you implement that?
What happens in terms of memory consumption and computation time?

The most efficient way to implement that is to use an increment formula for updating the average.
Given \(Q_n, R_n\):

\[\begin{split} \begin{aligned} Q_{n+1} &= \frac{1}{n}\sum^n_{i=1}R_i\\ &= Q_n + \frac{1}{n}\left[ R_n - Q_n \right] \end{aligned}\end{split}\]

With this we can calculate in constant time and memory.
This update function is a key of reinforcement learning.
We find this form a lot:

\[NewEstimate \leftarrow OldEstimate + StepSize\left[Target-OldEstimate\right]\]

Activity

What represents \([Target - OldEstimate]\)?

Note

\(StepSize\) is not the step number but the size of the step, in our case it’s \(\frac{1}{n}\).
In reinforcement learning, it’s often \(\alpha\).

Algorithm 1 (\(k\)-armed Algorithm)

\( \begin{array}{l} \textbf{Input}\text{: A multi-armed bandit}\\ \textbf{Initialize}: \\ \textbf{For}\ a = 1\ \text{to}\ k:\\ \quad\quad Q(a) \leftarrow 0\\ \quad\quad N(a) \leftarrow 0\\ \textbf{Loop forever}:\\ \quad\quad A \leftarrow \begin{cases} \arg\max_a Q(a) \quad \text{with probability } 1-\epsilon \\ \text{random action} \quad \text{with probability } \epsilon \end{cases}\\ \quad\quad R \leftarrow N(a) + 1\\ \quad\quad Q(a) \leftarrow Q(a) + \frac{1}{N(a)} \left[ R - Q(a) \right]\\ \end{array} \)

Multi-armed bandit

Contents

Multi-armed bandit#

From Full RL to Bandit Problems#

What is the \(k\)-armed bandit problem?#

Action-value methods#

Estimating values#

Selecting the action#

Incremental implementation#