3. Multi-armed bandit - Action-selection methods

  • In the previous topic we saw an algorithm to update the action-value.

  • The algorithm updates the action-value by averaging the rewards.

  • This method considers that every reward has the same weight.

  • Another popular approach in reinforcement learning is to give more weight to the more recent rewards.

3.1. Constant step-size parameter

  • If we take the previous update formula and replace the step size \(\frac{1}{n}\) with a constant parameter \(\alpha\), we obtain:

    \[Q_{n+1} = Q_n + \alpha\left[ R_n - Q_n \right]\]

    where the step-size parameter \(\alpha \in (0,1]\) is constant.

  • \(Q_{n+1}\) is called a weighted average of the past rewards and the initial estimate \(Q_1\):

\[\begin{split}\begin{aligned} Q_{n+1} &= Q_{n} + \alpha \left[ R_n - Q_n \right]\\ &= \alpha R_n + (1-\alpha) Q_n\\ &= \alpha R_n + (1-\alpha)\left[ \alpha R_{n-1} + (1-\alpha) Q_{n-1} \right]\\ &\;\;\vdots\\ &= (1-\alpha)^n Q_1 + \sum_{i=1}^n \alpha(1-\alpha)^{n-i}R_i\\ \end{aligned}\end{split}\]

Note

It is called a weighted average because the sum of the weights is:

\[(1-\alpha)^n+\sum_{i=1}^n\alpha(1-\alpha)^{n-i} = 1\]
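
As a quick sanity check, here is a small Python sketch (our own illustration, with made-up reward values) showing that the incremental constant step-size update and the closed-form weighted average above agree:

```python
import random

alpha = 0.1
Q1 = 2.0                                   # hypothetical initial estimate
rewards = [random.gauss(1.0, 1.0) for _ in range(50)]

# Incremental form: Q_{n+1} = Q_n + alpha * (R_n - Q_n)
Q = Q1
for r in rewards:
    Q += alpha * (r - Q)

# Closed form: (1-alpha)^n * Q_1 + sum_{i=1}^n alpha * (1-alpha)^(n-i) * R_i
n = len(rewards)
closed = (1 - alpha) ** n * Q1 + sum(
    alpha * (1 - alpha) ** (n - i) * r for i, r in enumerate(rewards, start=1)
)

print(abs(Q - closed) < 1e-9)              # True: the two forms agree
```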

3.2. Action-value initialization

  • We have seen that the algorithm is initialized with some default action-value estimate \(Q_1(a)\).

  • Both methods are statistically biased by this initial estimate.

  • For the sample average, the bias disappears once all actions have been selected at least once.

  • However, for the weighted average, the bias is permanent, even though it decreases over time.
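
To see this concretely, here is a minimal sketch (our own example, with arbitrary rewards and two different initial estimates) comparing how the two methods treat \(Q_1\):

```python
import random

random.seed(0)
rewards = [random.gauss(1.0, 1.0) for _ in range(50)]

def sample_average(rewards, q1):
    # Q_{n+1} = Q_n + (1/n)(R_n - Q_n): the first update replaces Q_1 entirely.
    q = q1
    for n, r in enumerate(rewards, start=1):
        q += (r - q) / n
    return q

def weighted_average(rewards, q1, alpha=0.1):
    # Q_{n+1} = Q_n + alpha(R_n - Q_n): Q_1 keeps a weight of (1-alpha)^n.
    q = q1
    for r in rewards:
        q += alpha * (r - q)
    return q

# Same rewards, two very different initial estimates.
print(sample_average(rewards, 0.0) == sample_average(rewards, 10.0))      # True: no bias left
print(weighted_average(rewards, 0.0) == weighted_average(rewards, 10.0))  # False: the bias remains
```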

Example

  • We can compare the sample average and the weighted average.

  • We will initialize the weighted average of action 4 to something neutral like 0.

[Figure mab-example-weighted-avg.png: sample-average vs. weighted-average estimates]

  • We can see that the sample-average value and the weighted-average value are not the same.

  • This bias cannot be overcome by choosing a different initial value.

  • Setting all the initial action values to a value higher than the rewards we expect makes them optimistic.

  • After trying one action:

    • The algorithm will be disappointed by the received reward (compared to the estimate).

    • This disappointment will push it to select another action.

  • Because all the action values start over-optimistic, the algorithm will try all of them multiple times (see the sketch below).
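
The following toy sketch (a hypothetical 3-armed bandit of our own making) illustrates this effect: even a purely greedy agent, with no \(\epsilon\), is forced to try every arm:

```python
import random

random.seed(1)
true_means = [0.2, 0.5, 0.8]   # hypothetical arm means, all far below the estimate
Q = [5.0, 5.0, 5.0]            # optimistic initial action values
N = [0, 0, 0]

for t in range(30):
    a = Q.index(max(Q))        # purely greedy selection, no epsilon
    r = random.gauss(true_means[a], 1.0)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]  # sample-average update

print(N)  # every arm has been tried: each "disappointment" moves the agent on
```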

3.3. Upper-Confidence-Bound action-selection

  • Let’s think about the action-selection methods we have seen so far.

  • Greedy action-selection:

    • Always choose the action that has the highest expected reward.

    • It doesn’t explore.

  • \(\epsilon\)-greedy action-selection:

    • It forces the selection of a non-greedy action with probability \(\epsilon\).

    • It has no preference among the non-greedy actions.

Activity

  • What is the main drawback of \(\epsilon\)-greedy?

  • It is possible to improve the action-selection by adding a few parameters to the selection algorithm.

  • The idea is to base the exploration on the uncertainty of the estimate.

    • A high uncertainty means that we didn’t select the action often compared to the total number of selections.

    • If we don’t have a good estimate, we should explore.

    • We call this the Upper-Confidence-Bound (UCB) action-selection.

  • The action-selection formula becomes:

    \[A_t = \arg \max_a\left[ Q_t(a)+c\sqrt{\frac{\ln t}{N_t(a)}} \right]\]

    where

    • \(\ln t\) is the natural logarithm of \(t\),

    • \(N_t(a)\) is the number of times \(a\) has been selected,

    • and \(c>0\) the degree of exploration.
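
A possible implementation of this rule (our own sketch, not course code). Note that \(N_t(a)=0\) would divide by zero, so untried actions are given absolute priority:

```python
import math

def ucb_action(Q, N, t, c=2.0):
    """Return argmax_a [ Q(a) + c * sqrt(ln t / N(a)) ].

    Q: value estimates, N: selection counts, t: current time step (t >= 1).
    """
    best_a, best_score = 0, float("-inf")
    for a in range(len(Q)):
        if N[a] == 0:
            return a           # treat an untried action as maximally uncertain
        score = Q[a] + c * math.sqrt(math.log(t) / N[a])
        if score > best_score:
            best_a, best_score = a, score
    return best_a

# Action 1 has a lower estimate but was tried far less often, so UCB picks it.
print(ucb_action(Q=[1.0, 0.8], N=[10, 2], t=12))  # -> 1
```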

Activity

  • What happens when \(N_t(a)\) increases?

  • What happens when \(\ln t\) increases?

  • When our uncertainty about the estimate decreases, the action is selected less often.

[Figure ucb.drawio.png: illustration of UCB action-selection]

  • The algorithm will select the action with the highest upper bound even if the uncertainty is very high.

  • If multiple actions have very close upper bounds, the selection will depend on the exploration coefficient \(c\).

Activity

  • Using our previous samples:

[Figure mab-example.png: the samples from the previous example]

  • Calculate each \(Q\) value using the weighted average.

  • Use UCB with \(c=2\) to decide what will be the next action.

3.4. Gradient Bandit Algorithms

  • All previous methods have one thing in common.

  • They all base the action-selection on value estimates computed from the received rewards.

  • We will discuss another technique based on a numerical preference for each action \(a\), denoted \(H_t(a)\).

  • The higher the preference, the more often the action is selected.

  • We use this preference in a softmax distribution:

    \[P(A_t=a)=\frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}=\pi_t(a)\]

    where \(\pi_t(a)\) is the probability of selecting \(a\) at time \(t\).

  • Now, we need to calculate the preference \(H_t(a)\).

    • At the beginning \(H_1(a) =0, \forall a \in A\).

    • We use stochastic gradient ascent.

      • After each action \(A_t\), the preferences are updated:

      \[\begin{split}\begin{aligned} H_{t+1}(A_t) &= H_t(A_t) + \alpha(R_t - \bar{R}_t)(1-\pi_t(A_t))\text{, and}\\ H_{t+1}(a) &= H_t(a) - \alpha(R_t - \bar{R}_t)\pi_t(a)\text{, for all } a \neq A_t \end{aligned}\end{split}\]

      where \(\alpha >0\) is the step-size parameter, \(\bar{R}_t \in \mathbb{R}\) is the average reward not including time \(t\).

      • We often refer to \(\bar{R}_t\) as the baseline.
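
To make these updates concrete, here is a small Python sketch (our own illustration: the two-armed bandit, its reward means, and the step-size are made up; the baseline is tracked as a running average of past rewards):

```python
import math
import random

def softmax(H):
    # pi_t(a) = exp(H(a)) / sum_b exp(H(b)); shift by max(H) for numerical stability
    m = max(H)
    exps = [math.exp(h - m) for h in H]
    total = sum(exps)
    return [e / total for e in exps]

def update_preferences(H, a, reward, baseline, alpha=0.1):
    # H_{t+1}(A_t) = H_t(A_t) + alpha (R_t - baseline)(1 - pi_t(A_t))
    # H_{t+1}(b)   = H_t(b)   - alpha (R_t - baseline) pi_t(b), for b != A_t
    pi = softmax(H)
    for b in range(len(H)):
        if b == a:
            H[b] += alpha * (reward - baseline) * (1 - pi[b])
        else:
            H[b] -= alpha * (reward - baseline) * pi[b]

random.seed(0)
true_means = [0.1, 0.9]     # hypothetical arm means
H = [0.0, 0.0]              # H_1(a) = 0 for all a
baseline, t = 0.0, 0

for _ in range(500):
    a = random.choices(range(len(H)), weights=softmax(H))[0]
    r = random.gauss(true_means[a], 1.0)
    update_preferences(H, a, r, baseline)   # baseline excludes the current reward
    t += 1
    baseline += (r - baseline) / t          # running average of all rewards so far

print(softmax(H))  # the probability mass should concentrate on arm 1
```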

Activity

  • What happens if the reward is greater than the baseline?

  • What happens if the reward is less than the baseline?