5. Policies and Value Functions
In the previous topic, we formally introduced Markov Decision Processes.
The objective of an MDP is to maximize the cumulative reward.
We will see how this can be done and how it is linked to action selection.
5.1. Cumulative rewards
Remember that after a few time steps we obtain a trajectory:
\[s_t, a_t, s_{t+1}, a_{t+1}, \dots, s_T\]
Similarly, we also obtain a sequence of rewards associated with this trajectory:
\[r_t, r_{t+1}, \dots, r_T\]
As the objective is to maximize the cumulative rewards, we need to consider which part of this sequence we want to maximize.
Remember that for the same action in the same state, the next state is not certain, which complicates things.
This is why, in reinforcement learning, we try to maximize the expected return.
The return is denoted \(G_t\) and can be defined as the sum of the rewards:
\[G_t = r_{t} + r_{t+1} + \dots + r_T\]where \(T\) is the final step.
As you can see, this is well defined when \(T\) is finite.
Example
You receive a reward of \(10\) three times.
\(G_t\) is the sum of these rewards, so \(G_t = 10 + 10 + 10 = 30\).
Activity
What happens if \(T=\infty\), meaning the problem never ends?
Due to this issue, the notion of discounted return was introduced:
\[G_t = r_{t} + \gamma r_{t+1} + \gamma^2 r_{t+2} + \dots = \sum_{k=0}^\infty \gamma^k r_{t+k}\]where \(\gamma\) is a parameter \(0\leq\gamma\leq 1\) called the discount rate.
The discount rate determines how important future rewards are compared to immediate rewards.
A reward received \(k\) steps in the future is worth only \(\gamma^k\) times what it would be worth if it were received immediately.
If \(\gamma = 0\), the agent only considers the immediate reward and is therefore greedy.
When \(\gamma\) approaches \(1\), the agent becomes farsighted.
Example
You receive a reward of \(10\) three times, but you apply a discount rate \(\gamma = 0.5\).
The discounted return is \(G_t = 10 + 0.5\times 10 + 0.5^2\times 10 = 17.5\).
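As a small illustration, here is a minimal Python sketch (the function name `discounted_return` is our own, not from any library) that computes \(G_t\) from a list of rewards; with \(\gamma = 1\) it reproduces the undiscounted sum of the first example, and with \(\gamma = 0.5\) the value above.

```python
def discounted_return(rewards, gamma):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ... for a finite list."""
    g = 0.0
    # Work backwards so that G_t = r_t + gamma * G_{t+1} at every step.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_return([10, 10, 10], gamma=1.0))  # 30.0: the undiscounted sum
print(discounted_return([10, 10, 10], gamma=0.5))  # 17.5: the discounted return
```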
Important
If \(\gamma < 1\), the infinite sum has a finite value as long as the reward sequence is bounded.
This is a crucial property for reinforcement learning.
It also works well because the returns at successive time steps are related to each other:
\[G_t = r_{t} + \gamma G_{t+1}\]
Note
Even if \(T=\infty\) and the reward is nonzero and constant, the return is still finite as long as \(\gamma < 1\).
For example, if the reward is a constant \(+1\), then the return is:
\[G_t = \sum_{k=0}^\infty \gamma^k = \frac{1}{1-\gamma}\]
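A quick numerical check of this closed form (the value of \(\gamma\) and the truncation horizon below are arbitrary choices for illustration):

```python
gamma = 0.9
# Truncated sum of gamma^k for k = 0..9999 approximates the infinite return.
approx = sum(gamma ** k for k in range(10_000))
print(approx)           # ~10.0
print(1 / (1 - gamma))  # 10.0, the closed form 1 / (1 - gamma)
```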
5.2. Policies and Value Functions
We know how to calculate the expected return of a sequence of rewards.
Now, we need to link that to our trajectory and, more precisely, to decision-making.
The agent is supposed to learn which action to take depending on the state.
To do that, reinforcement learning algorithms implement a value function, which estimates the expected return of being in a specific state.
Value functions are closely linked to policies.
Definition: Policy
A policy is a mapping from states to probabilities of each possible action.
We denote a policy \(\pi\).
The probability that an agent following a policy \(\pi\) selects action \(a\) in state \(s\) is denoted \(\pi(a|s)\).
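For instance, a tabular policy can be stored as a mapping from each state to a probability distribution over actions. In the sketch below, the state names, actions, and probabilities are all made up for illustration:

```python
import random

# pi(a|s): for each state, a dictionary of action -> probability (summing to 1).
policy = {
    "s0": {"left": 0.7, "right": 0.3},
    "s1": {"left": 0.1, "right": 0.9},
}

def sample_action(pi, state):
    """Draw an action a with probability pi(a | state)."""
    actions, probs = zip(*pi[state].items())
    return random.choices(actions, weights=probs)[0]

print(sample_action(policy, "s0"))  # 'left' roughly 70% of the time
```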
Definition: Value function
The value function of a state \(s\) under a policy \(\pi\) is the expected return when starting in \(s\) and following \(\pi\) thereafter.
In MDPs, \(v_\pi\) is defined as:
\[v_\pi(s) = \mathbb{E}_\pi\left[G_t | s_t=s \right] = \mathbb{E}_\pi\left[ \sum_{k=0}^\infty\gamma^k r_{t+k} | s_t = s \right]\]Note that we only used the definition of \(G_t\).
Now we can estimate the value of a state.
However, we also need to estimate the value of a specific action in a state.
This is given by the q-value function:
\[q_\pi(s,a) = \mathbb{E}_\pi\left[G_t | s_t=s , a_t = a\right] = \mathbb{E}_\pi\left[ \sum_{k=0}^\infty\gamma^k r_{t+k} | s_t = s, a_t = a \right]\]
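Both expectations can be approximated by averaging sampled returns (Monte Carlo estimation). The sketch below does this on a tiny made-up two-state MDP; the states, transitions, rewards, policy, and truncation horizon are all invented for illustration:

```python
import random

# Tiny made-up MDP: next-state distributions p(s'|s,a) and rewards r(s,a).
P = {("A", "go"): [("A", 0.5), ("B", 0.5)], ("A", "stay"): [("A", 1.0)],
     ("B", "go"): [("A", 1.0)], ("B", "stay"): [("B", 1.0)]}
R = {("A", "go"): 0.0, ("A", "stay"): 1.0, ("B", "go"): 2.0, ("B", "stay"): 0.0}
pi = {"A": {"go": 0.5, "stay": 0.5}, "B": {"go": 1.0}}
gamma = 0.9

def step(s, a):
    """Sample the reward and the next state for taking action a in state s."""
    nxt, probs = zip(*P[(s, a)])
    return R[(s, a)], random.choices(nxt, weights=probs)[0]

def sampled_return(s, horizon=200):
    """One sample of G_t obtained by following pi from state s (truncated)."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        r, s = step(s, a)
        g += discount * r
        discount *= gamma
    return g

def mc_v(s, n=2000):
    """Monte Carlo estimate of v_pi(s) = E[G_t | s_t = s]."""
    return sum(sampled_return(s) for _ in range(n)) / n

def mc_q(s, a, n=2000):
    """Monte Carlo estimate of q_pi(s,a): take action a first, then follow pi."""
    return sum(R[(s, a)] + gamma * sampled_return(step(s, a)[1]) for _ in range(n)) / n

print(mc_v("A"), mc_q("A", "stay"))
```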
Lastly, value functions are also recursive:
\[\begin{split}\begin{aligned} v_\pi(s) &= \mathbb{E}_\pi\left[G_t | s_t=s \right]\\ &= \mathbb{E}_\pi\left[r_{t} + \gamma G_{t+1}| s_t = s\right]\\ &= \sum_a \pi(a|s)\sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma\mathbb{E}_\pi\left[G_{t+1} | s_{t+1}=s' \right] \right]\\ &= \sum_a \pi(a|s)\sum_{s'}p(s'|s,a)\left[r(s,a) + \gamma v_\pi(s')\right]\\ \end{aligned}\end{split}\]This is the Bellman equation for \(v_\pi\).
Remember your course on dynamic programming.
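As a rough sketch of how this recursion can be exploited, the code below performs iterative policy evaluation: it repeatedly applies the Bellman equation to a tiny made-up MDP (the same kind of invented example as above) until the values stop changing.

```python
# Tiny made-up MDP: transition probabilities p(s'|s,a) and rewards r(s,a).
P = {("A", "go"):   {"A": 0.5, "B": 0.5},
     ("A", "stay"): {"A": 1.0},
     ("B", "go"):   {"A": 1.0},
     ("B", "stay"): {"B": 1.0}}
R = {("A", "go"): 0.0, ("A", "stay"): 1.0, ("B", "go"): 2.0, ("B", "stay"): 0.0}
pi = {"A": {"go": 0.5, "stay": 0.5}, "B": {"go": 1.0}}
gamma, states = 0.9, ["A", "B"]

# Iterative policy evaluation: apply the Bellman equation until values stabilise.
v = {s: 0.0 for s in states}
for _ in range(1000):
    new_v = {}
    for s in states:
        # v(s) = sum_a pi(a|s) sum_s' p(s'|s,a) [ r(s,a) + gamma v(s') ]
        new_v[s] = sum(
            p_a * sum(p_s2 * (R[(s, a)] + gamma * v[s2])
                      for s2, p_s2 in P[(s, a)].items())
            for a, p_a in pi[s].items()
        )
    delta = max(abs(new_v[s] - v[s]) for s in states)
    v = new_v
    if delta < 1e-10:
        break

print(v)  # approximate v_pi(s) for each state
```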
5.3. Optimal Policies and Optimal Value Functions
A policy \(\pi\) is just a mapping function.
Computing a policy does not guarantee that it is optimal for solving the problem.
Activity
Propose a method to calculate the optimal policy.
Intuitively, we would like to compare two policies \(\pi\) and \(\pi'\) and keep the better one.
Concretely, we want to define an ordering \(\pi \geq \pi'\).
Remember that a policy \(\pi\) has an associated value function \(v_\pi\).
We can compare two policies by checking their expected return in each state \(s \in S\).
The policy \(\pi\) is considered better than or equal to \(\pi'\) if \(\forall s\in S,\ v_\pi(s)\geq v_{\pi'}(s)\).
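A small sketch of this comparison, assuming the two value functions have already been computed and stored as dictionaries keyed by state (the numbers below are made up):

```python
def better_or_equal(v_pi, v_pi_prime, states):
    """pi >= pi' if and only if v_pi(s) >= v_pi'(s) for every state s."""
    return all(v_pi[s] >= v_pi_prime[s] for s in states)

# Hypothetical, precomputed state values for two policies.
print(better_or_equal({"A": 5.0, "B": 3.0}, {"A": 4.5, "B": 3.0}, ["A", "B"]))  # True
```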
Important
There is always at least one policy that is better than or equal to all other policies.
The optimal policy is denoted \(\pi^*\).
The optimal policy has an associated optimal value function.
The optimal value function is denoted \(v^*\) and defined as:
\[v^*(s) = \max_\pi v_\pi(s) \quad \forall s \in S\]
Activity
Discuss if for the same MDP we can find more than one optimal policy.
Because there is an optimal value function, there is also an optimal q-value function:
\[q^*(s,a) = \max_\pi q_\pi(s,a) \quad \forall s \in S,\ a \in A\]
We can rewrite the optimal value function as a Bellman equation, without reference to any specific policy:
\[v^*(s) = \max_a \sum_{s'}p(s'|s,a)\left[r(s,a) + \gamma v^*(s')\right]\]
Activity
What can you conclude from this equation?
To calculate the optimal policy, we “just” need to calculate the optimal value function.
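One classic way to do this is value iteration, sketched below on a tiny made-up MDP: it repeatedly applies the Bellman optimality equation and then reads off a greedy policy with respect to the resulting \(v^*\). The MDP, discount rate, and tolerance are invented for illustration.

```python
# Tiny made-up MDP: transition probabilities p(s'|s,a) and rewards r(s,a).
P = {("A", "go"):   {"A": 0.5, "B": 0.5},
     ("A", "stay"): {"A": 1.0},
     ("B", "go"):   {"A": 1.0},
     ("B", "stay"): {"B": 1.0}}
R = {("A", "go"): 0.0, ("A", "stay"): 1.0, ("B", "go"): 2.0, ("B", "stay"): 0.0}
gamma, states, actions = 0.9, ["A", "B"], ["go", "stay"]

def backup(s, a, v):
    """Expected value of taking a in s: sum_s' p(s'|s,a) [ r(s,a) + gamma v(s') ]."""
    return sum(p * (R[(s, a)] + gamma * v[s2]) for s2, p in P[(s, a)].items())

# Value iteration: v(s) <- max_a backup(s, a, v) until convergence.
v = {s: 0.0 for s in states}
for _ in range(1000):
    new_v = {s: max(backup(s, a, v) for a in actions) for s in states}
    delta = max(abs(new_v[s] - v[s]) for s in states)
    v = new_v
    if delta < 1e-10:
        break

# The optimal policy is greedy with respect to the optimal value function.
pi_star = {s: max(actions, key=lambda a: backup(s, a, v)) for s in states}
print(v, pi_star)
```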