8. Temporal-Difference Learning

  • We have covered the fundamentals needed to understand reinforcement learning.

  • One of the central methods of reinforcement learning is Temporal-Difference (TD) learning.

  • TD learning combines ideas from:

    • Dynamic programming

    • Monte Carlo methods.

8.1. TD Prediction

Activity

  • The following is the update done by Monte Carlo methods:

../../_images/MC_update.png

  • What is the disadvantage?

  • TD works differently.

    • TD forms a target at time \(t+1\).

    • We use the reward \(r_{t+1}\) and the current estimate \(V(s_{t+1})\) to update \(V(s_{t})\).

  • The update is done as:

\[V(s_t) \leftarrow V(s_t) + \alpha\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\]
../../_images/TD_update.png

  • This TD method is called TD(0).

  • We can write a very simple algorithm to evaluate the policy:

Algorithm TD(0)

../../_images/td0.png
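
  • Below is a minimal Python sketch of tabular TD(0) prediction. It assumes a hypothetical environment interface in which reset() returns a state and step(a) returns (next_state, reward, done), and a policy function mapping states to actions; these names are illustrative assumptions, not part of the lecture material.

from collections import defaultdict

def td0_prediction(env, policy, episodes=1000, alpha=0.1, gamma=0.99):
    """Tabular TD(0) evaluation of a fixed policy (illustrative sketch)."""
    V = defaultdict(float)                      # V(s), initialised to 0
    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = policy(s)                       # follow the policy to evaluate
            s_next, r, done = env.step(a)
            target = r + gamma * V[s_next] * (not done)   # TD target
            V[s] += alpha * (target - V[s])     # move V(s) towards the target
            s = s_next                          # update happens at every step
    return V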

Activity

  • What major differences can you see between this algorithm and Monte Carlo?

  • The TD target is an estimate for two reasons:

    1. The expected value of \(r_{t+1} + \gamma V(s_{t+1})\) is sampled: we use a single observed reward and successor state.

    2. It uses the current estimate \(V(s_{t+1})\) instead of the true value \(v_\pi(s_{t+1})\).

  • Moreover, \(\left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]\) can be seen as an error.

    • It measures the difference between the old estimate \(V(s_t)\) and the better estimate \(r_{t+1} + \gamma V(s_{t+1})\).

Important

It is called the TD error:

\[\delta_t = r_{t+1}+\gamma V(s_{t+1}) - V(s_t)\]

The TD error depends on the next state and next reward, so it’s not available until \(t+1\).

Example

  • Some days as you drive to your friend’s house, you try to predict how long it will take to get there.

  • When you leave town, you note the time, the day of the week, the weather, and anything else that might be relevant.

  • The following table summarizes what happened:

../../_images/td-example-table.png

  • If you apply Monte Carlo and TD learning, you would obtain this (the sketch after the figure makes the difference concrete):

../../_images/td-example-graph.png
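
  • The sketch below makes the difference concrete on a single hypothetical trip. The numbers are invented for illustration and are not taken from the table in the figure: Monte Carlo must wait for the final arrival time before updating any prediction, whereas TD(0) can update each prediction as soon as the next leg is completed.

alpha = 1.0  # full step size, for readability

# Hypothetical trip: for each leg we store the state, the predicted
# remaining travel time (minutes), and how long the leg actually took.
legs = [                       # (state, prediction, actual leg duration)
    ("leaving",  30.0, 10.0),
    ("highway",  20.0, 25.0),  # traffic jam: this leg took much longer
    ("exit",      5.0,  5.0),
]

# Monte Carlo: every prediction is moved towards the actual remaining
# time, which is only known once the trip is over.
remaining = sum(actual for _, _, actual in legs)
mc = {}
for state, pred, actual in legs:
    mc[state] = pred + alpha * (remaining - pred)
    remaining -= actual

# TD(0): each prediction is moved towards "actual leg time + next
# prediction", so the update can be made during the trip.
td = {}
for i, (state, pred, actual) in enumerate(legs):
    next_pred = legs[i + 1][1] if i + 1 < len(legs) else 0.0
    td[state] = pred + alpha * (actual + next_pred - pred)

print(mc)   # {'leaving': 40.0, 'highway': 30.0, 'exit': 5.0}
print(td)   # {'leaving': 30.0, 'highway': 30.0, 'exit': 5.0}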

What are the advantages of TD learning?

  • TD methods do not require a model of the environment, i.e., of its reward and transition functions.

  • An advantage over Monte Carlo methods is that they are naturally online and fully incremental: there is no need to wait until the end of an episode to update.

8.2. SARSA: On-Policy TD learning

  • Now that we can evaluate a given policy (prediction), we can use TD methods for control, i.e., to learn a policy.

  • With TD methods we also need to trade off exploration and exploitation.

  • For on-policy methods we must estimate \(q_\pi(s,a)\) for all states \(s\) and actions \(a\).

Activity

  • Why do we need to use state-action pairs?

  • Concretely the update function becomes:

\[q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma q(s_{t+1}, a_{t+1}) - q(s_t, a_t) \right]\]
  • This update utilizes every element of the transition \((s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1})\).

  • This algorithm is called SARSA (after the quintuple \(s, a, r, s', a'\)) and can be written as:

SARSA

../../_images/sarsa.png
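
  • Below is a minimal Python sketch of tabular SARSA, under the same hypothetical environment interface as before, plus an env.actions list of available actions (again an illustrative assumption). The same \(\epsilon\)-greedy policy both selects the actions and supplies the action used in the update target, which is what makes the method on-policy.

import random
from collections import defaultdict

def sarsa(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular SARSA: on-policy TD control (illustrative sketch)."""
    q = defaultdict(float)                      # q[(s, a)], initialised to 0

    def epsilon_greedy(s):
        # explore with probability epsilon, otherwise act greedily w.r.t. q
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        a = epsilon_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(s_next)     # action actually taken next
            target = r + gamma * q[(s_next, a_next)] * (not done)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s, a = s_next, a_next
    return q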

Note

  • The action selection is important to achieve a good balance between exploration and exploitation.

  • The strategies seen in previous topics can be used freely.

8.3. Q-Learning: Off-policy TD learning

  • Q-Learning is an early off-policy TD learning algorithm, introduced by Watkins in 1989.

  • The update function is:

\[q(s_t, a_t) \leftarrow q(s_t, a_t) + \alpha\left[ r_{t+1} + \gamma\max_{a'} q(s_{t+1}, a') - q(s_t, a_t) \right]\]
  • In this case the learned action-value function directly approximates the optimal action-value function \(q^*\), independently of the policy being followed.

Q-Learning Algorithm

../../_images/q-learning.png
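
  • Below is a minimal Python sketch of tabular Q-Learning under the same hypothetical interface. The only change with respect to SARSA is the target: the greedy (max) action value in the next state replaces the value of the action actually taken, which is what makes the method off-policy.

import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Tabular Q-Learning: off-policy TD control (illustrative sketch)."""
    q = defaultdict(float)                      # q[(s, a)], initialised to 0

    def epsilon_greedy(s):
        # behaviour policy: keeps exploring
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: q[(s, a)])

    for _ in range(episodes):
        s = env.reset()
        done = False
        while not done:
            a = epsilon_greedy(s)
            s_next, r, done = env.step(a)
            # target policy: greedy in s', regardless of the action the
            # behaviour policy will actually take next
            best_next = max(q[(s_next, a2)] for a2 in env.actions)
            target = r + gamma * best_next * (not done)
            q[(s, a)] += alpha * (target - q[(s, a)])
            s = s_next
    return q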

Activity

  • Discuss the difference between these two algorithms.