6. TD Learning
In this lab you will implement both an on-policy and an off-policy control algorithm based on TD learning.
You will need to reuse the code you produced in the previous labs (MDP lab).
6.1. On-Policy
Implement the on-policy method.
Create a function

on_policy_TD(env, eps, alpha, T, E)

env: the gym environment.
eps: the \(\epsilon\) parameter.
alpha: the step-size parameter.
T: the maximum number of steps.
E: the number of episodes.

The function needs to return the policy and the history of the sum of rewards for each episode.
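As a starting point, here is a minimal sketch of such a function. It assumes a tabular Gymnasium-style environment (discrete observation and action spaces, reset() returning (obs, info), step() returning a five-element tuple), interprets T as a per-episode step cap, and fixes the discount factor gamma = 0.99, which the handout does not specify. The on-policy TD control method sketched here is SARSA.

import numpy as np

def on_policy_TD(env, eps, alpha, T, E):
    # SARSA: on-policy TD control with an epsilon-greedy behaviour policy.
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))
    gamma = 0.99  # assumption: discount factor is not specified in the handout
    history = []  # sum of rewards collected in each episode

    def eps_greedy(s):
        # Explore with probability eps, otherwise act greedily w.r.t. Q.
        if np.random.rand() < eps:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(E):
        s, _ = env.reset()
        a = eps_greedy(s)
        total = 0.0
        for _ in range(T):  # T interpreted as a per-episode step cap
            s_next, r, terminated, truncated, _ = env.step(a)
            a_next = eps_greedy(s_next)
            # On-policy update: bootstrap on the action actually taken next.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            total += r
            s, a = s_next, a_next
            if terminated or truncated:
                break
        history.append(total)

    policy = np.argmax(Q, axis=1)  # greedy policy extracted from Q
    return policy, history

The update bootstraps on the action the \(\epsilon\)-greedy policy actually selects next, which is what makes the method on-policy.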
6.2. Off-Policy
Implement the off-policy method.
Create a function

off_policy_TD(env, eps, alpha, T, E)

env: the gym environment.
eps: the \(\epsilon\) parameter.
alpha: the step-size parameter.
T: the maximum number of steps.
E: the number of episodes.

The function needs to return the policy and the history of the sum of rewards for each episode.
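Under the same assumptions as the on-policy sketch above, the off-policy method (Q-learning) differs only in its update: it bootstraps on the greedy (max) next action rather than on the action the behaviour policy actually takes.

import numpy as np

def off_policy_TD(env, eps, alpha, T, E):
    # Q-learning: off-policy TD control; the behaviour policy is
    # epsilon-greedy, but the target is the greedy next action.
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))
    gamma = 0.99  # assumption: same discount factor as the on-policy sketch
    history = []

    for _ in range(E):
        s, _ = env.reset()
        total = 0.0
        for _ in range(T):
            # epsilon-greedy behaviour policy.
            if np.random.rand() < eps:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            # Off-policy update: target uses the max over next actions,
            # regardless of what the behaviour policy will do next.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            total += r
            s = s_next
            if terminated or truncated:
                break
        history.append(total)

    policy = np.argmax(Q, axis=1)
    return policy, history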
6.3. Experiments
For different \(T\) and \(E\):
Run both algorithms.
Plot the evolution of the sum of rewards for both algorithms (a sketch of such a driver follows this list).
Draw conclusions about both algorithms and compare them with what was said in class.
Now compare with the policy calculated using Monte-Carlo methods.
Which method achieves the best performance?
Why?
Compare the ratio of efficiency to running time for each method.
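A minimal driver for these experiments could look as follows, assuming FrozenLake-v1 as the environment (any tabular environment from the MDP lab should work) and hypothetical parameter choices (eps = 0.1, alpha = 0.1, two (T, E) pairs). The per-episode sums are smoothed with a moving average because the raw curves are noisy; running times can be compared by timing each call with time.perf_counter().

import gymnasium as gym  # assumption: Gymnasium-style API, as in the sketches above
import matplotlib.pyplot as plt
import numpy as np

def smooth(x, w=50):
    # Moving average to make the noisy per-episode sums readable.
    return np.convolve(x, np.ones(w) / w, mode="valid")

env = gym.make("FrozenLake-v1")  # assumption: any tabular env works here
for T, E in [(100, 1000), (200, 5000)]:  # hypothetical (T, E) pairs
    _, hist_on = on_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
    _, hist_off = off_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
    plt.plot(smooth(hist_on), label=f"on-policy (SARSA), T={T}, E={E}")
    plt.plot(smooth(hist_off), label=f"off-policy (Q-learning), T={T}, E={E}")
plt.xlabel("episode")
plt.ylabel("sum of rewards (moving average)")
plt.legend()
plt.show()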