6. TD Learning
In this lab you will implement both an on-policy and an off-policy control algorithm based on TD learning.
You will need to reuse the code you produced in the previous labs (MDP lab).
6.1. On-Policy
Implement the on-policy method.
Create a function

on_policy_TD(env, eps, alpha, T, E)

env: the gym environment.
eps: the \(\epsilon\) parameter.
alpha: the step-size parameter.
T: the maximum number of steps.
E: the number of episodes.

The function needs to return the policy and the history of the sum of rewards for each episode.
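As a starting point, here is a minimal sketch of such a function. It assumes a tabular Gymnasium-style environment (discrete observation and action spaces, reset() returning (obs, info), step() returning a five-element tuple), interprets T as a per-episode step cap, and fixes the discount factor gamma = 0.99, which the handout does not specify. The on-policy TD control method sketched here is SARSA.

import numpy as np

def on_policy_TD(env, eps, alpha, T, E):
    # SARSA: on-policy TD control with an epsilon-greedy behaviour policy.
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))
    gamma = 0.99  # assumption: discount factor is not specified in the handout
    history = []  # sum of rewards collected in each episode

    def eps_greedy(s):
        # Explore with probability eps, otherwise act greedily w.r.t. Q.
        if np.random.rand() < eps:
            return env.action_space.sample()
        return int(np.argmax(Q[s]))

    for _ in range(E):
        s, _ = env.reset()
        a = eps_greedy(s)
        total = 0.0
        for _ in range(T):  # T interpreted as a per-episode step cap
            s_next, r, terminated, truncated, _ = env.step(a)
            a_next = eps_greedy(s_next)
            # On-policy update: bootstrap on the action actually taken next.
            Q[s, a] += alpha * (r + gamma * Q[s_next, a_next] - Q[s, a])
            total += r
            s, a = s_next, a_next
            if terminated or truncated:
                break
        history.append(total)

    policy = np.argmax(Q, axis=1)  # greedy policy extracted from Q
    return policy, history

The update bootstraps on the action the \(\epsilon\)-greedy policy actually selects next, which is what makes the method on-policy.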
6.2. Off-Policy
Implement the off-policy method.
Create a function

off_policy_TD(env, eps, alpha, T, E)

env: the gym environment.
eps: the \(\epsilon\) parameter.
alpha: the step-size parameter.
T: the maximum number of steps.
E: the number of episodes.

The function needs to return the policy and the history of the sum of rewards for each episode.
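Under the same assumptions as the on-policy sketch above, the off-policy method (Q-learning) differs only in its update: it bootstraps on the greedy (max) next action rather than on the action the behaviour policy actually takes.

import numpy as np

def off_policy_TD(env, eps, alpha, T, E):
    # Q-learning: off-policy TD control; the behaviour policy is
    # epsilon-greedy, but the target is the greedy next action.
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    Q = np.zeros((n_states, n_actions))
    gamma = 0.99  # assumption: same discount factor as the on-policy sketch
    history = []

    for _ in range(E):
        s, _ = env.reset()
        total = 0.0
        for _ in range(T):
            # epsilon-greedy behaviour policy.
            if np.random.rand() < eps:
                a = env.action_space.sample()
            else:
                a = int(np.argmax(Q[s]))
            s_next, r, terminated, truncated, _ = env.step(a)
            # Off-policy update: target uses the max over next actions,
            # regardless of what the behaviour policy will do next.
            Q[s, a] += alpha * (r + gamma * np.max(Q[s_next]) - Q[s, a])
            total += r
            s = s_next
            if terminated or truncated:
                break
        history.append(total)

    policy = np.argmax(Q, axis=1)
    return policy, history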
6.3. Experiments
For different \(T\) and \(E\):
Run both algorithms.
Plot the evolution of the sum of rewards for both algorithms (a sketch of such a driver follows this list).
Draw conclusions about both algorithms and compare them with what was said in class.
Now compare with the policy calculated using Monte-Carlo methods.
Which method achieves the best performance?
Why?
Compare the ratio of efficiency to running time for each method.
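A minimal driver for these experiments could look as follows, assuming FrozenLake-v1 as the environment (any tabular environment from the MDP lab should work) and hypothetical parameter choices (eps = 0.1, alpha = 0.1, two (T, E) pairs). The per-episode sums are smoothed with a moving average because the raw curves are noisy; running times can be compared by timing each call with time.perf_counter().

import gymnasium as gym  # assumption: Gymnasium-style API, as in the sketches above
import matplotlib.pyplot as plt
import numpy as np

def smooth(x, w=50):
    # Moving average to make the noisy per-episode sums readable.
    return np.convolve(x, np.ones(w) / w, mode="valid")

env = gym.make("FrozenLake-v1")  # assumption: any tabular env works here
for T, E in [(100, 1000), (200, 5000)]:  # hypothetical (T, E) pairs
    _, hist_on = on_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
    _, hist_off = off_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
    plt.plot(smooth(hist_on), label=f"on-policy (SARSA), T={T}, E={E}")
    plt.plot(smooth(hist_off), label=f"off-policy (Q-learning), T={T}, E={E}")
plt.xlabel("episode")
plt.ylabel("sum of rewards (moving average)")
plt.legend()
plt.show()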