6. TD Learning

  • In this lab you will implement both the on-policy and off-policy control algorithms using TD learning.

  • You will need to reuse the code that you produced during the previous labs (MDP lab).

6.1. On-Policy

  • Implement the on-policy method.

  • Create a function on_policy_TD(env, eps, alpha, T, E):

    • env: the gym environment.

    • eps: the exploration parameter \(\epsilon\) of the \(\epsilon\)-greedy policy.

    • alpha: the step-size parameter \(\alpha\).

    • T: the maximum number of steps per episode.

    • E: the number of episodes.

  • The function must return the learned policy and the history of the sum of rewards for each episode (see the sketch below).
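
  • Below is a minimal sketch of one possible implementation (on-policy TD control, commonly known as SARSA). It assumes a tabular environment with discrete observation and action spaces (e.g. FrozenLake), the Gymnasium-style reset/step API, and an extra discount parameter gamma that is not part of the requested signature; adapt it to your own code from the previous labs.

    import numpy as np

    def on_policy_TD(env, eps, alpha, T, E, gamma=0.99):
        # On-policy TD control (SARSA) sketch.
        # Assumptions: discrete observation/action spaces and the
        # Gymnasium-style API (reset returns (obs, info), step returns
        # a 5-tuple); gamma is an assumed extra discount parameter.
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        history = []

        def eps_greedy(s):
            # With probability eps explore, otherwise act greedily w.r.t. Q.
            if np.random.rand() < eps:
                return env.action_space.sample()
            return int(np.argmax(Q[s]))

        for _ in range(E):
            s, _ = env.reset()
            a = eps_greedy(s)
            total_reward = 0.0
            for _ in range(T):
                s2, r, terminated, truncated, _ = env.step(a)
                a2 = eps_greedy(s2)  # next action drawn from the same policy
                # SARSA update: bootstrap on the action actually taken next.
                Q[s, a] += alpha * (r + gamma * Q[s2, a2] - Q[s, a])
                total_reward += r
                s, a = s2, a2
                if terminated or truncated:
                    break
            history.append(total_reward)

        policy = np.argmax(Q, axis=1)  # greedy policy extracted from Q
        return policy, history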

6.2. Off-Policy

  • Implement the off-policy method.

  • Create a function off_policy_TD(env, eps, alpha, T, E):

    • env: the gym environment.

    • eps: the exploration parameter \(\epsilon\) of the \(\epsilon\)-greedy policy.

    • alpha: the step-size parameter \(\alpha\).

    • T: the maximum number of steps per episode.

    • E: the number of episodes.

  • The function must return the learned policy and the history of the sum of rewards for each episode (see the sketch below).
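
  • A minimal sketch of one possible implementation (off-policy TD control, commonly known as Q-learning) is given below, under the same assumptions as the on-policy sketch above (discrete spaces, Gymnasium-style API, assumed extra discount parameter gamma).

    import numpy as np

    def off_policy_TD(env, eps, alpha, T, E, gamma=0.99):
        # Off-policy TD control (Q-learning) sketch.
        Q = np.zeros((env.observation_space.n, env.action_space.n))
        history = []

        for _ in range(E):
            s, _ = env.reset()
            total_reward = 0.0
            for _ in range(T):
                # Behaviour policy: eps-greedy with respect to the current Q.
                if np.random.rand() < eps:
                    a = env.action_space.sample()
                else:
                    a = int(np.argmax(Q[s]))
                s2, r, terminated, truncated, _ = env.step(a)
                # Q-learning bootstraps on the greedy (target) policy,
                # independently of the next action the behaviour policy takes.
                Q[s, a] += alpha * (r + gamma * np.max(Q[s2]) - Q[s, a])
                total_reward += r
                s = s2
                if terminated or truncated:
                    break
            history.append(total_reward)

        policy = np.argmax(Q, axis=1)  # greedy policy extracted from Q
        return policy, history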

6.3. Experiments

  • For different values of \(T\) and \(E\):

    • Run both algorithms.

    • Plot the evolution of the sum of rewards for both algorithms (see the plotting sketch at the end of this section).

  • Draw conclusions about both algorithms and compare them with what was said in class.

  • Now compare with the policy computed using the Monte Carlo methods.

    • Which method gives the best performance?

    • Why?

  • Compare the trade-off between performance (sum of rewards obtained) and running time for each method.
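
  • A possible way to run and compare both algorithms is sketched below, assuming the two functions above are available in the same script. The environment name, the (T, E) settings, eps, alpha and the smoothing window are illustrative assumptions; running time is measured with time.perf_counter so that the performance/running-time trade-off can be discussed.

    import time
    import numpy as np
    import matplotlib.pyplot as plt
    import gymnasium as gym  # or `import gym`, depending on your installation

    env = gym.make("FrozenLake-v1")  # illustrative choice of environment

    for T, E in [(100, 1000), (200, 5000)]:  # example settings, adjust as needed
        t0 = time.perf_counter()
        _, hist_on = on_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
        t_on = time.perf_counter() - t0

        t0 = time.perf_counter()
        _, hist_off = off_policy_TD(env, eps=0.1, alpha=0.1, T=T, E=E)
        t_off = time.perf_counter() - t0
        print(f"T={T}, E={E}: on-policy {t_on:.1f}s, off-policy {t_off:.1f}s")

        # Smooth the reward histories with a running mean so the trend is visible.
        window = 50
        kernel = np.ones(window) / window

        plt.figure()
        plt.plot(np.convolve(hist_on, kernel, mode="valid"), label="on-policy (SARSA)")
        plt.plot(np.convolve(hist_off, kernel, mode="valid"), label="off-policy (Q-learning)")
        plt.xlabel("episode")
        plt.ylabel("sum of rewards")
        plt.title(f"T={T}, E={E}")
        plt.legend()
    plt.show()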