4. Dynamic Programming

  • In this lab you will calculate the optimal policy for the recycling robot problem.

  • You will need to reuse the code that you produced during the previous lab (MDP lab).

4.1. Policy Iteration

  • Implement the policy iteration algorithm as a function with the signature def policy_iteration(env, theta).

    • env: the gym environment.

    • theta: \(\theta\), the convergence threshold for the policy evaluation step.

  • The algorithm needs to return the policy and the value function.

Important

The algorithm needs to be compatible with any Gym environment.
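
One possible shape for policy_iteration is sketched below. It is not a reference solution and rests on several assumptions: the environment exposes a transition model env.P in which P[s][a] is a list of (probability, next_state, reward, done) tuples (the layout used by Gym's toy-text environments; on recent versions this may have to be read from env.unwrapped.P), the state and action spaces are discrete with an n attribute, and a discount factor gamma is passed as an extra keyword argument that is not part of the required signature. make_toy_env is a hypothetical two-state stand-in, used only so the sketch runs without Gym installed.

```python
from types import SimpleNamespace


def make_toy_env():
    """Hypothetical 2-state, 2-action stand-in for a Gym environment.

    P[s][a] is a list of (probability, next_state, reward, done) tuples.
    """
    return SimpleNamespace(
        observation_space=SimpleNamespace(n=2),
        action_space=SimpleNamespace(n=2),
        P={
            0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
            1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
        },
    )


def q_values(env, V, s, gamma):
    """One-step lookahead: expected return of every action in state s."""
    q = []
    for a in range(env.action_space.n):
        total = 0.0
        for prob, s_next, reward, done in env.P[s][a]:
            # Terminal transitions contribute no future value.
            total += prob * (reward + (0.0 if done else gamma * V[s_next]))
        q.append(total)
    return q


def policy_iteration(env, theta, gamma=0.9):  # gamma is an assumed extra argument
    n_states = env.observation_space.n
    V = [0.0] * n_states
    policy = [0] * n_states
    while True:
        # Policy evaluation: sweep until the largest update drops below theta.
        while True:
            delta = 0.0
            for s in range(n_states):
                v_new = q_values(env, V, s, gamma)[policy[s]]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: switch only when another action is strictly better,
        # so ties cannot make the policy oscillate.
        stable = True
        for s in range(n_states):
            q = q_values(env, V, s, gamma)
            best = max(range(len(q)), key=q.__getitem__)
            if q[best] > q[policy[s]] + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


policy, V = policy_iteration(make_toy_env(), theta=1e-8)
print(policy, V)
```

Only env.P, env.observation_space.n and env.action_space.n are read, so the same function should work on any environment that provides these three attributes.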

4.2. Value Iteration

  • Implement the value iteration algorithm as a function with the signature def value_iteration(env, theta).

    • env: the gym environment.

    • theta: \(\theta\), the convergence threshold for the value updates.

  • The algorithm needs to return the policy and the value function.

Important

The algorithm needs to be compatible with any Gym environment.
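
A minimal sketch of value_iteration follows, under stated assumptions rather than as the expected solution: the environment is assumed to expose env.P with P[s][a] given as a list of (probability, next_state, reward, done) tuples (possibly via env.unwrapped.P on recent Gym versions), the spaces are assumed discrete, and gamma is an assumed extra keyword argument that does not appear in the required signature. The hypothetical make_toy_env helper only stands in for a real environment so that the code is self-contained.

```python
from types import SimpleNamespace


def make_toy_env():
    """Hypothetical 2-state, 2-action stand-in for a Gym environment."""
    return SimpleNamespace(
        observation_space=SimpleNamespace(n=2),
        action_space=SimpleNamespace(n=2),
        P={
            0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
            1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
        },
    )


def value_iteration(env, theta, gamma=0.9):  # gamma is an assumed extra argument
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = [0.0] * n_states

    def lookahead(s):
        """Expected return of every action in state s under the current V."""
        return [
            sum(
                prob * (reward + (0.0 if done else gamma * V[s_next]))
                for prob, s_next, reward, done in env.P[s][a]
            )
            for a in range(n_actions)
        ]

    # Apply the Bellman optimality backup until the largest change is below theta.
    while True:
        delta = 0.0
        for s in range(n_states):
            v_new = max(lookahead(s))
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            break

    # Extract the greedy policy from the converged value function.
    policy = []
    for s in range(n_states):
        q = lookahead(s)
        policy.append(max(range(n_actions), key=q.__getitem__))
    return policy, V


policy, V = value_iteration(make_toy_env(), theta=1e-8)
print(policy, V)
```

Unlike policy iteration, there is no separate evaluation loop here: the maximisation over actions is folded into every sweep, and the policy is read off once at the end.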

4.3. Results and Performance

  • Run both algorithms on the environment.

  • With both policies, run 1000 simulations and calculate the average reward.

    • Determine whether one policy is better than the other, and relate your result to the theory seen in class.

  • Finally, run both algorithms again and measure the time each one takes to finish.

    • Determine whether one is faster than the other, and relate your result to the theory seen in class.
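
The two measurements above could be organised roughly as follows. This is a sketch, not a prescribed harness: average_reward, time_call, the max_steps cap and the CountdownEnv stand-in are all hypothetical names introduced for illustration. The policy is assumed to be a sequence indexed by an integer state, and the helper tries to accept both the older Gym step result (observation, reward, done, info) and the newer one (observation, reward, terminated, truncated, info). The tuple check on reset() would misfire for old-style environments whose observations are themselves tuples.

```python
import time


class CountdownEnv:
    """Hypothetical stand-in: one state, reward equals the action, 5 steps per episode."""

    def reset(self):
        self.t = 0
        return 0

    def step(self, action):
        self.t += 1
        return 0, float(action), self.t >= 5, {}


def average_reward(env, policy, n_episodes=1000, max_steps=100):
    """Mean total reward per episode when following a deterministic policy."""
    total = 0.0
    for _ in range(n_episodes):
        out = env.reset()
        # Newer Gym versions return (observation, info); older ones the observation only.
        state = out[0] if isinstance(out, tuple) else out
        for _ in range(max_steps):  # cap needed for tasks that never terminate
            result = env.step(policy[state])
            state, reward = result[0], result[1]
            total += reward
            # End-of-episode flags sit between reward and info:
            # (done,) in the old API, (terminated, truncated) in the new one.
            if any(result[2:-1]):
                break
    return total / n_episodes


def time_call(fn, *args, repeats=5):
    """Return fn's result and the fastest wall-clock time over several runs."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = fn(*args)
        best = min(best, time.perf_counter() - start)
    return result, best


env = CountdownEnv()
score, seconds = time_call(average_reward, env, [1])
print(f"average reward: {score:.2f} (evaluated in {seconds:.4f} s)")
```

The same time_call helper can wrap policy_iteration(env, theta) and value_iteration(env, theta) directly. Taking the fastest of several repeats is one common way to reduce timing noise; reporting the mean is an equally defensible choice.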