4. Dynamic Programming

In this lab you will calculate the optimal policy for the recycling robot problem.
You will need to reuse the code that you produced during the previous lab (MDP lab).

4.1. Policy Iteration

Implement the policy iteration algorithm as a function with the signature def policy_iteration(env, theta).
- env: the Gym environment.
- theta: the threshold \(\theta\) below which policy evaluation is considered to have converged.
The function must return the policy and the value function.

Important

The algorithm must be compatible with any Gym environment.
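As a starting point, the two alternating steps of policy iteration (evaluation until the change falls below \(\theta\), then greedy improvement) could be sketched as follows. This is a minimal sketch under stated assumptions, not a reference solution: it assumes the environment exposes nS, nA and a transition table P[s][a] of (probability, next_state, reward, done) tuples, as the older Gym toy-text environments do. The discount factor gamma is an extra keyword argument added for illustration, and TinyMDP is a hypothetical two-state stand-in, not the recycling robot.

```python
import numpy as np


class TinyMDP:
    """Hypothetical two-state stand-in exposing a Gym-style transition table."""

    def __init__(self):
        self.nS, self.nA = 2, 2
        # P[s][a] = list of (probability, next_state, reward, done)
        self.P = {
            0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
            1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
        }


def q_values(env, V, s, gamma):
    """One-step lookahead: expected return of every action taken in state s."""
    q = np.zeros(env.nA)
    for a in range(env.nA):
        for prob, s_next, reward, done in env.P[s][a]:
            # Terminal transitions contribute no bootstrapped value.
            q[a] += prob * (reward + gamma * V[s_next] * (not done))
    return q


def policy_iteration(env, theta, gamma=0.9):
    # gamma is an assumption: the lab signature only lists env and theta.
    policy = np.zeros(env.nS, dtype=int)
    V = np.zeros(env.nS)
    while True:
        # Policy evaluation: sweep until the largest update is below theta.
        while True:
            delta = 0.0
            for s in range(env.nS):
                v_new = q_values(env, V, s, gamma)[policy[s]]
                delta = max(delta, abs(v_new - V[s]))
                V[s] = v_new
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in range(env.nS):
            best = int(np.argmax(q_values(env, V, s, gamma)))
            if best != policy[s]:
                policy[s] = best
                stable = False
        if stable:
            return policy, V
```

With a real environment, the table may need to be read from env.unwrapped.P and the sizes from env.observation_space.n and env.action_space.n, depending on the Gym version.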

4.2. Value Iteration

Implement the value iteration algorithm as a function with the signature def value_iteration(env, theta).
- env: the Gym environment.
- theta: the threshold \(\theta\) used as the convergence criterion.
The function must return the policy and the value function.

Important

The algorithm must be compatible with any Gym environment.
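Value iteration folds evaluation and improvement into a single sweep that takes the maximum over actions, and extracts the greedy policy only once at the end. The sketch below illustrates this under assumptions: it again assumes a Gym-style table P[s][a] of (probability, next_state, reward, done) tuples plus nS and nA attributes. The gamma keyword, the lookahead helper and the TinyMDP stand-in are hypothetical names introduced for illustration.

```python
import numpy as np


class TinyMDP:
    """Hypothetical two-state stand-in exposing a Gym-style transition table."""

    def __init__(self):
        self.nS, self.nA = 2, 2
        # P[s][a] = list of (probability, next_state, reward, done)
        self.P = {
            0: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 1.0, False)]},
            1: {0: [(1.0, 0, 0.0, False)], 1: [(1.0, 1, 2.0, False)]},
        }


def lookahead(env, V, s, a, gamma):
    """Expected return of taking action a in state s, then following V."""
    return sum(
        prob * (reward + gamma * V[s_next] * (not done))
        for prob, s_next, reward, done in env.P[s][a]
    )


def value_iteration(env, theta, gamma=0.9):
    # gamma is an assumption: the lab signature only lists env and theta.
    V = np.zeros(env.nS)
    while True:
        delta = 0.0
        for s in range(env.nS):
            # Bellman optimality backup: keep the best action value.
            best = max(lookahead(env, V, s, a, gamma) for a in range(env.nA))
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < theta:
            break
    # Extract the greedy policy from the converged value function.
    policy = np.array(
        [
            int(np.argmax([lookahead(env, V, s, a, gamma) for a in range(env.nA)]))
            for s in range(env.nS)
        ]
    )
    return policy, V
```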

4.3. Results and Performance

Run both algorithms on the environment.
With each of the two resulting policies, run 1000 simulations and calculate the average reward.
- Determine whether one policy is better than the other, and relate your findings to the theory seen in class.
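One possible way to organise the simulations is sketched below. It makes several assumptions: "average reward" is interpreted as the mean total reward per simulation, each simulation is capped at max_steps because a continuing task may never signal termination, and both the older 4-tuple and the newer 5-tuple return formats of step are accepted. ToyEnv is a hypothetical deterministic stand-in used only so the sketch runs on its own.

```python
class ToyEnv:
    """Hypothetical deterministic stand-in with the older Gym reset/step API."""

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Action 0 returns to state 0 with no reward; action 1 moves to
        # state 1 and pays 1.0 from state 0 or 2.0 from state 1.
        reward = 0.0 if action == 0 else (1.0 if self.state == 0 else 2.0)
        self.state = action
        return self.state, reward, False, {}


def run_episode(env, policy, max_steps=100):
    """Follow the policy for one simulation and return the total reward."""
    out = env.reset()
    # Newer versions return (observation, info); older ones the observation.
    state = out[0] if isinstance(out, tuple) else out
    total = 0.0
    for _ in range(max_steps):
        result = env.step(int(policy[state]))
        if len(result) == 5:
            state, reward, terminated, truncated, _ = result
            done = terminated or truncated
        else:
            state, reward, done, _ = result
        total += reward
        if done:
            break
    return total


def average_reward(env, policy, n_episodes=1000, max_steps=100):
    """Mean total reward over n_episodes independent simulations."""
    totals = [run_episode(env, policy, max_steps) for _ in range(n_episodes)]
    return sum(totals) / n_episodes
```

Calling average_reward(env, policy) once per policy, with the same max_steps for both, keeps the comparison fair.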
Finally, run both algorithms again, this time measuring how long each one takes to finish.
- Determine whether one is faster than the other, and relate your findings to the theory seen in class.
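The running time can be measured with the standard library's time.perf_counter. The helper below is a small sketch: time_call is a hypothetical name, and taking the best of several repeats is one choice among others, made here to reduce noise from other processes.

```python
import time


def time_call(fn, *args, repeats=5, **kwargs):
    """Return the shortest wall-clock time, in seconds, of several calls."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best
```

For example, time_call(policy_iteration, env, 1e-8) and time_call(value_iteration, env, 1e-8) would give comparable figures, provided both calls use the same environment and the same theta.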