4. Dynamic Programming
In this lab you will calculate the optimal policy for the recycling robot problem.
You will need to reuse the code that you produced during the previous lab (MDP lab).
4.1. Policy Iteration
Implement the policy iteration algorithm
def policy_iteration(env, theta)
env: the gym environment.
theta: \(\theta\) for the convergence of the value evaluation.
The algorithm needs to return the policy and the value function.
Important
The algorithm needs to be compatible with any Gym environment.
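As a starting point, here is a minimal sketch of one possible implementation. It assumes the environment exposes its transition model as env.P[s][a], a list of (probability, next_state, reward, done) tuples (the convention used by Gym's toy-text environments), and it adds a discount factor gamma as a keyword argument, which is not part of the required signature:

import numpy as np

def policy_iteration(env, theta, gamma=0.9):
    """Policy iteration: alternate evaluation and greedy improvement until stable."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    policy = np.zeros(n_states, dtype=int)  # arbitrary initial policy
    V = np.zeros(n_states)

    stable = False
    while not stable:
        # Policy evaluation: sweep until the largest update is below theta.
        while True:
            delta = 0.0
            for s in range(n_states):
                v = V[s]
                V[s] = sum(p * (r + gamma * V[s2])
                           for p, s2, r, _ in env.P[s][policy[s]])
                delta = max(delta, abs(v - V[s]))
            if delta < theta:
                break
        # Policy improvement: act greedily with respect to the current V.
        stable = True
        for s in range(n_states):
            old_action = policy[s]
            q = [sum(p * (r + gamma * V[s2])
                     for p, s2, r, _ in env.P[s][a])
                 for a in range(n_actions)]
            policy[s] = int(np.argmax(q))
            if policy[s] != old_action:
                stable = False
    return policy, V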
4.2. Value Iteration
Implement the value iteration algorithm
def value_iteration(env, theta)
env: the gym environment.
theta: \(\theta\) for the convergence.
The algorithm needs to return the policy and the value function.
Important
The algorithm needs to be compatible with any Gym environment.
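Under the same assumptions about env.P and the extra gamma argument, a sketch of value iteration might look like this:

import numpy as np

def value_iteration(env, theta, gamma=0.9):
    """Value iteration: Bellman optimality backups, then greedy policy extraction."""
    n_states = env.observation_space.n
    n_actions = env.action_space.n
    V = np.zeros(n_states)

    while True:
        delta = 0.0
        for s in range(n_states):
            v = V[s]
            # Bellman optimality backup: maximise over actions.
            V[s] = max(sum(p * (r + gamma * V[s2])
                           for p, s2, r, _ in env.P[s][a])
                       for a in range(n_actions))
            delta = max(delta, abs(v - V[s]))
        if delta < theta:
            break

    # Extract the greedy policy from the converged value function.
    policy = np.zeros(n_states, dtype=int)
    for s in range(n_states):
        q = [sum(p * (r + gamma * V[s2])
                 for p, s2, r, _ in env.P[s][a])
             for a in range(n_actions)]
        policy[s] = int(np.argmax(q))
    return policy, V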
4.3. Results and Performance
Run both algorithms on the environment.
Using each resulting policy, run 1000 simulations and compute the average reward.
Determine whether one policy is better than the other, and relate the outcome to the theory seen in class.
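One possible evaluation loop is sketched below. It assumes the classic Gym API, where reset() returns an observation and step() returns a 4-tuple (adapt the unpacking if you use the newer Gymnasium API); max_steps is a hypothetical cap on episode length, and pi_policy and vi_policy stand for the policies returned by your two implementations:

def average_reward(env, policy, n_episodes=1000, max_steps=100):
    """Average total reward of a policy over n_episodes rollouts."""
    total = 0.0
    for _ in range(n_episodes):
        s = env.reset()
        for _ in range(max_steps):
            s, r, done, _ = env.step(int(policy[s]))
            total += r
            if done:
                break
    return total / n_episodes

print("policy iteration:", average_reward(env, pi_policy))
print("value iteration: ", average_reward(env, vi_policy))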
Finally, run both algorithms again, but this time measure how long each algorithm takes to finish.
Determine whether one is faster than the other, and relate the outcome to the theory seen in class.
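A simple way to time both algorithms is time.perf_counter; the theta=1e-6 below is an arbitrary choice:

import time

for name, algo in [("policy iteration", policy_iteration),
                   ("value iteration", value_iteration)]:
    start = time.perf_counter()
    algo(env, theta=1e-6)
    elapsed = time.perf_counter() - start
    print(f"{name}: {elapsed:.4f} s")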