**********************************
Topic 10 - Markov Decision Process
**********************************

* Until now we have seen how a robot moves or localizes itself.
* It is now important to cover how a robot can plan its mission.
* We will now cover probabilistic planning algorithms.

Motivation
==========

We can take a few examples:

1. A robotic manipulator grasps and assembles parts arriving in random configurations on a conveyor belt.

   * The parts that arrive are unknown in advance.
   * The optimal strategy would require knowing the future parts.

2. A vehicle has to move from point A to point B in Toronto.

   * Should it take the shortest path through downtown, at the cost of losing the GPS signal?
   * Or should it take the longest path, but keep the GPS signal?

3. Robots need to map an area collectively.

   * Should they seek each other out?
   * Or should they cover more unknown area alone?

In robotics we need to consider two types of uncertainty:

* Uncertainty on the actions.
* Uncertainty on the measurements.

.. admonition:: Activity
   :class: warning

   * How would you handle these two uncertainties?

Markov Decision Process
=======================

* We will start by tackling the problem of uncertainty on the actions.
* A Markov Decision Process (MDP) is a mathematical framework that allows us to model a stochastic environment.

The Idea
--------

* MDPs can be represented graphically as follows:

.. figure:: ../img/topic10/mdp.png
   :align: center

* We have an agent, in our case a robot.
* The agent evolves in an environment.
* The agent can interact with this environment through different actions.
* When the agent takes an action, the state of the environment changes and the agent receives a reward.

.. admonition:: Activity
   :class: warning

   * Consider a robot that needs to navigate in a maze.
   * Try to answer the following questions:

     * What would be the environment?
     * What would be the actions?
     * What would be the state?

Formally
--------

* We need to define MDPs formally.

.. admonition:: Definition

   An MDP is defined by a tuple :math:`(S,A,T,R)`:

   * :math:`S` is the state space that contains the set of states :math:`s\in S`.
   * :math:`A` is the action space that contains the set of actions :math:`a\in A`.
   * :math:`T: S\times A\times S \rightarrow [0,1]` is the transition function that defines the probability :math:`p(s_{t+1}=s'|s_{t}=s,a_{t}=a)`.
   * :math:`R: S\times A \rightarrow \mathbb{R}` is the reward function that defines :math:`r(s,a)`.

.. admonition:: Activity
   :class: warning

   * Remember the problem of detecting Alan in the hallway?

     * Alan can only move forward.
     * At each time step, Alan can overshoot by 1 cell with a probability of 10%.
     * Alan can undershoot by 1 cell with a probability of 10%.

   * Now we want to model Alan moving in the hallway as an MDP:

     * Define the state space.
     * Define the action space.
     * Define the transition function.

Solving an MDP
--------------

* The strategy in an MDP is called a policy :math:`\pi: S \rightarrow A`.
* It is a function that gives, for each state :math:`s\in S`, the action :math:`a\in A` to take.
* When a policy is followed we obtain a trajectory, and the discounted sum of the rewards collected along it is called the **return** :math:`G_t`:

.. math::

   G_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}

where :math:`\gamma` is a discount rate with :math:`0 \leq \gamma \leq 1`.

* The discount rate determines the present value of future rewards.
* The closer the discount rate is to 1, the more strongly future rewards are taken into account.
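* As a quick illustration, the sketch below computes the return :math:`G_0` for a short reward sequence (the reward values and :math:`\gamma` are made-up numbers for illustration only):

.. code-block:: python

   # Minimal sketch: discounted return G_0 for a finite reward sequence.
   # The rewards and gamma below are hypothetical values, not taken from
   # a specific problem in these notes.
   rewards = [-1.0, -1.0, -1.0, 10.0]   # r_1, r_2, r_3, r_4
   gamma = 0.9                          # discount rate

   G0 = sum(gamma**k * r for k, r in enumerate(rewards))
   print(G0)  # -1 - 0.9 - 0.81 + 7.29 = 4.58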
* The objective in an MDP is to compute the optimal policy :math:`\pi^*`, i.e., the policy that maximizes the value function :math:`V^\pi(s_0)`.
* :math:`V^\pi(s_0)` is the total expected reward if the agent follows :math:`\pi` starting in :math:`s_0`:

.. math::

   V^\pi(s_0) = \mathbb{E}\left[G_0|s_0 \right] = \mathbb{E}\left[ \sum_{t=0}^{\infty}\gamma^t r(s_t,\pi(s_t))|s_0 \right]

* So the optimal value function :math:`V^{\pi^*}` is defined as:

.. math::

   \begin{aligned}
   V^{\pi^*}(s) &= \max_{\pi} V^\pi(s)\\
   &= \max_{a}\mathbb{E}_{\pi^*}\left[G_t | s, a \right]\\
   &= \max_{a}\mathbb{E}_{\pi^*}\left[r_{t+1} + \gamma G_{t+1} | s, a \right]\\
   &= \max_{a}\mathbb{E}_{\pi^*}\left[r_{t+1} + \gamma V^{\pi^*}(s_{t+1}) | s, a \right]\\
   &= \max_{a}\; r(s,a)+\gamma\sum_{s'}p(s'|s,a)V^{\pi^*}(s')\\
   \end{aligned}

Dynamic Programming
===================

* We have defined what it means to solve an MDP, but we still need to discuss algorithms that do it.
* MDPs are usually solved with dynamic programming algorithms.
* We will assume a finite MDP (finite horizon).

The Idea
--------

* We want to use the value functions to search for a good policy.
* We have seen that we can obtain the optimal policy once we have found the optimal value function.
* The idea is therefore to calculate the optimal value function.

Policy Iteration
----------------

* Policy Iteration is an exact method.
* It works in three steps:

  1. Initialize a random policy.
  2. Policy evaluation.
  3. Policy improvement.

.. admonition:: Pseudocode
   :class: important

   .. figure:: ../img/topic10/policy_iteration.png
      :align: center

.. admonition:: Example

   * A robot needs to reach either the top-left cell or the bottom-right cell.
   * At first the policy is initialized randomly.
   * After applying policy iteration we converge to the optimal policy.

   .. figure:: ../img/topic10/value_iteration_01.png
      :align: center

   .. figure:: ../img/topic10/value_iteration_02.png
      :align: center

.. admonition:: Activity
   :class: warning

   * Consider the following problem:

     * A hallway consisting of 7 tiles.
     * The goal is for a robot to reach the center tile.
     * It can start from any tile.
     * It can move either left or right.
     * Each time the robot moves it receives a reward of -1.
     * If it reaches the center tile, it receives a reward of +10.

   * Model this as an MDP.
   * Solve it by hand.
   * Implement it in Python.

Efficiency
----------

* Dynamic programming is very efficient for many problems.
* This is also the case for Markov Decision Processes.
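* To connect the policy iteration pseudocode to runnable code, here is a minimal sketch for a tabular MDP. The array layout (``T[s, a, s']`` for :math:`p(s'|s,a)` and ``R[s, a]`` for :math:`r(s,a)`), the toy two-state example, and all the numbers are assumptions made for illustration, not the course's reference implementation.

.. code-block:: python

   import numpy as np

   def policy_evaluation(T, R, policy, gamma, tol=1e-8):
       """Iteratively compute V^pi for a fixed deterministic policy."""
       n_states = T.shape[0]
       V = np.zeros(n_states)
       while True:
           V_new = np.array([R[s, policy[s]] + gamma * T[s, policy[s]] @ V
                             for s in range(n_states)])
           if np.max(np.abs(V_new - V)) < tol:
               return V_new
           V = V_new

   def policy_improvement(T, R, V, gamma):
       """Greedy policy w.r.t. V: one Bellman backup per state-action pair."""
       Q = R + gamma * T @ V  # Q[s, a] = r(s,a) + gamma * sum_s' p(s'|s,a) V(s')
       return np.argmax(Q, axis=1)

   def policy_iteration(T, R, gamma=0.9):
       n_states, n_actions = R.shape
       policy = np.zeros(n_states, dtype=int)               # 1. initial policy
       while True:
           V = policy_evaluation(T, R, policy, gamma)       # 2. policy evaluation
           new_policy = policy_improvement(T, R, V, gamma)  # 3. policy improvement
           if np.array_equal(new_policy, policy):           # stop when stable
               return policy, V
           policy = new_policy

   # Hypothetical 2-state, 2-action MDP: action 0 stays in place, action 1
   # switches states; only staying in state 1 pays a reward of 1.
   T = np.zeros((2, 2, 2))
   T[:, 0, :] = np.eye(2)        # "stay"
   T[:, 1, :] = np.eye(2)[::-1]  # "switch"
   R = np.array([[0.0, 0.0],
                 [1.0, 0.0]])
   policy, V = policy_iteration(T, R, gamma=0.9)
   print(policy)  # [1 0]: switch in state 0, stay in state 1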