********************************************
Partially Observable Markov Decision Process
********************************************

* A Markov Decision Process (MDP) is a good framework for control decisions when the measurements are perfect.
* However, most of the time the measurements are not perfect.
* To address the uncertainty in the measurements, a new framework was created: the Partially Observable Markov Decision Process (POMDP).

How does it work?
=================

* In MDPs, we use the value iteration equation:

.. math::

    v^*(s) = \max_a \sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v^*(s') \right]

* However, in POMDPs we do not know what the state :math:`s` is.
* In robotics we use the notion of a belief state :math:`b`.
* If we replace :math:`s` by :math:`b` in the previous equation we obtain:

.. math::

    v^*(b) = \max_a \int p(b'|b,a)\left[ r(b,a) + \gamma v^*(b') \right] db'

* Giving us the control policy:

.. math::

    \pi(b) = \arg\max_a \int p(b'|b,a)\left[ r(b,a) + \gamma v^*(b') \right] db'

* The issue with using the belief state is that it lives in a continuous space.

.. important::

    When the state space, the action space, the observation space, and the planning horizon are all finite, an exact solution exists.

.. admonition:: Example
    :class: example

    * We will consider the following problem represented as a POMDP.

    .. figure:: ./img/two_states_pomdp.drawio.png
        :align: center

    |

    * The actions :math:`u_1` and :math:`u_2` give different rewards depending on the state.
    * Only :math:`u_3` gives the same reward in both states: :math:`-1`.
    * The goal of the robot is to use :math:`u_3` to gain information about the true state.

Action selection
----------------

* Before solving the problem we need a way to calculate the reward of a specific action.
* For any belief state :math:`b = (p_1, p_2)`, the expected reward is calculated as:

.. math::

    r(b, u) = \mathbb{E}\left[ r(s, u) \right] = p_1\times r(s_1, u) + p_2\times r(s_2, u)

* which gives us the following expected rewards:

.. figure:: ./img/pomdp_belief_reward.png
    :align: center

|

.. important::

    Even with only two states, calculating the expected reward is already involved.
    We can also see that the expected value function is piecewise linear and convex!

.. admonition:: Activity
    :class: activity

    * Calculate the expected reward for the following belief states and every action:

        * :math:`b_1 = (0.3, 0.7)`
        * :math:`b_2 = (0.5, 0.5)`
        * :math:`b_3 = (0.7, 0.3)`

* Selecting the best action at one time step amounts to taking the maximum of a set of linear functions:

.. math::

    \begin{aligned}
        v_1(b) &= \max_{u}r(b,u)\\
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            100p_1 & -50(1-p_1)\\
            -1
        \end{Bmatrix}
    \end{aligned}

* As only the first two linear functions contribute to the optimal action, we can prune the third one:

.. math::

    v_1(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1)\\
    \end{Bmatrix}

.. note::

    Prunable functions are usually represented by dashed lines.

Sensing
-------

* We now suppose that we receive a measurement :math:`z_1`.
* We can apply Bayes' rule. For :math:`p_1`:

.. math::

    \begin{aligned}
        p_1' &= p(x_1|z_1)\\
        &= \frac{p(z_1|x_1)p(x_1)}{p(z_1)}\\
        &= \frac{0.7p_1}{p(z_1)}
    \end{aligned}

* For :math:`p_2`:

.. math::

    p_2' = \frac{0.3(1-p_1)}{p(z_1)}

* The only issue is the normalizer :math:`p(z_1)`:

.. math::

    p(z_1) = 0.7p_1 + 0.3(1-p_1) = 0.4p_1 + 0.3
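* To make these quantities concrete, here is a minimal Python sketch (the function names and structure are our own; only the rewards and the sensor model :math:`p(z_1|x_1)=0.7` come from the example above) that computes the expected reward :math:`r(b,u)` and the Bayes update of :math:`p_1` after observing :math:`z_1`:

.. code-block:: python

    # Rewards r(s, u) for the two-state example: first entry is s_1, second is s_2.
    REWARDS = {
        "u1": (-100.0, 100.0),
        "u2": (100.0, -50.0),
        "u3": (-1.0, -1.0),
    }

    def expected_reward(p1: float, u: str) -> float:
        """r(b, u) = p_1 * r(s_1, u) + (1 - p_1) * r(s_2, u)."""
        r1, r2 = REWARDS[u]
        return p1 * r1 + (1.0 - p1) * r2

    def belief_update_z1(p1: float) -> float:
        """Bayes update of p_1 after observing z_1, with p(z_1|x_1) = 0.7."""
        pz1 = 0.7 * p1 + 0.3 * (1.0 - p1)   # normalizer p(z_1) = 0.4 p_1 + 0.3
        return 0.7 * p1 / pz1

    # The belief states from the activity above.
    for p1 in (0.3, 0.5, 0.7):
        print(p1, {u: round(expected_reward(p1, u), 1) for u in REWARDS})
    print("p_1 after observing z_1, starting from p_1 = 0.5:", belief_update_z1(0.5))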
* If we use the measurement, our belief becomes a nonlinear function of :math:`p_1`:

.. figure:: ./img/sensing_01.png
    :align: center

|

* Changing the expected value:

.. figure:: ./img/sensing_02.png
    :align: center

|

* Mathematically:

.. math::

    \begin{aligned}
        v_1(b|z_1) &= \max
        \begin{Bmatrix}
            -100p_1' & +100p_2' \\
            100p_1' & -50p_2'\\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100\frac{0.7p_1}{p(z_1)} & +100\frac{0.3(1-p_1)}{p(z_1)} \\
            100\frac{0.7p_1}{p(z_1)} & -50\frac{0.3(1-p_1)}{p(z_1)}\\
        \end{Bmatrix}\\[10pt]
        &= \frac{1}{p(z_1)}\max
        \begin{Bmatrix}
            -70p_1 & +30(1-p_1) \\
            70p_1 & -15(1-p_1)\\
        \end{Bmatrix}\\
    \end{aligned}

* This only considers one measurement.
* We need to calculate the value before sensing over every possible measurement:

.. math::

    \bar{v_1}(b) = \mathbb{E}_z\left[v_1(b|z)\right] = \sum_{i=1}^{N} p(z_i)v_1(b|z_i)

* However, :math:`v_1(b|z_i)` depends on the normalizer :math:`p(z_i)`, and because :math:`v_1` is a maximum of linear functions this factor can be pulled out and cancelled:

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \sum_{i=1}^{N} p(z_i)v_1(b|z_i)\\
        &= \sum_{i=1}^{N} p(z_i)v_1\left(\frac{p(z_i|x_1)p(x_1)}{p(z_i)}\right)\\
        &= \sum_{i=1}^{N} p(z_i)\frac{1}{p(z_i)}v_1\left(p(z_i|x_1)p(x_1)\right)\\
        &= \sum_{i=1}^{N} v_1\left(p(z_i|x_1)p(x_1)\right)\\
    \end{aligned}

.. important::

    The nonlinearity cancels out!

* This allows us to calculate:

.. math::

    p(z_1)v_1(b|z_1) = \max
    \begin{Bmatrix}
        -70p_1 & +30(1-p_1) \\
        70p_1 & -15(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_03.png
    :align: center

|

* Same for :math:`z_2`:

.. math::

    p(z_2)v_1(b|z_2) = \max
    \begin{Bmatrix}
        -30p_1 & +70(1-p_1) \\
        30p_1 & -35(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_04.png
    :align: center

|

* So the value function before sensing becomes:

.. math::

    \bar{v_1}(b) = \max
    \begin{Bmatrix}
        -70p_1 & +30(1-p_1) \\
        70p_1 & -15(1-p_1)\\
    \end{Bmatrix}
    + \max
    \begin{Bmatrix}
        -30p_1 & +70(1-p_1) \\
        30p_1 & -35(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_05.png
    :align: center

|

* However, it is not as simple as it looks.
* We actually need to take the maximum over every sum that combines one linear function from each of the two expressions.
* This leaves us with four possible combinations (a small sketch of this cross-sum follows below):

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \max
        \begin{Bmatrix}
            -70p_1 & +30(1-p_1) & -30p_1 & +70(1-p_1) \\
            -70p_1 & +30(1-p_1) & +30p_1 & -35(1-p_1)\\
            70p_1 & -15(1-p_1) & -30p_1 & +70(1-p_1)\\
            70p_1 & -15(1-p_1) & +30p_1 & -35(1-p_1)
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            -40p_1 & -5(1-p_1) \\
            40p_1 & +55(1-p_1) \\
            100p_1 & -50(1-p_1) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            40p_1 & +55(1-p_1) \\
            100p_1 & -50(1-p_1) \\
        \end{Bmatrix}\\
    \end{aligned}
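* Here is a minimal Python sketch of this measurement backup (the piece representation and the grid-based pruning are our own simplifications; an exact implementation would prune with a linear program). Each linear piece is a pair ``(a, b)`` standing for :math:`a\,p_1 + b\,(1-p_1)`:

.. code-block:: python

    from itertools import product

    Z1_PIECES = [(-70.0, 30.0), (70.0, -15.0)]   # pieces of p(z_1) * v_1(b | z_1)
    Z2_PIECES = [(-30.0, 70.0), (30.0, -35.0)]   # pieces of p(z_2) * v_1(b | z_2)

    def cross_sum(pieces_z1, pieces_z2):
        """One piece per way of pairing a piece from each measurement branch."""
        return [(a1 + a2, b1 + b2)
                for (a1, b1), (a2, b2) in product(pieces_z1, pieces_z2)]

    def prune(pieces, resolution=1000):
        """Keep only the pieces that attain the maximum somewhere on p_1 in [0, 1]."""
        keep = set()
        for i in range(resolution + 1):
            p1 = i / resolution
            keep.add(max(range(len(pieces)),
                         key=lambda k: pieces[k][0] * p1 + pieces[k][1] * (1 - p1)))
        return [pieces[k] for k in sorted(keep)]

    print(prune(cross_sum(Z1_PIECES, Z2_PIECES)))
    # [(-100.0, 100.0), (40.0, 55.0), (100.0, -50.0)]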
Prediction
----------

* Now we need to deal with the state transition.
* Consider that we start in :math:`s_1` with a probability of :math:`p_1 = 1`.
* Then, based on our model:

.. math::

    p_1' = p(s_1'|s_1,u_3) = 0.2

* Similarly, if :math:`p_1 = 0`:

.. math::

    p_1' = p(s_1'|s_2,u_3) = 0.8

* And in between, the expectation is linear:

.. math::

    \begin{aligned}
        p_1' &= \mathbb{E}\left[ p(s_1'|s,u_3) \right]\\
        &= \sum_{i=1}^{N} p(s_1'|s_i,u_3)p_i\\
        &= 0.2p_1 + 0.8(1-p_1)\\
        &= 0.8-0.6p_1
    \end{aligned}

* Which gives us:

.. figure:: ./img/predicting_01.png
    :align: center
    :scale: 80%

|

* If we project this onto our previous value function :math:`\bar{v_1}(b)`:

.. figure:: ./img/predicting_02.png
    :align: center
    :scale: 80%

|

* We obtain:

.. figure:: ./img/predicting_03.png
    :align: center
    :scale: 80%

|

* We can see it flattens, because we lose information.
* Mathematically, it is the projection of :math:`p_1'` through :math:`\bar{v_1}(b)`, giving us:

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \max
        \begin{Bmatrix}
            -100p_1' & +100(1-p_1') \\
            40p_1' & +55(1-p_1') \\
            100p_1' & -50(1-p_1') \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100(0.8-0.6p_1) & +100(1-(0.8-0.6p_1)) \\
            40(0.8-0.6p_1) & +55(1-(0.8-0.6p_1)) \\
            100(0.8-0.6p_1) & -50(1-(0.8-0.6p_1)) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100(0.8-0.6p_1) & +100(0.2+0.6p_1) \\
            40(0.8-0.6p_1) & +55(0.2+0.6p_1) \\
            100(0.8-0.6p_1) & -50(0.2+0.6p_1) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            60p_1 & -60(1-p_1) \\
            52p_1 & +43(1-p_1) \\
            -20p_1 & +70(1-p_1) \\
        \end{Bmatrix}\\
    \end{aligned}

* This gives the value function for step :math:`T=2`.
* We just add :math:`u_1` and :math:`u_2`:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        60p_1 & -60(1-p_1) \\
        52p_1 & +43(1-p_1) \\
        -20p_1 & +70(1-p_1) \\
    \end{Bmatrix}

* We then add the uniform reward of :math:`-1` earned by :math:`u_3` to its pieces:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        59p_1 & -61(1-p_1) \\
        51p_1 & +42(1-p_1) \\
        -21p_1 & +69(1-p_1) \\
    \end{Bmatrix}

.. figure:: ./img/predicting_05.png
    :align: center
    :scale: 17%

|

* We can prune the two functions that are suboptimal:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        51p_1 & +42(1-p_1) \\
    \end{Bmatrix}

.. important::

    We have done a full backup!
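* To close the loop, here is a minimal Python sketch of this prediction step and of the full backup producing :math:`\bar{v_2}(b)` (the helper names and the grid-based pruning are our own; only the transition probabilities :math:`0.2/0.8` and the rewards come from the example):

.. code-block:: python

    V1_BAR = [(-100.0, 100.0), (40.0, 55.0), (100.0, -50.0)]  # pruned pieces of v_1 bar
    R_U3 = -1.0                                               # uniform reward of u_3

    def predict(piece, p_stay=0.2, p_switch=0.8):
        """Project a piece (a, b) through p_1' = p_stay*p_1 + p_switch*(1 - p_1)."""
        a, b = piece
        return (a * p_stay + b * p_switch,    # value of the new piece at p_1 = 1
                a * p_switch + b * p_stay)    # value of the new piece at p_1 = 0

    def prune(pieces, resolution=1000):
        """Keep only the pieces that attain the maximum somewhere on p_1 in [0, 1]."""
        keep = set()
        for i in range(resolution + 1):
            p1 = i / resolution
            keep.add(max(range(len(pieces)),
                         key=lambda k: pieces[k][0] * p1 + pieces[k][1] * (1 - p1)))
        return [pieces[k] for k in sorted(keep)]

    # Pieces obtained by playing u_3: predict, then add the -1 reward ...
    u3_pieces = [(a + R_U3, b + R_U3) for a, b in map(predict, V1_BAR)]
    # ... plus the immediate-reward pieces of u_1 and u_2.
    v2_bar = prune([(-100.0, 100.0), (100.0, -50.0)] + u3_pieces)
    print(v2_bar)   # [(-100.0, 100.0), (100.0, -50.0), (51.0, 42.0)]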