********************************************
Partially Observable Markov Decision Process
********************************************

* A Markov Decision Process (MDP) is a good framework for control decisions when the measurements are perfect.
* However, most of the time the measurements are not perfect.
* To address the uncertainty in the measurements, a new framework was created: the Partially Observable Markov Decision Process (POMDP).

How does it work?
=================

* In MDPs, we use the value iteration equation:

.. math::

    v^*(s) = \max_a \sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v^*(s') \right]

* However, in POMDPs we do not know what the state :math:`s` is.
* In robotics we use the notion of a belief state :math:`b`.
* If we replace :math:`s` by :math:`b` in the previous equation we obtain:

.. math::

    v^*(b) = \max_a \int p(b'|b,a)\left[ r(b,a) + \gamma v^*(b') \right] db'

* Giving us the control policy:

.. math::

    \pi(b) = \arg\max_a \int p(b'|b,a)\left[ r(b,a) + \gamma v^*(b') \right] db'

* The issue with using the belief state is that it lives in a continuous space.

.. important::

    When the state space, the action space, the observation space, and the planning horizon are all finite, an exact solution exists.

.. admonition:: Example
    :class: example

    * We will consider the following problem represented as a POMDP.

    .. figure:: ./img/two_states_pomdp.drawio.png
        :align: center

    |

    * The actions :math:`u_1` and :math:`u_2` give different rewards depending on the state.
    * Only :math:`u_3` gives the same reward in both states: :math:`-1`.
    * The goal of the robot is to use :math:`u_3` to gain information about the true state.

Action selection
----------------

* Before solving the problem we need a way to calculate the reward of a specific action.
* For any belief state :math:`b = (p_1, p_2)`, the expected reward is calculated as:

.. math::

    r(b, u) = \mathbb{E}\left[ r(s, u) \right] = p_1\times r(s_1, u) + p_2\times r(s_2, u)

* which gives us the following expected rewards:

.. figure:: ./img/pomdp_belief_reward.png
    :align: center

|

.. important::

    Even with only two states, calculating the expected reward is already involved.
    We can also see that the expected value function is piecewise linear and convex!

.. admonition:: Activity
    :class: activity

    * Calculate the expected reward for the following belief states and every action:

        * :math:`b_1 = (0.3, 0.7)`
        * :math:`b_2 = (0.5, 0.5)`
        * :math:`b_3 = (0.7, 0.3)`

* Selecting the best action at one time step amounts to taking the maximum of a set of linear functions:

.. math::

    \begin{aligned}
        v_1(b) &= \max_{u}r(b,u)\\
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            100p_1 & -50(1-p_1)\\
            -1
        \end{Bmatrix}
    \end{aligned}

* As only the first two linear functions contribute to the optimal action, we can prune the third one:

.. math::

    v_1(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1)\\
    \end{Bmatrix}

.. note::

    Prunable functions are usually represented by dashed lines.

Sensing
-------

* We now suppose that we receive a measurement :math:`z_1`.
* We can apply Bayes' rule. For :math:`p_1`:

.. math::

    \begin{aligned}
        p_1' &= p(x_1|z_1)\\
        &= \frac{p(z_1|x_1)p(x_1)}{p(z_1)}\\
        &= \frac{0.7p_1}{p(z_1)}
    \end{aligned}

* For :math:`p_2`:

.. math::

    p_2' = \frac{0.3(1-p_1)}{p(z_1)}

* The only issue is the normalizer :math:`p(z_1)`:

.. math::

    p(z_1) = 0.7p_1 + 0.3(1-p_1) = 0.4p_1 + 0.3
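* To make these quantities concrete, here is a minimal Python sketch (the function names and structure are our own; only the rewards and the sensor model :math:`p(z_1|x_1)=0.7` come from the example above) that computes the expected reward :math:`r(b,u)` and the Bayes update of :math:`p_1` after observing :math:`z_1`:

.. code-block:: python

    # Rewards r(s, u) for the two-state example: first entry is s_1, second is s_2.
    REWARDS = {
        "u1": (-100.0, 100.0),
        "u2": (100.0, -50.0),
        "u3": (-1.0, -1.0),
    }

    def expected_reward(p1: float, u: str) -> float:
        """r(b, u) = p_1 * r(s_1, u) + (1 - p_1) * r(s_2, u)."""
        r1, r2 = REWARDS[u]
        return p1 * r1 + (1.0 - p1) * r2

    def belief_update_z1(p1: float) -> float:
        """Bayes update of p_1 after observing z_1, with p(z_1|x_1) = 0.7."""
        pz1 = 0.7 * p1 + 0.3 * (1.0 - p1)   # normalizer p(z_1) = 0.4 p_1 + 0.3
        return 0.7 * p1 / pz1

    # The belief states from the activity above.
    for p1 in (0.3, 0.5, 0.7):
        print(p1, {u: round(expected_reward(p1, u), 1) for u in REWARDS})
    print("p_1 after observing z_1, starting from p_1 = 0.5:", belief_update_z1(0.5))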
* If we use the measurement, our belief becomes a nonlinear function of :math:`p_1`:

.. figure:: ./img/sensing_01.png
    :align: center

|

* Changing the expected value:

.. figure:: ./img/sensing_02.png
    :align: center

|

* Mathematically:

.. math::

    \begin{aligned}
        v_1(b|z_1) &= \max
        \begin{Bmatrix}
            -100p_1' & +100p_2' \\
            100p_1' & -50p_2'\\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100\frac{0.7p_1}{p(z_1)} & +100\frac{0.3(1-p_1)}{p(z_1)} \\
            100\frac{0.7p_1}{p(z_1)} & -50\frac{0.3(1-p_1)}{p(z_1)}\\
        \end{Bmatrix}\\[10pt]
        &= \frac{1}{p(z_1)}\max
        \begin{Bmatrix}
            -70p_1 & +30(1-p_1) \\
            70p_1 & -15(1-p_1)\\
        \end{Bmatrix}\\
    \end{aligned}

* This only considers one measurement.
* We need to calculate the value before sensing over every possible measurement:

.. math::

    \bar{v_1}(b) = \mathbb{E}_z\left[v_1(b|z)\right] = \sum_{i=1}^{N} p(z_i)v_1(b|z_i)

* However, :math:`v_1(b|z_i)` depends on the normalizer :math:`p(z_i)`, and because :math:`v_1` is a maximum of linear functions this factor can be pulled out and cancelled:

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \sum_{i=1}^{N} p(z_i)v_1(b|z_i)\\
        &= \sum_{i=1}^{N} p(z_i)v_1\left(\frac{p(z_i|x_1)p(x_1)}{p(z_i)}\right)\\
        &= \sum_{i=1}^{N} p(z_i)\frac{1}{p(z_i)}v_1\left(p(z_i|x_1)p(x_1)\right)\\
        &= \sum_{i=1}^{N} v_1\left(p(z_i|x_1)p(x_1)\right)\\
    \end{aligned}

.. important::

    The nonlinearity cancels out!

* This allows us to calculate:

.. math::

    p(z_1)v_1(b|z_1) = \max
    \begin{Bmatrix}
        -70p_1 & +30(1-p_1) \\
        70p_1 & -15(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_03.png
    :align: center

|

* Same for :math:`z_2`:

.. math::

    p(z_2)v_1(b|z_2) = \max
    \begin{Bmatrix}
        -30p_1 & +70(1-p_1) \\
        30p_1 & -35(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_04.png
    :align: center

|

* So the value function before sensing becomes:

.. math::

    \bar{v_1}(b) = \max
    \begin{Bmatrix}
        -70p_1 & +30(1-p_1) \\
        70p_1 & -15(1-p_1)\\
    \end{Bmatrix}
    + \max
    \begin{Bmatrix}
        -30p_1 & +70(1-p_1) \\
        30p_1 & -35(1-p_1)\\
    \end{Bmatrix}

.. figure:: ./img/sensing_05.png
    :align: center

|

* However, it is not as simple as it looks.
* We actually need to take the maximum over every sum that combines one linear function from each of the two expressions.
* This leaves us with four possible combinations (a small sketch of this cross-sum follows below):

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \max
        \begin{Bmatrix}
            -70p_1 & +30(1-p_1) & -30p_1 & +70(1-p_1) \\
            -70p_1 & +30(1-p_1) & +30p_1 & -35(1-p_1)\\
            70p_1 & -15(1-p_1) & -30p_1 & +70(1-p_1)\\
            70p_1 & -15(1-p_1) & +30p_1 & -35(1-p_1)
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            -40p_1 & -5(1-p_1) \\
            40p_1 & +55(1-p_1) \\
            100p_1 & -50(1-p_1) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100p_1 & +100(1-p_1) \\
            40p_1 & +55(1-p_1) \\
            100p_1 & -50(1-p_1) \\
        \end{Bmatrix}\\
    \end{aligned}
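* Here is a minimal Python sketch of this measurement backup (the piece representation and the grid-based pruning are our own simplifications; an exact implementation would prune with a linear program). Each linear piece is a pair ``(a, b)`` standing for :math:`a\,p_1 + b\,(1-p_1)`:

.. code-block:: python

    from itertools import product

    Z1_PIECES = [(-70.0, 30.0), (70.0, -15.0)]   # pieces of p(z_1) * v_1(b | z_1)
    Z2_PIECES = [(-30.0, 70.0), (30.0, -35.0)]   # pieces of p(z_2) * v_1(b | z_2)

    def cross_sum(pieces_z1, pieces_z2):
        """One piece per way of pairing a piece from each measurement branch."""
        return [(a1 + a2, b1 + b2)
                for (a1, b1), (a2, b2) in product(pieces_z1, pieces_z2)]

    def prune(pieces, resolution=1000):
        """Keep only the pieces that attain the maximum somewhere on p_1 in [0, 1]."""
        keep = set()
        for i in range(resolution + 1):
            p1 = i / resolution
            keep.add(max(range(len(pieces)),
                         key=lambda k: pieces[k][0] * p1 + pieces[k][1] * (1 - p1)))
        return [pieces[k] for k in sorted(keep)]

    print(prune(cross_sum(Z1_PIECES, Z2_PIECES)))
    # [(-100.0, 100.0), (40.0, 55.0), (100.0, -50.0)]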
Prediction
----------

* Now we need to deal with the state transition.
* Consider that we start in :math:`s_1` with a probability of :math:`p_1 = 1`.
* Then, based on our model:

.. math::

    p_1' = p(s_1'|s_1,u_3) = 0.2

* Similarly, if :math:`p_1 = 0`:

.. math::

    p_1' = p(s_1'|s_2,u_3) = 0.8

* And in between, the expectation is linear:

.. math::

    \begin{aligned}
        p_1' &= \mathbb{E}\left[ p(s_1'|s,u_3) \right]\\
        &= \sum_{i=1}^{N} p(s_1'|s_i,u_3)p_i\\
        &= 0.2p_1 + 0.8(1-p_1)\\
        &= 0.8-0.6p_1
    \end{aligned}

* Which gives us:

.. figure:: ./img/predicting_01.png
    :align: center
    :scale: 80%

|

* If we project this onto our previous value function :math:`\bar{v_1}(b)`:

.. figure:: ./img/predicting_02.png
    :align: center
    :scale: 80%

|

* We obtain:

.. figure:: ./img/predicting_03.png
    :align: center
    :scale: 80%

|

* We can see it flattens, because we lose information.
* Mathematically, it is the projection of :math:`p_1'` through :math:`\bar{v_1}(b)`, giving us:

.. math::

    \begin{aligned}
        \bar{v_1}(b) &= \max
        \begin{Bmatrix}
            -100p_1' & +100(1-p_1') \\
            40p_1' & +55(1-p_1') \\
            100p_1' & -50(1-p_1') \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100(0.8-0.6p_1) & +100(1-(0.8-0.6p_1)) \\
            40(0.8-0.6p_1) & +55(1-(0.8-0.6p_1)) \\
            100(0.8-0.6p_1) & -50(1-(0.8-0.6p_1)) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            -100(0.8-0.6p_1) & +100(0.2+0.6p_1) \\
            40(0.8-0.6p_1) & +55(0.2+0.6p_1) \\
            100(0.8-0.6p_1) & -50(0.2+0.6p_1) \\
        \end{Bmatrix}\\[10pt]
        &= \max
        \begin{Bmatrix}
            60p_1 & -60(1-p_1) \\
            52p_1 & +43(1-p_1) \\
            -20p_1 & +70(1-p_1) \\
        \end{Bmatrix}\\
    \end{aligned}

* This gives the value function for step :math:`T=2`.
* We just add :math:`u_1` and :math:`u_2`:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        60p_1 & -60(1-p_1) \\
        52p_1 & +43(1-p_1) \\
        -20p_1 & +70(1-p_1) \\
    \end{Bmatrix}

* We then add the uniform reward of :math:`-1` earned by :math:`u_3` to its pieces:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        59p_1 & -61(1-p_1) \\
        51p_1 & +42(1-p_1) \\
        -21p_1 & +69(1-p_1) \\
    \end{Bmatrix}

.. figure:: ./img/predicting_05.png
    :align: center
    :scale: 17%

|

* We can prune the two functions that are suboptimal:

.. math::

    \bar{v_2}(b) = \max
    \begin{Bmatrix}
        -100p_1 & +100(1-p_1) \\
        100p_1 & -50(1-p_1) \\
        51p_1 & +42(1-p_1) \\
    \end{Bmatrix}

.. important::

    We have done a full backup!
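* To close the loop, here is a minimal Python sketch of this prediction step and of the full backup producing :math:`\bar{v_2}(b)` (the helper names and the grid-based pruning are our own; only the transition probabilities :math:`0.2/0.8` and the rewards come from the example):

.. code-block:: python

    V1_BAR = [(-100.0, 100.0), (40.0, 55.0), (100.0, -50.0)]  # pruned pieces of v_1 bar
    R_U3 = -1.0                                               # uniform reward of u_3

    def predict(piece, p_stay=0.2, p_switch=0.8):
        """Project a piece (a, b) through p_1' = p_stay*p_1 + p_switch*(1 - p_1)."""
        a, b = piece
        return (a * p_stay + b * p_switch,    # value of the new piece at p_1 = 1
                a * p_switch + b * p_stay)    # value of the new piece at p_1 = 0

    def prune(pieces, resolution=1000):
        """Keep only the pieces that attain the maximum somewhere on p_1 in [0, 1]."""
        keep = set()
        for i in range(resolution + 1):
            p1 = i / resolution
            keep.add(max(range(len(pieces)),
                         key=lambda k: pieces[k][0] * p1 + pieces[k][1] * (1 - p1)))
        return [pieces[k] for k in sorted(keep)]

    # Pieces obtained by playing u_3: predict, then add the -1 reward ...
    u3_pieces = [(a + R_U3, b + R_U3) for a, b in map(predict, V1_BAR)]
    # ... plus the immediate-reward pieces of u_1 and u_2.
    v2_bar = prune([(-100.0, 100.0), (100.0, -50.0)] + u3_pieces)
    print(v2_bar)   # [(-100.0, 100.0), (100.0, -50.0), (51.0, 42.0)]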