6. Dynamic Programming

  • We have seen in the two previous topics how we can model a problem and how we can determine the best policy.

  • The only thing missing is an algorithm to calculate the optimal policy and value function.

  • There are different approaches:

    • Dynamic programming

    • Monte-Carlo methods

    • Reinforcement learning

  • Dynamic programming was the first approach proposed, and it provides the foundations for the other methods.

Activity

  • What do we need to calculate to obtain the optimal policies?

  • Give the formulas.

6.1. Policy Iteration

  • The algorithm works in two parts:

    • Policy evaluation (prediction)

    • Policy improvement

6.1.1. Policy Evaluation

  • The idea of policy evaluation is to compute the value function \(v_\pi\) for a policy \(\pi\).

  • Remember the equation from the previous topic:

\[v_\pi(s) = \sum_a\pi(a|s)\sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v_\pi(s') \right]\]
  • It is possible to solve this equation with an iterative method.

    • Consider a sequence of approximate value functions \(v_0, v_1, v_2, \dots\).

    • We initialize \(v_0\) with arbitrary values, except the goal state, which is set to \(0\).

    • Each successive approximation is calculated by using the equation above as an update rule:

    \[v_{k+1}(s) = \sum_a\pi(a|s)\sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v_k(s') \right]\]
    • The sequence \(v_k\) produced by this update converges to \(v_\pi\) as \(k \rightarrow \infty\).

  • Of course, at each iteration we need to update \(v_{k+1}(s)\) for every state \(s \in S\).

  • Written as an algorithm, we obtain:

Algorithm

../../_images/policy_evaluation.png
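
  • As an illustration, here is a minimal Python sketch of iterative policy evaluation for a tabular MDP. The encoding is an assumption made for the example: `P[s][a]` is a list of `(probability, next_state)` pairs for \(p(s'|s,a)\), `R[s][a]` is \(r(s,a)\), `policy[s][a]` is \(\pi(a|s)\), and the threshold `theta` is an illustrative stopping tolerance.

```python
import numpy as np

def policy_evaluation(P, R, policy, gamma=0.9, theta=1e-6):
    """Iteratively approximate v_pi for a fixed policy pi.

    Assumed (illustrative) encoding:
      P[s][a]      -> list of (probability, next_state) pairs, i.e. p(s'|s,a)
      R[s][a]      -> immediate reward r(s,a)
      policy[s][a] -> pi(a|s)
    """
    n_states = len(P)
    v = np.zeros(n_states)  # v_0: initialised to zero for every state
    while True:
        v_next = np.zeros(n_states)
        for s in range(n_states):
            # v_{k+1}(s) = sum_a pi(a|s) sum_{s'} p(s'|s,a) [r(s,a) + gamma v_k(s')]
            for a in range(len(P[s])):
                expected = sum(p * (R[s][a] + gamma * v[s2]) for p, s2 in P[s][a])
                v_next[s] += policy[s][a] * expected
        if np.max(np.abs(v_next - v)) < theta:  # stop once a full sweep barely changes v
            return v_next
        v = v_next
```

  • Computing `v_next` from `v` mirrors the synchronous \(v_{k+1}\) update above; updating the values in place also converges and is common in practice.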

Example

  • We will consider the following problem, initial policy, and initial value function.

../../_images/problem.drawio.png

  • Now we estimate the value function using the formula.

../../_images/value_estimation.drawio.png

6.1.2. Policy Improvement

  • Now that we can evaluate a policy, it is possible to find a better one.

  • The idea is simple:

    • We take our value function \(v_\pi\).

    • We choose a state \(s\).

    • And we check whether the policy should be changed to select a different action \(a \neq \pi(s)\) in that state.

  • A way to verify this is to calculate the q-value:

\[q_\pi(s,a) = \sum_{s'}p(s'|s,a)\left[r(s,a) + \gamma v_\pi(s')\right]\]
  • Then we can compare the value obtained with the one given by the current value function, \(v_\pi(s)\).

Policy improvement theorem

  • Let \(\pi\) and \(\pi'\) be any pair of deterministic policies such that, for all \(s\in S\),

\[q_\pi (s,\pi'(s))\geq v_\pi(s)\]
  • Then the policy \(\pi'\) must be as good as, or better than \(\pi\).

  • That is, it must obtain greater or equal expected return from all states \(s\in S\):

\[v_{\pi'}(s) \geq v_\pi(s)\]
  • The intuition is that if we change the policy only in a single state \(s\), choosing an action whose q-value is at least \(v_\pi(s)\), then the changed policy is at least as good.
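
  • Unrolling this argument gives a compact sketch of the standard proof, where \(\mathbb{E}_{\pi'}\) denotes an expectation in which every action from state \(s\) onwards is selected by \(\pi'\):

\[\begin{split}\begin{aligned} v_\pi(s) &\leq q_\pi(s,\pi'(s)) = \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma v_\pi(s_{t+1}) | s_t = s \right]\\ &\leq \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma q_\pi(s_{t+1},\pi'(s_{t+1})) | s_t = s \right]\\ &= \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 v_\pi(s_{t+2}) | s_t = s \right]\\ &\leq \dots \leq \mathbb{E}_{\pi'}\left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots | s_t = s \right] = v_{\pi'}(s) \end{aligned}\end{split}\]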

  • Now, we just need to apply this to all states.

  • Each time we select the action that seems better according to \(q_\pi(s,a)\):

\[\begin{split}\begin{aligned} \pi'(s) &= \arg\max_a q_\pi(s,a)\\ &= \arg\max_a \mathbb{E}\left[ r_{t+1} + \gamma v_\pi(s_{t+1}) | s_t = s, a_t = a \right]\\ &= \arg\max_a \sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v_\pi(s')\right]\\ \end{aligned}\end{split}\]
  • This greedy method satisfies the conditions of the policy improvement theorem.
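
  • As a rough sketch, the q-value computation and the greedy improvement step could look as follows, reusing the illustrative `P`/`R` encoding from the policy evaluation sketch above:

```python
import numpy as np

def q_value(P, R, v, s, a, gamma=0.9):
    """q_pi(s, a) computed from the current estimate v of v_pi."""
    return sum(p * (R[s][a] + gamma * v[s2]) for p, s2 in P[s][a])

def policy_improvement(P, R, v, gamma=0.9):
    """Greedy improvement: pi'(s) = argmax_a q_pi(s, a) for every state."""
    new_policy = []
    for s in range(len(P)):
        qs = [q_value(P, R, v, s, a, gamma) for a in range(len(P[s]))]
        new_policy.append(int(np.argmax(qs)))  # ties broken by the lowest action index
    return new_policy  # new_policy[s] is the greedy (deterministic) action in state s
```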

6.1.3. Policy Iteration Algorithm

  • The policy iteration algorithm combines the two previous steps.

  • It calculates the optimal policy by executing the previous steps multiple times:

    1. Evaluate a policy

    2. Improve the policy

    3. If the policy changed, go to step 1; otherwise stop.

  • A finite MDP has a finite number of deterministic policies, so the algorithm is guaranteed to converge to the optimal policy.

  • The complete algorithm is:

Policy Iteration

../../_images/policy_iteration.png
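
  • Putting the two sketches together, a minimal policy iteration loop could look like this (again under the illustrative `P`/`R` encoding, reusing `policy_evaluation` and `policy_improvement` from above; the deterministic policy is converted to a one-hot \(\pi(a|s)\) for the evaluation step):

```python
def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions = len(P), len(P[0])
    pi = [0] * n_states  # arbitrary initial deterministic policy
    while True:
        # 1. Policy evaluation: represent the deterministic pi as one-hot pi(a|s)
        one_hot = [[1.0 if a == pi[s] else 0.0 for a in range(n_actions)]
                   for s in range(n_states)]
        v = policy_evaluation(P, R, one_hot, gamma)
        # 2. Policy improvement: act greedily with respect to v
        new_pi = policy_improvement(P, R, v, gamma)
        # 3. If the policy changed, evaluate again; otherwise it is optimal
        if new_pi == pi:
            return pi, v
        pi = new_pi
```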

Example

  • Following the previous example, we can apply policy iteration.

../../_images/policy_iteration.drawio.png

Activity

  • Suggest possible drawbacks of this method.

6.2. Value Iteration

  • Considering the drawbacks of policy iteration, we can come up with another algorithm.

  • It is possible to perform only one step of policy evaluation.

  • This algorithm is Value Iteration.

    • It performs a single step of policy evaluation combined with policy improvement.

    • We obtain the following formula:

    \[\begin{split}\begin{aligned} v_{k+1}(s) &= \max_a \mathbb{E}\left[ r_{t+1} + \gamma v_k(s_{t+1}) | s_t = s, a_t = a \right]\\ &= \max_a\sum_{s'}p(s'|s,a)\left[ r(s,a) + \gamma v_k(s') \right]\\ \end{aligned}\end{split}\]
  • It updates the value function, but we lose our stopping criterion: there is no longer an explicit policy whose stability we can check.

  • As it can take an infinite number of steps to converge to the optimal value function, we need to define when to stop.

  • In practice, we fix a small threshold, and we stop when a full sweep changes the value function by less than this threshold.

  • The complete algorithm is:

Value Iteration

../../_images/value_iteration.png
  • This algorithm often converges faster than policy iteration.
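
  • For completeness, here is a minimal sketch of value iteration under the same illustrative `P`/`R` encoding, with a small threshold `theta` as the stopping criterion:

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-6):
    """Repeated one-step lookahead with a max over actions."""
    n_states = len(P)
    v = np.zeros(n_states)
    while True:
        delta = 0.0
        for s in range(n_states):
            # v_{k+1}(s) = max_a sum_{s'} p(s'|s,a) [r(s,a) + gamma v_k(s')]
            best = max(sum(p * (R[s][a] + gamma * v[s2]) for p, s2 in P[s][a])
                       for a in range(len(P[s])))
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place update
        if delta < theta:  # stop once a full sweep changes v by less than theta
            break
    # Extract a greedy policy from the (near-)optimal value function
    policy = [max(range(len(P[s])),
                  key=lambda a: sum(p * (R[s][a] + gamma * v[s2])
                                    for p, s2 in P[s][a]))
              for s in range(n_states)]
    return v, policy
```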