3. Markov Decision Process

  • In this lab, you will define your first MDP.

  • If you don’t remember what an MDP is, check the notes: MDP

3.1. Problem definition

  • A mobile robot has the job of collecting empty soda cans in an environment.

  • It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin.

  • It runs on a rechargeable battery.

  • High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery.

3.1.1. The MDP

  • We assume that only two charge levels can be distinguished:

    • High

    • Low

  • In each state the agent can:

    • Search for cans.

    • Wait for someone to bring a can.

    • Recharge the battery at its home base.

  • The reward:

    • \(0\) most of the time.

    • \(-3\) when the battery is depleted.

    • The reward for searching is \(r_{search}\).

    • The reward for waiting is \(r_{wait}\).

    • \(r_{search} > r_{wait}\).

  • The transition function:

    • A period of searching that begins with a high energy level leaves the energy level high with probability \(\alpha\) and reduces it to low with probability \(1-\alpha\).

    • A period of searching undertaken when the energy level is low leaves it low with probability \(\beta\) and depletes the battery with probability \(1-\beta\).

    • In the latter case, the robot must be rescued, and the battery is then recharged back to high.

    • Waiting does not change the state of the battery.
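
  • As an illustration of what such a model can look like, the search dynamics from the high charge level can be written as below (the notation \(p(s' \mid s, a)\) and \(r(s, a)\) is one common choice, not the only one; the other state–action pairs follow the same pattern):

\[
p(\text{high} \mid \text{high}, \text{search}) = \alpha,
\qquad
p(\text{low} \mid \text{high}, \text{search}) = 1 - \alpha,
\qquad
r(\text{high}, \text{search}) = r_{search}
\]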

Activity

  • Use the previous information to model the problem as an MDP.

3.2. Implementation

Important

  • Move to this step when the previous part is done!

  • The implementation will be done in Python using the Gymnasium library (Gym).

3.2.1. What to do?

  1. You need to create a custom environment representing your MDP (Gym env).

  2. Define the constructor __init__().

  3. Define the action space and observation space in __init__() (a possible sketch is given after this list).

  4. Implement the method gym.Env.step(self, action) (Doc).

  • step() applies the transition function based on the action received.

  • It returns a tuple with 5 elements (observation, reward, terminated, truncated, info).

    • observation is the new state.

    • terminated is a boolean stating whether the episode is over.

    • truncated is a boolean; it is not important for this lab, so always return False.

    • info is a dictionary, and is not important for this lab.
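
  • As a hedged sketch of the constructor and the two spaces (the class name RecyclingRobotEnv, the parameter names, their default values and the integer encoding of states and actions are only one possible choice, not imposed by the lab):

import gymnasium as gym
from gymnasium import spaces


class RecyclingRobotEnv(gym.Env):
  def __init__(self, alpha=0.9, beta=0.4, r_search=1.0, r_wait=0.5):
    # Transition parameters and rewards (default values are arbitrary,
    # chosen so that r_search > r_wait)
    self.alpha = alpha
    self.beta = beta
    self.r_search = r_search
    self.r_wait = r_wait

    # Two observable charge levels: 0 = high, 1 = low
    self.observation_space = spaces.Discrete(2)
    # Three actions: 0 = search, 1 = wait, 2 = recharge
    self.action_space = spaces.Discrete(3)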

Example

def step(self, action):
  # Apply the transition function based on the received action
  # ...

  terminated = ...   # has the episode ended?
  reward = ...       # reward of this transition
  observation = ...  # the new state
  info = {}

  return observation, reward, terminated, False, info
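
  • As one possible way to fill in this skeleton for the robot MDP of section 3.1.1 (it reuses the hypothetical attribute names and the 0/1 state and 0/1/2 action encodings from the constructor sketch above, and treats battery depletion as the end of an episode, which is a modelling choice, not a requirement):

def step(self, action):
  terminated = False

  if action == 0:  # search
    reward = self.r_search
    if self.state == 0:  # high charge
      # stays high with probability alpha, drops to low otherwise
      self.state = 0 if self.np_random.random() < self.alpha else 1
    else:  # low charge
      if self.np_random.random() < self.beta:
        self.state = 1  # stays low
      else:
        # battery depleted: the robot must be rescued, reward is -3,
        # and the battery is recharged back to high
        reward = -3.0
        self.state = 0
        terminated = True
  elif action == 1:  # wait
    reward = self.r_wait  # waiting never changes the charge level
  else:  # recharge
    reward = 0.0
    self.state = 0  # back to high

  observation = self.state
  info = {}

  return observation, reward, terminated, False, info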

  5. Implement the method gym.Env.reset(self) (Doc).

  • reset() is the method that starts a new episode.

  • Moreover, reset() should be called whenever step() has returned terminated as True.

  • It should return the initial state, and can return additional information in info.

Example

def reset(self, seed=None, options=None):
  # We need the following line to seed self.np_random
  super().reset(seed=seed)

  # ...

  observation = ...  # the initial state
  info = {}

  return observation, info
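
  • A matching sketch for the robot MDP, under the assumption (not imposed by the lab) that every episode starts with a fully charged battery, encoded as 0 like in the previous sketches:

def reset(self, seed=None, options=None):
  # We need the following line to seed self.np_random
  super().reset(seed=seed)

  # Start every episode with a high charge level (0 = high)
  self.state = 0

  observation = self.state
  info = {}

  return observation, info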

3.2.2. Checking

  • Once it’s done, you can check that your Gym environment conforms to the documented API:

from gymnasium.utils.env_checker import check_env
check_env(env)

  • Then run a few steps with different actions to make sure everything behaves as expected; one possible quick check is sketched below.
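
  • This snippet assumes env is the environment instance passed to check_env above:

observation, info = env.reset(seed=42)
for _ in range(10):
  action = env.action_space.sample()  # pick a random action
  observation, reward, terminated, truncated, info = env.step(action)
  print(action, observation, reward, terminated)
  if terminated:
    observation, info = env.reset()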

3.2.3. Miscellaneous

  • Ensure that \(r_{search}\) and \(r_{wait}\) are defined in your environment.

  • Ensure that you can define \(\alpha\) and \(\beta\) easily.

  • You may add other helper methods, such as a transition function or a reward function.
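
  • One simple way to satisfy the first two points above is to pass \(r_{search}\), \(r_{wait}\), \(\alpha\) and \(\beta\) to the constructor, as in the hypothetical sketch from 3.2.1, so that they can be chosen when the environment is created:

env = RecyclingRobotEnv(alpha=0.9, beta=0.4, r_search=1.0, r_wait=0.5)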