3. Markov Decision Process
In this lab, you will define your first MDP.
If you don’t remember what an MDP is, check the notes: MDP
3.1. Problem definition
A mobile robot has the job of collecting empty soda cans in an environment.
It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin.
It runs on a rechargeable battery.
High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery.
3.1.1. The MDP
We assume that only two charge levels can be distinguished:
High
Low
In each state the agent can:
Search for cans.
Wait for someone to bring a can.
Recharge the battery at its home base.
The reward:
\(0\) most of the time.
\(-3\) when the battery is depleted.
The reward for searching is \(r_{search}\).
The reward for waiting is \(r_{wait}\).
\(r_{search} > r_{wait}\).
The transition function:
A period of searching that begins with a high energy level leaves the energy level high with probability \(\alpha\) and reduces it to low with probability \(1-\alpha\).
A period of searching undertaken when the energy level is low leaves it low with probability \(\beta\) and depletes the battery with probability \(1-\beta\).
In the latter case, the robot must be rescued, and the battery is then recharged back to high.
Waiting does not change the state of the battery.
Activity
Use the previous information to model the problem as an MDP.
3.2. Implementation
Important
Move to this step when the previous part is done!
The implementation will be done in Python using the Gymnasium library (Gym).
3.2.1. What to do?
You need to create a custom environment representing your MDP (Gym env).
Define the constructor `__init__()`.
Define the action space and observation space in `__init__()`.
Implement the method `gym.Env.step(self, action)` (Doc).
`step()` applies the transition function based on the action received. It returns a tuple with 5 elements: `(observation, reward, terminated, truncated, info)`.
`observation` is the new state.
`terminated` is a boolean stating whether the episode is over.
`truncated` is a boolean; it is not important for this lab, so always return `False`.
`info` is a dictionary; it is not important for this lab.
Example

```python
def step(self, action):
    # The transition function
    # ...
    terminated = ...   # to be filled in
    reward = ...       # to be filled in
    observation = ...  # to be filled in
    info = {}
    return observation, reward, terminated, False, info
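Inside `step()`, the stochastic part of the transition function boils down to drawing one outcome according to its probability. A stdlib-only sketch of that sampling is below; the helper name and the `(probability, next_state, reward)` tuple layout are assumptions, and inside a `gym.Env` you would draw from `self.np_random` (seeded by `reset()`) rather than a local `random.Random`.

```python
import random

# Hypothetical helper: draw one outcome from a list of
# (probability, next_state, reward) tuples.
def sample_outcome(outcomes, rng):
    probs = [p for p, _, _ in outcomes]
    # random.choices picks one index according to the given weights
    (idx,) = rng.choices(range(len(outcomes)), weights=probs)
    _, next_state, reward = outcomes[idx]
    return next_state, reward

rng = random.Random(0)
next_state, reward = sample_outcome(
    [(0.9, "high", 2.0), (0.1, "low", 2.0)], rng
)
```

With a helper like this, `step()` reduces to a lookup of the current `(state, action)` pair followed by one sample.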
Implement the method `gym.Env.reset(self, seed=None, options=None)` (Doc).
`reset()` is the method that starts a new episode. Moreover, `reset()` should be called whenever `terminated` has been `True`. It should return the initial state, and can return additional information in `info`.
Example

```python
def reset(self, seed=None, options=None):
    # We need the following line to seed self.np_random
    super().reset(seed=seed)
    # ...
    observation = ...  # to be filled in
    info = {}
    return observation, info
```
3.2.2. Checking
Once it’s done, you can check that your Gym environment conforms to the documentation:
```python
from gymnasium.utils.env_checker import check_env

check_env(env)
```
Then run a few steps with different actions to make sure it’s working correctly.
3.2.3. Miscellaneous
Ensure that \(r_{search}\) and \(r_{wait}\) are defined in your environment.
Ensure that you can define \(\alpha\) and \(\beta\) easily.
You can have other methods to help you, such as the transition function or the reward function.
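One way to satisfy both points is to expose the four quantities as constructor arguments with defaults. The class name, parameter names, and default values below are hypothetical, not imposed by the lab; in your solution this class would subclass `gym.Env`.

```python
# Hypothetical constructor; in the lab this class would subclass gym.Env.
class RecyclingRobotEnv:
    def __init__(self, alpha=0.9, beta=0.4, r_search=2.0, r_wait=1.0):
        assert r_search > r_wait  # required by the problem definition
        self.alpha = alpha        # P(stay high | search from high)
        self.beta = beta          # P(stay low | search from low)
        self.r_search = r_search
        self.r_wait = r_wait

# Different values of alpha and beta can then be tried without editing the class.
env = RecyclingRobotEnv(alpha=0.5, beta=0.5)
```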