3. Markov Decision Process
In this lab, you will define your first MDP.
If you don’t remember what an MDP is, check the notes: MDP
3.1. Problem definition
A mobile robot has the job of collecting empty soda cans in an environment.
It has sensors for detecting cans, and an arm and gripper that can pick them up and place them in an onboard bin.
It runs on a rechargeable battery.
High-level decisions about how to search for cans are made by a reinforcement learning agent based on the current charge level of the battery.
3.1.1. The MDP
We assume that only two charge levels can be distinguished:
High
Low
In each state the agent can:
Search for cans.
Wait for someone to bring a can.
Recharge the battery at its home base.
The reward:
\(0\) most of the time.
\(-3\) when the battery is depleted.
The reward for searching is \(r_{search}\).
The reward for waiting is \(r_{wait}\).
\(r_{search} > r_{wait}\).
The transition function:
A period of searching that begins with a high energy level leaves the energy level high with probability \(\alpha\) and reduces it to low with probability \(1-\alpha\).
A period of searching undertaken when the energy level is low leaves it low with probability \(\beta\) and depletes the battery with probability \(1-\beta\).
In the latter case, the robot must be rescued, and the battery is then recharged back to high.
Waiting does not change the state of the battery.
Activity
Use the previous information to model the problem as an MDP.
3.2. Implementation
Important
Move to this step when the previous part is done!
The implementation will be done in Python using the Gymnasium library (Gym).
3.2.1. What to do?
You need to create a custom environment representing your MDP (Gym env).
Define the constructor `__init__()`.
Define the action space and observation space in `__init__()`.
Implement the method `gym.Env.step(self, action)` (Doc).
`step()` applies the transition function based on the action received. It returns a tuple with 5 elements: `(observation, reward, terminated, truncated, info)`.
`observation` is the new state.
`terminated` is a boolean stating whether the episode is over.
`truncated` is a boolean; it is not important for this lab, so always return `False`.
`info` is a dictionary; it is not important for this lab.
Example

```python
def step(self, action):
    # The transition function
    # ...
    terminated = ...   # to be filled in
    reward = ...       # to be filled in
    observation = ...  # to be filled in
    info = {}
    return observation, reward, terminated, False, info
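Inside `step()`, the stochastic part of the transition function boils down to drawing one outcome according to its probability. A stdlib-only sketch of that sampling is below; the helper name and the `(probability, next_state, reward)` tuple layout are assumptions, and inside a `gym.Env` you would draw from `self.np_random` (seeded by `reset()`) rather than a local `random.Random`.

```python
import random

# Hypothetical helper: draw one outcome from a list of
# (probability, next_state, reward) tuples.
def sample_outcome(outcomes, rng):
    probs = [p for p, _, _ in outcomes]
    # random.choices picks one index according to the given weights
    (idx,) = rng.choices(range(len(outcomes)), weights=probs)
    _, next_state, reward = outcomes[idx]
    return next_state, reward

rng = random.Random(0)
next_state, reward = sample_outcome(
    [(0.9, "high", 2.0), (0.1, "low", 2.0)], rng
)
```

With a helper like this, `step()` reduces to a lookup of the current `(state, action)` pair followed by one sample.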
Implement the method `gym.Env.reset(self, seed=None, options=None)` (Doc).
`reset()` is the method that starts a new episode. Moreover, `reset()` should be called whenever `terminated` has been `True`. It should return the initial state, and can return additional information in `info`.
Example

```python
def reset(self, seed=None, options=None):
    # We need the following line to seed self.np_random
    super().reset(seed=seed)
    # ...
    observation = ...  # to be filled in
    info = {}
    return observation, info
```
3.2.2. Checking
Once it’s done, you can check that your Gym environment conforms to the documentation:
```python
from gymnasium.utils.env_checker import check_env

check_env(env)
```
Then run a few steps with different actions to make sure it’s working correctly.
3.2.3. Miscellaneous
Ensure that \(r_{search}\) and \(r_{wait}\) are defined in your environment.
Ensure that you can define \(\alpha\) and \(\beta\) easily.
You can have other methods to help you, such as the transition function or the reward function.
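One way to satisfy both points is to expose the four quantities as constructor arguments with defaults. The class name, parameter names, and default values below are hypothetical, not imposed by the lab; in your solution this class would subclass `gym.Env`.

```python
# Hypothetical constructor; in the lab this class would subclass gym.Env.
class RecyclingRobotEnv:
    def __init__(self, alpha=0.9, beta=0.4, r_search=2.0, r_wait=1.0):
        assert r_search > r_wait  # required by the problem definition
        self.alpha = alpha        # P(stay high | search from high)
        self.beta = beta          # P(stay low | search from low)
        self.r_search = r_search
        self.r_wait = r_wait

# Different values of alpha and beta can then be tried without editing the class.
env = RecyclingRobotEnv(alpha=0.5, beta=0.5)
```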