.. _reinforcement-learning:

Sequential Decision Making
==========================

WhyNot is also an excellent test bed for sequential decision making and
reinforcement learning in diverse dynamic environments. WhyNot offers RL
environments compatible with the OpenAI Gym API, so existing code written for
OpenAI Gym can be adapted to WhyNot with minimal changes.

Using Existing WhyNot Environments
----------------------------------

To list all available environments:

.. code:: python

    import whynot.gym as gym

    for env in gym.envs.registry.all():
        print(env.id)

To create an environment, set the random seed, and get an initial observation:

.. code:: python

    env = gym.make('HIV-v0')
    env.seed(1)
    observation = env.reset()

To sample a random action and perform it, use the ``step`` function. The
``step`` function returns the next observation, the reward, whether the
environment has reached a terminal state, and a dictionary of additional
debugging information.

.. code:: python

    action = env.action_space.sample()
    observation, reward, done, info = env.step(action)

Actions, observations, and rewards in WhyNot Gym environments are represented
as NumPy arrays, so the environments work with algorithms implemented in any
Python numerical computation library, such as PyTorch or TensorFlow. See
`this notebook `_ for an example of training policies on the HIV environment.

Defining a New Custom Environment
---------------------------------

To define a new custom environment on top of a WhyNot simulator, implement:

1. the reward function,
2. a mapping from numerical actions to system interventions, and, optionally,
3. a mapping from state to observation.

The class :class:`~whynot.gym.envs.ODEEnvBuilder` then wraps an arbitrary
dynamical system simulator into a Gym environment for reinforcement learning.
For example, we defined the HIV environment by

.. code:: python

    from whynot.gym import spaces
    from whynot.gym.envs import ODEEnvBuilder
    from whynot.simulators.hiv import Config, Intervention, State
    from whynot.simulators.hiv import simulate

    def reward_fn(intervention, state):
        reward = ...
        return reward

    def intervention_fn(action, time):
        action_to_intervention_map = ...
        return action_to_intervention_map[action]

    HivEnv = ODEEnvBuilder(
        # Specify the dynamical system simulator.
        simulate_fn=simulate,
        # Simulator configuration.
        config=Config(),
        # Initial state of the simulator.
        initial_state=State(),
        # Define the action space.
        action_space=spaces.Discrete(...),
        # Define the observation space.
        observation_space=spaces.Box(...),
        # Convert numerical actions to simulator interventions.
        intervention_fn=intervention_fn,
        # Define the reward function.
        reward_fn=reward_fn,
    )
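The bodies of ``reward_fn`` and ``intervention_fn`` above are deliberately
elided. As a purely illustrative sketch, an environment with four discrete
actions might toggle two treatments on and off and reward states with low
viral load. The treatment parameters ``epsilon_1`` and ``epsilon_2``, the
state attributes ``free_virus`` and ``immune_response``, and the reward
weights below are assumptions made for illustration, not the definitions used
by WhyNot's actual HIV environment:

.. code:: python

    # A hypothetical sketch: the intervention parameters (epsilon_1,
    # epsilon_2), state attributes, and reward weights are illustrative
    # assumptions, not the definitions used by WhyNot's HIV-v0 environment.

    def intervention_fn(action, time):
        """Map each of four discrete actions to a treatment combination."""
        action_to_intervention_map = {
            0: Intervention(time=time, epsilon_1=0.0, epsilon_2=0.0),
            1: Intervention(time=time, epsilon_1=0.7, epsilon_2=0.0),
            2: Intervention(time=time, epsilon_1=0.0, epsilon_2=0.3),
            3: Intervention(time=time, epsilon_1=0.7, epsilon_2=0.3),
        }
        return action_to_intervention_map[action]

    def reward_fn(intervention, state):
        """Reward low viral load and a strong immune response."""
        return -0.1 * state.free_virus + 10.0 * state.immune_response

With this mapping, the action space passed to the builder would be
``spaces.Discrete(4)``.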
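Once built, the environment behaves like any other Gym environment, and the
interaction loop from the previous section applies unchanged. For instance, a
minimal random-policy rollout on the registered ``HIV-v0`` environment:

.. code:: python

    import whynot.gym as gym

    env = gym.make('HIV-v0')
    env.seed(1)

    observation = env.reset()
    done, total_reward = False, 0.0
    while not done:
        # A trained policy would choose the action here.
        action = env.action_space.sample()
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print(f"Return of the random policy: {total_reward:.2f}")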