A fork of OpenAI Baselines, implementations of reinforcement learning algorithms

Overview


Stable Baselines

Stable Baselines is a set of improved implementations of reinforcement learning algorithms based on OpenAI Baselines.

You can read a detailed presentation of Stable Baselines in the Medium article.

These algorithms will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

Note: despite its simplicity of use, Stable Baselines (SB) assumes you have some knowledge about Reinforcement Learning (RL). You should not use this library without some prior practice. To that end, we provide good resources in the documentation to get started with RL.

Main differences with OpenAI Baselines

This toolset is a fork of OpenAI Baselines, with major structural refactoring and code cleanups:

  • Unified structure for all algorithms
  • PEP8 compliant (unified code style)
  • Documented functions and classes
  • More tests & more code coverage
  • Additional algorithms: SAC and TD3 (+ HER support for DQN, DDPG, SAC and TD3)
| Features | Stable-Baselines | OpenAI Baselines |
| --- | --- | --- |
| State of the art RL methods | ✔️ (1) | ✔️ |
| Documentation | ✔️ | ❌ |
| Custom environments | ✔️ | ✔️ |
| Custom policies | ✔️ | (2) |
| Common interface | ✔️ | (3) |
| Tensorboard support | ✔️ | (4) |
| Ipython / Notebook friendly | ✔️ | ❌ |
| PEP8 code style | ✔️ | ✔️ (5) |
| Custom callback | ✔️ | (6) |

(1): Forked from a previous version of OpenAI Baselines, now with SAC and TD3 in addition
(2): Currently not available for DDPG, and only from the run script.
(3): Only via the run script.
(4): Rudimentary logging of training information (no loss nor graph).
(5): EDIT: you did it OpenAI! 🐱
(6): Passing a callback function is only available for DQN
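
As an illustration of the custom callback row above, here is a minimal sketch using the Stable-Baselines 2.x functional callback (a callable receiving the local and global variables; returning False stops training):

from stable_baselines import PPO2

def stop_callback(locals_, globals_):
    # Called during training; return False to stop early.
    return locals_["self"].num_timesteps < 5000

model = PPO2('MlpPolicy', 'CartPole-v1', verbose=0)
model.learn(total_timesteps=100000, callback=stop_callback)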

Documentation

Documentation is available online: https://stable-baselines.readthedocs.io/

RL Baselines Zoo: A Collection of 100+ Trained RL Agents

RL Baselines Zoo is a collection of pre-trained Reinforcement Learning agents using Stable-Baselines.

It also provides basic scripts for training, evaluating agents, tuning hyperparameters and recording videos.

Goals of this repository:

  1. Provide a simple interface to train and enjoy RL agents
  2. Benchmark the different Reinforcement Learning algorithms
  3. Provide tuned hyperparameters for each environment and RL algorithm
  4. Have fun with the trained agents!

Github repo: https://github.com/araffin/rl-baselines-zoo

Documentation: https://stable-baselines.readthedocs.io/en/master/guide/rl_zoo.html

Installation

Note: Stable-Baselines supports Tensorflow versions from 1.8.0 to 1.14.0. Support for Tensorflow 2 API is planned.

Prerequisites

Baselines requires python3 (>=3.5) with the development headers. You'll also need the system packages CMake, OpenMPI and zlib. They can be installed as follows:

Ubuntu

sudo apt-get update && sudo apt-get install cmake libopenmpi-dev python3-dev zlib1g-dev

Mac OS X

Installation of system packages on Mac requires Homebrew. With Homebrew installed, run the following:

brew install cmake openmpi

Windows 10

To install stable-baselines on Windows, please look at the documentation.

Install using pip

Install the Stable Baselines package:

pip install stable-baselines[mpi]

This includes an optional dependency on MPI, enabling algorithms DDPG, GAIL, PPO1 and TRPO. If you do not need these algorithms, you can install without MPI:

pip install stable-baselines

Please read the documentation for more details and alternatives (from source, using docker).
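
To quickly check that the installation worked, you can import the package and an MPI-free algorithm (a minimal sanity check, nothing more):

import stable_baselines
from stable_baselines import PPO2  # available with or without the [mpi] extra

print(stable_baselines.__version__)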

Example

Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.

Here is a quick example of how to train and run PPO2 on a cartpole environment:

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import PPO2

env = gym.make('CartPole-v1')
# Optional: PPO2 requires a vectorized environment to run
# the env is now wrapped automatically when passing it to the constructor
# env = DummyVecEnv([lambda: env])

model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

env.close()

Or just train a model with a one-liner if the environment is registered in Gym and the policy is registered:

from stable_baselines import PPO2

model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)

Please read the documentation for more examples.
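
Saving and reloading a trained agent follows the same spirit (a short sketch; the file name is arbitrary):

from stable_baselines import PPO2

model = PPO2('MlpPolicy', 'CartPole-v1').learn(10000)
model.save("ppo2_cartpole")        # stored as ppo2_cartpole.zip (zip-archive format since v2.8)
del model                          # the agent can be reloaded later, without retraining
model = PPO2.load("ppo2_cartpole")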

Try it online with Colab Notebooks!

All the following examples can be executed online using Google Colab notebooks.

Implemented Algorithms

| Name | Refactored (1) | Recurrent | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing |
| --- | --- | --- | --- | --- | --- | --- | --- |
| A2C | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| ACER | ✔️ | ✔️ | ❌ (5) | ✔️ | ❌ | ❌ | ✔️ |
| ACKTR | ✔️ | ✔️ | ✔️ | ✔️ | ❌ | ❌ | ✔️ |
| DDPG | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ✔️ (4) |
| DQN | ✔️ | ❌ | ❌ | ✔️ | ❌ | ❌ | ❌ |
| GAIL (2) | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ❌ | ✔️ (4) |
| HER (3) | ✔️ | ❌ | ✔️ | ✔️ | ❌ | ✔️ | ❌ |
| PPO1 | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ (4) |
| PPO2 | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ |
| SAC | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| TD3 | ✔️ | ❌ | ✔️ | ❌ | ❌ | ❌ | ❌ |
| TRPO | ✔️ | ❌ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ (4) |

(1): Whether or not the algorithm has been refactored to fit the BaseRLModel class.
(2): Only implemented for TRPO.
(3): Re-implemented from scratch; now supports DQN, DDPG, SAC and TD3.
(4): Multi Processing with MPI.
(5): TODO, in project scope.

NOTE: Soft Actor-Critic (SAC) and Twin Delayed DDPG (TD3) were not part of the original baselines and HER was reimplemented from scratch.

Actions gym.spaces (see the short example after this list):

  • Box: An N-dimensional box that contains every point in the action space.
  • Discrete: A list of possible actions, where only one of the actions can be used at each timestep.
  • MultiDiscrete: A list of possible actions, where at each timestep only one action of each discrete set can be used.
  • MultiBinary: A list of possible actions, where at each timestep any of the actions can be used in any combination.
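
A quick illustration of these action space types using Gym directly (a minimal sketch; the shapes and sizes are arbitrary):

import numpy as np
from gym import spaces

box = spaces.Box(low=-1.0, high=1.0, shape=(2,), dtype=np.float32)  # e.g. two torques in [-1, 1]
discrete = spaces.Discrete(3)                      # exactly one of 3 actions per timestep
multi_discrete = spaces.MultiDiscrete([5, 3, 2])   # one choice per discrete set
multi_binary = spaces.MultiBinary(4)               # any combination of 4 binary switches

print(box.sample(), discrete.sample(), multi_discrete.sample(), multi_binary.sample())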

MuJoCo

Some of the Baselines examples use the MuJoCo (multi-joint dynamics with contact) physics simulator, which is proprietary and requires binaries and a license (a temporary 30-day license can be obtained from www.mujoco.org). Instructions on setting up MuJoCo can be found here.

Testing the installation

All unit tests in baselines can be run using the pytest runner:

pip install pytest pytest-cov
make pytest

Projects Using Stable-Baselines

We try to maintain a list of projects using Stable-Baselines in the documentation. Please tell us if you want your project to appear on this page ;)

Citing the Project

To cite this repository in publications:

@misc{stable-baselines,
  author = {Hill, Ashley and Raffin, Antonin and Ernestus, Maximilian and Gleave, Adam and Kanervisto, Anssi and Traore, Rene and Dhariwal, Prafulla and Hesse, Christopher and Klimov, Oleg and Nichol, Alex and Plappert, Matthias and Radford, Alec and Schulman, John and Sidor, Szymon and Wu, Yuhuai},
  title = {Stable Baselines},
  year = {2018},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/hill-a/stable-baselines}},
}

Maintainers

Stable-Baselines is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian Ernestus (aka @erniejunior), Adam Gleave (@AdamGleave) and Anssi Kanervisto (@Miffyli).

Important Note: We do not provide technical support or consulting, and we do not answer personal questions by email.

How To Contribute

To anyone interested in making the baselines better: there is still some documentation that needs to be done. If you want to contribute, please read the CONTRIBUTING.md guide first.

Acknowledgments

Stable Baselines was created in the robotics lab U2IS (INRIA Flowers team) at ENSTA ParisTech.

Logo credits: L.M. Tenkes

Comments
  • Invalid Action Mask [WIP]

    Invalid Action Mask [WIP]

    This is about a month overdue, I'll go through some lines below and add comments.

    Right now, a number of tests don't pass, but this is per @araffin's request to do a draft PR.

    closes #351

    experimental 
    opened by H-Park 150
  • V3.0 implementation design

    V3.0 implementation design

    Version3 is now online: https://github.com/DLR-RM/stable-baselines3

    Hello,

    Before starting the migration to tf2 for stable baselines v3, I would like to discuss some design points we should agree on.

    Which tf paradigm should we use?

    I would go for a pytorch-like "eager mode", wrapping the methods with tf.function to improve performance (as it is done here). Define-by-run is usually easier to read and debug (and I can compare it to my internal pytorch version). Wrapping it up with tf.function should preserve performance.
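
    For illustration, a minimal sketch of that pattern in plain TF2 (not Stable-Baselines code; the tiny linear "policy" and loss are placeholders):

    import tensorflow as tf  # assumes TF2

    class TinyPolicy(tf.Module):
        def __init__(self):
            self.w = tf.Variable(tf.random.normal([4, 2]))

        @tf.function  # traced once, then executed as a graph, which should preserve performance
        def train_step(self, obs, actions, optimizer):
            with tf.GradientTape() as tape:
                logits = tf.matmul(obs, self.w)
                loss = tf.reduce_mean(
                    tf.nn.sparse_softmax_cross_entropy_with_logits(labels=actions, logits=logits))
            grads = tape.gradient(loss, [self.w])
            optimizer.apply_gradients(zip(grads, [self.w]))
            return loss

    policy = TinyPolicy()
    optimizer = tf.keras.optimizers.Adam(1e-3)
    obs = tf.random.normal([32, 4])
    actions = tf.random.uniform([32], maxval=2, dtype=tf.int32)
    print(policy.train_step(obs, actions, optimizer))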

    What is the roadmap?

    My idea would be:

    1. Refactor common folder (as done by @Miffyli in #540 )
    2. Implement one on-policy algorithm and one off-policy algorithm: I would go for PPO/TD3, and I can be in charge of that. This would allow us to discuss concrete implementation details.
    3. Implement the rest, in order:
    • SAC
    • A2C
    • DQN
    • DDPG
    • HER
    • TRPO
    4. Implement the recurrent versions?

    I'm afraid that the remaining ones (ACKTR, GAIL and ACER) are not the easiest ones to implement. And for GAIL, we can refer to https://github.com/HumanCompatibleAI/imitation by @AdamGleave et al.

    Are there other breaking changes we should make? Changes in the interface?

    Some answers to these questions are linked here: https://github.com/hill-a/stable-baselines/issues/366

    There are different things that I would like to change/add.

    First, it would be adding evaluation to the training loop. That is to say, we allow the user to pass an eval_env on which the agent will be evaluated every eval_freq steps for n_eval_episodes. This is a true measure of the agent's performance compared to the training reward.
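
    Roughly the idea, done manually with the existing evaluate_policy helper (a sketch only; in the proposal this would happen inside learn()):

    import gym
    from stable_baselines import PPO2
    from stable_baselines.common.evaluation import evaluate_policy

    eval_env = gym.make('CartPole-v1')  # held-out env, separate from the training env
    model = PPO2('MlpPolicy', 'CartPole-v1')
    eval_freq, n_eval_episodes = 10000, 5

    for _ in range(5):
        model.learn(total_timesteps=eval_freq, reset_num_timesteps=False)
        mean_reward, std_reward = evaluate_policy(model, eval_env, n_eval_episodes=n_eval_episodes)
        print("eval mean reward: {:.1f} +/- {:.1f}".format(mean_reward, std_reward))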

    I would like to manipulate only VecEnv in the algorithms (and wrap the gym.Env automatically if necessary); this simplifies things (so we don't have to think about the type of the env). Currently, we are using an UnVecEnvWrapper which makes things complicated for DQN, for instance.

    Should we maintain MPI support? I would favor switching to VecEnv too; this removes a dependency and unifies the rest (and would maybe allow an easy way to multiprocess SAC/DDPG or TD3, cf #324). This would mean that we would remove PPO1 too.

    The next thing I would like to make default is the Monitor wrapper. This allows retrieving statistics about the training and would remove the need for a buggy version of total_episode_reward_logger for computing the reward (cf #143).

    As discussed in another issue, I would like to unify the learning rate schedules too (this would not be too difficult).

    I would also like to unify the parameter names (e.g. ent_coef vs ent_coeff).

    Anyway, I plan to do a PR and we can then discuss it there.

    Regarding the transition

    As we will be switching to the Keras interface (at least for most of the layers), this will break previously saved models. I propose to create scripts that convert old models to the new SB version rather than trying to be backward-compatible.

    Pinging @hill-a @erniejunior @AdamGleave @Miffyli

    PS: I hope I did not forget any important point

    EDIT: the draft repo is here: https://github.com/Stable-Baselines-Team/stable-baselines-tf2 (ppo and td3 included for now)

    v3 
    opened by araffin 44
  • Multithreading broken pipeline on custom Env

    Multithreading broken pipeline on custom Env

    First of all, thank you for this wonderful project. I can't stress enough how badly baselines needed such a project.

    Now, the Multiprocessing Tutorial created by stable-baselines (see) states that the following is to be used to generate multiple envs - as an example of course:

    def make_env(env_id, rank, seed=0):
        """
        Utility function for multiprocessed env.
        
        :param env_id: (str) the environment ID
        :param num_env: (int) the number of environment you wish to have in subprocesses
        :param seed: (int) the inital seed for RNG
        :param rank: (int) index of the subprocess
        """
        def _init():
            env = gym.make(env_id)
            env.seed(seed + rank)
            return env
        set_global_seeds(seed)
        return _init
    

    However, for some reason Python never calls _init: even though it has no arguments, it is still a function; hence, please replace it with 'return _init()'.

    Secondly, even doing so results in an error when building the SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)]), namely:

    Traceback (most recent call last):

    File "", line 1, in runfile('C:/Users/X/Desktop/thesis.py', wdir='C:/Users/X/Desktop')

    File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 705, in runfile execfile(filename, namespace)

    File "D:\Programs\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile exec(compile(f.read(), filename, 'exec'), namespace)

    File "C:/Users/X/Desktop/thesis.py", line 133, in env = SubprocVecEnv([make_env(env_id, i) for i in range(numenvs)])

    File "D:\Programs\Anaconda3\lib\site-packages\stable_baselines\common\vec_env\subproc_vec_env.py", line 52, in init process.start()

    File "D:\Programs\Anaconda3\lib\multiprocessing\process.py", line 105, in start self._popen = self._Popen(self)

    File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 223, in _Popen return _default_context.get_context().Process._Popen(process_obj)

    File "D:\Programs\Anaconda3\lib\multiprocessing\context.py", line 322, in _Popen return Popen(process_obj)

    File "D:\Programs\Anaconda3\lib\multiprocessing\popen_spawn_win32.py", line 65, in init reduction.dump(process_obj, to_child)

    File "D:\Programs\Anaconda3\lib\multiprocessing\reduction.py", line 60, in dump ForkingPickler(file, protocol).dump(obj)

    BrokenPipeError: [Errno 32] Broken pipe

    Any ideas on how to fix this? I have implemented a simple Gym env; does it need to extend/implement SubprocVecEnv?
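
    For reference, the pattern that generally works on Windows (where worker processes are spawned) keeps _init as a callable and builds the SubprocVecEnv under a main guard; a minimal sketch:

    import gym
    from stable_baselines.common import set_global_seeds
    from stable_baselines.common.vec_env import SubprocVecEnv

    def make_env(env_id, rank, seed=0):
        def _init():
            env = gym.make(env_id)
            env.seed(seed + rank)
            return env
        set_global_seeds(seed)
        return _init  # return the callable itself, not its result

    if __name__ == '__main__':  # needed on Windows: worker processes re-import this module
        env = SubprocVecEnv([make_env('CartPole-v1', i) for i in range(4)])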

    question 
    opened by lhorus 38
  • [question] [feature request] support for Dict and Tuple spaces

    [question] [feature request] support for Dict and Tuple spaces

    I want to train using two images from different cameras and an array of 1d data from a sensor. I'm passing these inputs as my env state. Obviously I need a cnn that can take those inputs, concatenate them, and train on them. My question is how to pass these inputs to such a custom cnn in policies.py. Also, I tried to pass two images and apparently dummy_vec_env.py had trouble with that:

    obs = env.reset()
    File "d:\resources\stable-baselines\stable_baselines\common\vec_env\dummy_vec_env.py", line 57, in reset self._save_obs(env_idx, obs)
    File "d:\resources\stable-baselines\stable_baselines\common\vec_env\dummy_vec_env.py", line 75, in _save_obs self.buf_obs[key][env_idx] = obs
    ValueError: cannot copy sequence with size 2 to array axis with dimension 80

    I appreciate any thoughts or examples.

    enhancement question v3 
    opened by AloshkaD 37
  • Policy base invalid action mask

    Policy base invalid action mask

    Currently supported:

    • Algorithms: PPO1, PPO2, A2C, ACER, ACKTR, TRPO
    • Action spaces: Discrete, MultiDiscrete
    • Policy networks: MlpPolicy, MlpLnLstmPolicy, MlpLstmPolicy
    • Policy networks (theoretically supported, but not tested): CnnPolicy, CnnLnLstmPolicy, CnnLstmPolicy
    • Vectorized environments: DummyVecEnv, SubprocVecEnv

    How to use: Environment, Test

    opened by ChengYen-Tang 35
  • ppo2 performance and gpu utilization

    ppo2 performance and gpu utilization

    I am running a ppo2 model. I see high cpu utilization and low gpu utilization.

    When running:

    from tensorflow.python.client import device_lib
    print(device_lib.list_local_devices())
    

    I get:

    Python 3.7.3 (default, Mar 27 2019, 17:13:21) [MSC v.1915 64 bit (AMD64)] :: Anaconda, Inc. on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> from tensorflow.python.client import device_lib
    >>> print(device_lib.list_local_devices())
    2019-05-06 11:06:02.117760: I tensorflow/core/platform/cpu_feature_guard.cc:141] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX AVX2
    2019-05-06 11:06:02.341488: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1433] Found device 0 with properties:
    name: GeForce GTX 1660 Ti major: 7 minor: 5 memoryClockRate(GHz): 1.77
    pciBusID: 0000:01:00.0
    totalMemory: 6.00GiB freeMemory: 4.92GiB
    2019-05-06 11:06:02.348112: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1512] Adding visible gpu devices: 0
    2019-05-06 11:06:02.838521: I tensorflow/core/common_runtime/gpu/gpu_device.cc:984] Device interconnect StreamExecutor with strength 1 edge matrix:
    2019-05-06 11:06:02.842724: I tensorflow/core/common_runtime/gpu/gpu_device.cc:990]      0
    2019-05-06 11:06:02.845154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1003] 0:   N
    2019-05-06 11:06:02.848092: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 4641 MB memory) -> physical GPU (device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5)
    [name: "/device:CPU:0"
    device_type: "CPU"
    memory_limit: 268435456
    locality {
    }
    incarnation: 8905916217148098349
    , name: "/device:GPU:0"
    device_type: "GPU"
    memory_limit: 4866611609
    locality {
      bus_id: 1
      links {
      }
    }
    incarnation: 7192145949653879362
    physical_device_desc: "device: 0, name: GeForce GTX 1660 Ti, pci bus id: 0000:01:00.0, compute capability: 7.5"
    ]
    

    I understand that tensorflow is "seeing" my gpu. Why is GPU utilization so low when training a stable-baselines model?

    # multiprocess environment
    n_cpu = 4
    env = PortfolioEnv(total_steps=settings['total_steps'], window_length=settings['window_length'], allow_short=settings['allow_short'] )
    env = SubprocVecEnv([lambda: env for i in range(n_cpu)])
    
    if settings['policy'] == 'MlpPolicy':
        model = PPO2(MlpPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
    elif settings['policy'] == 'MlpLstmPolicy': 
        model = PPO2(MlpLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
    elif settings['policy'] == 'MlpLnLstmPolicy': 
        model = PPO2(MlpLnLstmPolicy, env, verbose=0, tensorboard_log=settings['tensorboard_log'])
    
    model.learn(total_timesteps=settings['total_timesteps'])
    
    model_name = str(settings['model_name']) + '_' + str(settings['policy']) + '_' + str(settings['total_timesteps']) + '_' + str(settings['total_steps']) + '_' + str(settings['window_length']) + '_' + str(settings['allow_short'])  
    model.save(model_name)
    
    question windows 
    opened by hn2 32
  • [Feature Request] Invalid Action Mask

    [Feature Request] Invalid Action Mask

    It would be very useful to be able to adjust the gradient based on a binary vector indicating which outputs should be considered when computing the gradient.

    This would be insanely helpful when dealing with environments where the set of valid actions depends on the observation. A simple example of this would be StarCraft: at the beginning of a game, not every action is valid.
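
    The usual trick (independent of any particular Stable-Baselines API) is to push the logits of invalid actions to a very large negative value before the softmax; a small sketch:

    import numpy as np

    def mask_logits(logits, action_mask):
        # Invalid actions get a huge negative logit, so softmax assigns them ~0 probability.
        return np.where(action_mask.astype(bool), logits, -1e8)

    logits = np.array([1.2, 0.3, -0.5, 2.0])
    mask = np.array([1, 0, 1, 0])  # only actions 0 and 2 are valid this step
    masked = mask_logits(logits, mask)
    probs = np.exp(masked) / np.exp(masked).sum()
    print(probs)  # invalid actions end up with probability ~0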

    enhancement 
    opened by H-Park 28
  • ValueError: could not broadcast input array from shape (2) into shape (7,3,5)

    ValueError: could not broadcast input array from shape (2) into shape (7,3,5)

    Describe the bug: I am trying to run stable_baselines algorithms such as PPO1 and DDPG and get this error: ValueError: could not broadcast input array from shape (2) into shape (7,1,5)

    Code example

    The action will be the portfolio weights, from 0 to 1 for each asset:

        self.action_space = gym.spaces.Box(-1, 1, shape=(len(instruments) + 1,), dtype=np.float32)  # include cash
    
        # get the observation space from the data min and max
        self.observation_space = gym.spaces.Box(low=-np.inf, high=np.inf, shape=(len(instruments), window_length, history.shape[-1]), dtype=np.float32)
    

    I tried using obs.reshape(-1), obs.flatten(), obs.ravel(); nothing works. I also tried CnnPolicy instead of MlpPolicy and got:

    ValueError: Negative dimension size caused by subtracting 8 from 7 for 'model/c1/Conv2D' (op: 'Conv2D') with input shapes: [?,7,1,5], [8,8,5,32].

    System Info. Describe the characteristic of your environment:

    • Library was installed from source: git clone https://github.com/hill-a/stable-baselines.git; cd stable-baselines; pip install -e .

    • GPU models and configuration: no gpu, cpu only
    • Python 3.7.2
    • tensorflow 1.12.0
    • stable-baselines 2.4.1

    Additional context: tensorflow==1.13.1 (cpu)

    custom gym env 
    opened by hn2 28
  • Why does env.render() create multiple render screens? | LSTM policy predict with one env [question]

    Why does env.render() create multiple render screens? | LSTM policy predict with one env [question]

    When I run the code example from the docs for cartpole multiprocessing, it renders one window with all envs playing the game. It also renders individual windows with the same envs playing the same games.

    import gym
    import numpy as np
    
    from stable_baselines.common.policies import MlpPolicy
    from stable_baselines.common.vec_env import SubprocVecEnv
    from stable_baselines.common import set_global_seeds
    from stable_baselines import ACKTR
    
    def make_env(env_id, rank, seed=0):
        """
        Utility function for multiprocessed env.
    
        :param env_id: (str) the environment ID
        :param num_env: (int) the number of environments you wish to have in subprocesses
        :param seed: (int) the inital seed for RNG
        :param rank: (int) index of the subprocess
        """
        def _init():
            env = gym.make(env_id)
            env.seed(seed + rank)
            return env
        set_global_seeds(seed)
        return _init
    
    env_id = "CartPole-v1"
    num_cpu = 4  # Number of processes to use
    # Create the vectorized environment
    env = SubprocVecEnv([make_env(env_id, i) for i in range(num_cpu)])
    
    model = ACKTR(MlpPolicy, env, verbose=1)
    model.learn(total_timesteps=25000)
    
    obs = env.reset()
    for _ in range(1000):
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
    

    System Info Describe the characteristic of your environment:

    • Vanilla install, followed the docs using pip
    • gpus: 2-gtx-1080ti's
    • Python version 3.6.5
    • Tensorflow version 1.12.0
    • ffmpeg 4.0

    Additional context cartpole

    question 
    opened by SerialIterator 24
  • [feature request] Implement goal-parameterized algorithms (HER)

    [feature request] Implement goal-parameterized algorithms (HER)

    I'd like to implement Hindsight Experience Replay (HER). This can be based on any goal-parameterized off-policy RL algorithm.

    Goal-parameterized architectures: they require a variable for the current goal and one for the current outcome. By outcome, I mean anything that is required to compute the reward in the process of targeting the goal, e.g. the RL task is to reach a 3D target (the goal) with a robotic hand. The position of the target is the goal, the position of the hand is the outcome. The reward is a function of the distance between the two. Goal and outcome are usually subparts of the state space.

    How Gym handles this: In Gym, there is a class called GoalEnv to deal with such environments.

    • The variable observation_space is replaced by another class that contains the true observation space observation_space.spaces['observation'], goal space (observation_space.spaces['desired_goal']) and outcome space(observation_space.spaces['achieved_goal']).
    • The observation returned first by env.step is now a dictionary: obs['observation'], obs['desired_goal'], obs['achieved_goal'].
    • The environment defines a reward function (compute_reward), that takes as argument the goal and the outcome to return the reward
    • It also contains a sample_goal function that simply samples a goal uniformly from the goal space.
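
    A quick illustration of this interface (a sketch only; 'FetchReach-v1' needs the gym robotics environments and MuJoCo, it is used here purely as an example):

    import gym

    env = gym.make('FetchReach-v1')
    obs = env.reset()
    print(obs['observation'].shape, obs['desired_goal'].shape, obs['achieved_goal'].shape)

    # The reward can be recomputed for an arbitrary (achieved_goal, desired_goal) pair,
    # which is exactly what HER exploits when relabelling transitions.
    reward = env.compute_reward(obs['achieved_goal'], obs['desired_goal'], {})
    print(reward)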

    Stable-baselines does not consider this so far. The replay buffer, BasePolicy, BaseRLModels and OffPolicyRLModels only consider observations, and are not made to include a notion of goal or outcome. Two solutions:

    1. Adapt these default classes to optionally allow the representation of goals and outcomes.
    2. Concatenate the goal and outcome to the observation everywhere so that the previously mentioned classes don't see the difference. This requires keeping track of the indices of goals and outcomes in the full obs_goal_outcome vector. Ashley started to do this from what I understood. However, he did not take into account the GoalEnv class of Gym, which I think we should use as it's kind of neat, and it's used for the robotic Fetch environments, which are kind of the only ones generally used so far.

    I think the second is more clear as it separates observation from goals and outcomes, but probably it would make the code less easy to follow, and would require more changes than the first option. So let's go for the first as Ashley started.

    First thoughts on how it could be done.

    1. We need (as Ashley started to do) a wrapper around the gym environment. GoalEnvs are different from usual envs because they return a dict in place of the former observation vector. This wrapper would unpack the observation into obs, goal, outcome from GoalEnv.step. It would return a concatenation of all of those. Ashley assumed that the goal was in the observation space, so that the concatenation was twice as long as the observation. This is generally not true, so we would need to keep the size of the goal and outcome spaces as attributes. The wrapper would keep the different spaces as attributes, keep the function to sample goals, and the reward function.

    2. A multi-goal replay buffer to implement HER replay. It takes the observation from the buffer and decomposes it into obs, goal and outcome before performing replay.

    I think it does not require too much work after what Ashley started to do. It would take a few modifications to integrate the GoalEnv of gym, as it is a standard way to use multi-goal environments, and then correcting the assumption he made about the dimension of the goal.

    If you're all ok with it, I will start in that direction and test it on the Fetch environments. In the baselines, their performance is achieved with 19 processes in parallel; they basically average the updates of the 19 actors. I'll try first without parallelization.

    enhancement 
    opened by ccolas 22
  • Image input into TD3

    Image input into TD3

    Hi,

    I have a custom env with an image observation space and a continuous action space. After training TD3 policies, when I evaluate them there seems to be no reaction to the image observation (I manually drag objects in front of the camera to see what happens).

    from stable_baselines.td3.policies import CnnPolicy as td3CnnPolicy
    from stable_baselines import TD3
    
    env = gym.make('GripperEnv-v0')
    env = Monitor(env, log_dir)
    ExperimentName = "TD3_test"
    policy_kwargs = dict(layers=[64, 64])
    model = TD3(td3CnnPolicy, env, verbose=1, policy_kwargs=policy_kwargs, tensorboard_log="tmp/", buffer_size=15000,
                batch_size=2200, train_freq=2200, learning_starts=10000, learning_rate=1e-3)
    
    callback = SaveOnBestTrainingRewardCallback(check_freq=1100, log_dir=log_dir)
    time_steps = 50000
    model.learn(total_timesteps=int(time_steps), callback=callback)
    model.save("128128/"+ExperimentName)
    

    I can view the observation using opencv and it is the right image (single channel, pixels between 0 and 1).

    So how I understand it is that the CNN is 3 Conv2D layers that connect to two layers 64 wide. Is it possible that I somehow disconnected these two parts, or could it be that my hyper-parameters are just that bad? The behavior learnt by the policies is similar to what I would get if I just put zero pixels into the network.

    bug question custom gym env 
    opened by C-monC 20
  • How to create an actor-critic network with two separate LSTMs

    How to create an actor-critic network with two separate LSTMs

    Hi, I was wondering if it's possible to have an actor-critic network with two separate LSTMs, where one LSTM outputs value (critic) and one LSTM outputs actions (actor)? Similar to #1002 but the two LSTMs would be receiving the same input from CNN layers.

    Based on the LSTMPolicy source code, the net_arch parameter can only take one 'LSTM' occurrence and LSTM's are only supported in the shared part of the policy network.

    # Build the non-shared part of the policy-network
                    latent_policy = latent
                    for idx, pi_layer_size in enumerate(policy_only_layers):
                        if pi_layer_size == "lstm":
                            raise NotImplementedError("LSTMs are only supported in the shared part of the policy network.")
                        assert isinstance(pi_layer_size, int), "Error: net_arch[-1]['pi'] must only contain integers."
                        latent_policy = act_fun(
                            linear(latent_policy, "pi_fc{}".format(idx), pi_layer_size, init_scale=np.sqrt(2)))
    
                    # Build the non-shared part of the value-network
                    latent_value = latent
                    for idx, vf_layer_size in enumerate(value_only_layers):
                        if vf_layer_size == "lstm":
                            raise NotImplementedError("LSTMs are only supported in the shared part of the value function "
                                                      "network.")
                        assert isinstance(vf_layer_size, int), "Error: net_arch[-1]['vf'] must only contain integers."
                        latent_value = act_fun(
                            linear(latent_value, "vf_fc{}".format(idx), vf_layer_size, init_scale=np.sqrt(2)))
    

    I was wondering if there is a way around this, or if the implementation of this policy class simply does not support having two LSTMs? I tried writing code to get around this, but the tensorboard model graph for the resulting network looked funky enough that I might just switch to stable baselines 3.

    opened by ashleychung830 0
  • How to convert timestep based learning to episodic learning

    How to convert timestep based learning to episodic learning

    Question: I am using a custom environment to do path planning with the DDPG algorithm (I am using Stable-Baselines3).

    model = DDPG("MlpPolicy", env, action_noise=action_noise, verbose=1)
    model.learn(total_timesteps=10000, log_interval=1)
    model.save("sb3_ddpg_model")

    Here model.learn is used for timestep-based learning, but I want to convert it to something like 3000 steps per episode and have multiple episodes. How can I achieve that?

    opened by muk465 1
  • TypeError: can't pickle dolfin.cpp.geometry.Point objects

    TypeError: can't pickle dolfin.cpp.geometry.Point objects


    If your issue is related to a custom gym environment, please check it first using:

    from stable_baselines.common.env_checker import check_env
    
    env = CustomEnv(arg1, ...)
    # It will check your custom environment and output additional warnings if needed
    check_env(env)
    

    Describe the bug

    TypeError: can't pickle dolfin.cpp.geometry.Point objects when I try to use multiprocessed computation.

    Code example

    n_cpu = 12
    env = SubprocVecEnv([lambda: env_ for i in range(n_cpu)])


    System Info. Describe the characteristic of your environment:

    • Library installed with: pip install stable-baselines
    • Python version: 3.7.0
    • Tensorflow version: 1.14.0
    • Versions of any other relevant libraries: gym 0.25.2
    opened by jiangzhangze 0
  • Custom gym Env Assertion error regarding reset() method

    Custom gym Env Assertion error regarding reset() method

    Hello,

    I am having some issues when checking my custom environment. I have checked the several solutions adopted and suggested by other people here, but they don't seem to solve the issue I'm having.

    It says: AssertionError: The observation returned by the reset() method does not match the given observation space.

    Here are the lines of code I used to create my custom env:

    class EnvWrapperSB2(gym.Env):
        def __init__(self, no_threads, **params):
            super(EnvWrapperSB2, self).__init__()
            #self.action_space = None
            #self.observation_space = None
            self.params = params
            self.no_threads = no_threads
            self.ports = [13968+i+np.random.randint(40000) for i in range(no_threads)]
            self.commands = self._craft_commands(params)
            #self.action_space = spaces.Discrete(1)
            self.action_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=int)
            self.observation_space = spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float32)
    
            self.SCRIPT_RUNNING = False
            self.envs = []
    
            self.run()
            for port in self.ports:
                env = ns3env.Ns3Env(port=port, stepTime=params['envStepTime'], startSim=0, simSeed=0, simArgs=params, debug=False)
                self.envs.append(env)
    
            self.SCRIPT_RUNNING = True
    
        def run(self):
            if self.SCRIPT_RUNNING:
                raise AlreadyRunningException("Script is already running")
    
            for cmd, port in zip(self.commands, self.ports):
                subprocess.Popen(['bash', '-c', cmd])
            self.SCRIPT_RUNNING = True
    
        def _craft_commands(self, params):
            try:
                waf_pwd = find_waf_path("./")
            except FileNotFoundError:
                import sys
                sys.path.append("../../")
                waf_pwd = find_waf_path("../../")
    
            command = f'{waf_pwd} --run "RLinWiFi-master-original-queue-size'
            for key, val in params.items():
                command+=f" --{key}={val}"
    
            commands = []
            for p in self.ports:
                commands.append(command+f' --openGymPort={p}"')
    
            return commands
    
        def reset(self):
            obs = []
            for env in self.envs:
                obs.append(env.reset())
            #print("reset - obs tamanho",len(obs))
            #print("reset - obs",obs)
    
            return np.array(obs)
            
        def step(self, actions):
            next_obs, reward, done, info = [], [], [], []
    
            for i, env in enumerate(self.envs):
                no, rew, dn, inf = env.step(actions[i].tolist())
                next_obs.append(no)
                reward.append(rew)
                done.append(dn)
                info.append(inf)
    
            return np.array(next_obs), np.array(reward), np.array(done), np.array(info)
    
        #@property
        #def observation_space(self):
         #   dim = repr(self.envs[0].observation_space).replace('(', '').replace(',)', '').split(", ")[2]
          #  return (self.no_threads, int(dim))
    
        #@property
        #def action_space(self):
         #   dim = repr(self.envs[0].action_space).replace('(', '').replace(',)', '').split(", ")[2]
          #  return (self.no_threads, int(dim))
    
        def close(self):
            time.sleep(5)
            for env in self.envs:
                env.close()
            # subprocess.Popen(['bash', '-c', "killall linear-mesh"])
    
            self.SCRIPT_RUNNING = False
    
        def __getattr__(self, attr):
            for env in self.envs:
                env.attr()
    
    

    Then I check the environment with the intention of using it with a SAC agent.

    sim_args = {
        "simTime": simTime,
        "envStepTime": stepTime,
        "historyLength": history_length,
        "scenario": "basic",
        "nWifi": 5,
    }
    threads_no = 1
    env = EnvWrapperSB2(threads_no, **sim_args)
    
    from stable_baselines.common.env_checker import check_env
    
    # If the environment doesn't follow the interface, an error will be thrown
    check_env(env, warn=True)
    
    AssertionError: The observation returned by the `reset()` method does not match the given observation space
    
    model = SAC(MlpPolicy, env, verbose=1)
    

    System Info

    • Python version = 3.7.10
    • Tensorflow version = 1.14.0
    • Stable-Baseline = 2.10.2

    Can someone please help me? Perhaps I'm doing something wrong.

    opened by sheila-janota 0
  • [question] for an RL algorithm with a discrete action space, is it possible to get a probability of outcomes when feeding in data?

    [question] for an RL algorithm with a discrete action space, is it possible to get a probability of outcomes when feeding in data?

    For example, if I feed data to an RL algorithm with possible actions up or down, is it possible to know, for every frame, the probability that the model will pick up or down?
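
    For discrete action spaces in Stable-Baselines 2.x, action_probability() is meant for exactly this; a small sketch (the environment and training length are arbitrary):

    import gym
    from stable_baselines import PPO2

    model = PPO2('MlpPolicy', 'CartPole-v1').learn(5000)

    env = gym.make('CartPole-v1')
    obs = env.reset()
    # Probability the current policy assigns to each discrete action for this observation
    probs = model.action_probability(obs)
    print(probs)  # e.g. something like [p_action_0, p_action_1]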

    opened by george-adams1 0
  • [Question]Callback collected model does not have same reward as training verbose[custom gym environment]

    [Question]Callback collected model does not have same reward as training verbose[custom gym environment]

    Models saved periodically do not match the reward in the training window

    I have a question when I was checking my training result. I am using a custom gym environment, and PPO algorithm from SB3.

    During training, I save the model periodically in order to see how the model is evolving. During learning, I also set verbose=1 to keep track of the training progress. However, when I look at the temporary models I save periodically, they do not have the same reward as at the time they were saved.

    For example, I saved "model_1" at timesteps=10,000 using a custom callback function. At the same time, the training window showed "ep_rew_mean=366" at timesteps=10,000. However, when I test "model_1" individually, its reward is 200. During testing, I set model.predict(obs, deterministic=True). I wonder why this happens, and whether it is caused by my callback function.

    Moreover, my final model also does not have the same reward as the training window.

    Here is my code for custom callback function:

    class SaveOnModelCallback(BaseCallback):
        """
        Callback for saving a model (the check is done every ``check_freq`` steps)
        based on the training reward (in practice, we recommend using ``EvalCallback``).
    
        :param check_freq: (int)
        :param log_dir: (str) Path to the folder where the model will be saved.
          It must contain the file created by the ``Monitor`` wrapper.
        :param verbose: (int)
        """
        def __init__(self, check_freq: int, log_dir: str, verbose=1):
            super(SaveOnModelCallback, self).__init__(verbose)
            self.check_freq = check_freq
            self.log_dir = log_dir
            self.save_path = os.path.join(log_dir, 'best_model')
    
        def _init_callback(self) -> None:
            # Create folder if needed
            if self.save_path is not None:
                os.makedirs(self.save_path, exist_ok=True)
    
        def _on_step(self) -> bool:
            if self.n_calls % self.check_freq == 0:  
              count = self.n_calls // self.check_freq
              str1 = 'Tempmodel'
              print(f"Num timesteps: {self.num_timesteps}")
              print(f"Saving model to {self.save_path}.zip")
              self.model.save(str1+str(count))
    
    
            return True
    
    opened by hotpotking-lol 1
Releases(v2.10.1)
  • v2.10.1(Aug 5, 2020)

    Breaking Changes:

    • render() method of VecEnvs now only accepts one argument: mode

    New Features:

    • Added momentum parameter to A2C for the embedded RMSPropOptimizer (@kantneel)
    • ActionNoise is now an abstract base class and implements __call__, NormalActionNoise and OrnsteinUhlenbeckActionNoise have return types (@PartiallyTyped)
    • HER now passes info dictionary to compute_reward, allowing for the computation of rewards that are independent of the goal (@tirafesi)

    Bug Fixes:

    • Fixed DDPG sampling empty replay buffer when combined with HER (@tirafesi)
    • Fixed a bug in HindsightExperienceReplayWrapper, where the openai-gym signature for compute_reward was not matched correctly (@johannes-dornheim)
    • Fixed SAC/TD3 checking time to update on learn steps instead of total steps (@PartiallyTyped)
    • Added **kwarg pass through for reset method in atari_wrappers.FrameStack (@PartiallyTyped)
    • Fix consistency in setup_model() for SAC, target_entropy now uses self.action_space instead of self.env.action_space (@PartiallyTyped)
    • Fix reward threshold in test_identity.py
    • Partially fix tensorboard indexing for PPO2 (@enderdead)
    • Fixed potential bug in DummyVecEnv where copy() was used instead of deepcopy()
    • Fixed a bug in GAIL where the dataloader was not available after saving, causing an error when using CheckpointCallback
    • Fixed a bug in SAC where any convolutional layers were not included in the target network parameters.
    • Fixed render() method for VecEnvs
    • Fixed seed() method for SubprocVecEnv
    • Fixed a bug callback.locals did not have the correct values (@PartiallyTyped)
    • Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
    • Fixed a bug in the generate_expert_traj() method in record_expert.py when using a non-image vectorized environment (@jbarsce)
    • Fixed a bug in CloudPickleWrapper's (used by VecEnvs) __setstate___ where loading was incorrectly using pickle.loads (@shwang).
    • Fixed a bug in SAC and TD3 where the log timesteps was not correct (@YangRui2015)
    • Fixed a bug where the environment was reset twice when using evaluate_policy

    Others:

    • Added version.txt to manage version number in an easier way
    • Added .readthedocs.yml to install requirements with read the docs
    • Added a test for seeding SubprocVecEnv and rendering

    Documentation:

    • Fix typos (@caburu)
    • Fix typos in PPO2 (@kvenkman)
    • Removed stable_baselines\deepq\experiments\custom_cartpole.py (@aakash94)
    • Added Google's motion imitation project
    • Added documentation page for monitor
    • Fixed typos and update VecNormalize example to show normalization at test-time
    • Fixed train_mountaincar description
    • Added imitation baselines project
    • Updated install instructions
    • Added Slime Volleyball project (@hardmaru)
    • Added a table of the variables accessible from the on_step function of the callbacks for each algorithm (@PartiallyTyped)
    • Fix typo in README.md (@ColinLeongUDRI)
    Source code(tar.gz)
    Source code(zip)
  • v2.10.0(Mar 12, 2020)

    Breaking Changes

    • evaluate_policy now returns the standard deviation of the reward per episode as second return value (instead of n_steps)

    • evaluate_policy now returns as second return value a list of the episode lengths when return_episode_rewards is set to True (instead of n_steps)

    • Callbacks are now called after each env.step() for consistency (they were previously called every n_steps in algorithms like A2C or PPO2)

    • Removed unused code in common/a2c/utils.py (calc_entropy_softmax, make_path)

    • Refactoring, including removed files and moving functions.

      • Algorithms no longer import from each other, and common does not import from algorithms.

      • a2c/utils.py removed and split into other files:

        • common/tf_util.py: sample, calc_entropy, mse, avg_norm, total_episode_reward_logger, q_explained_variance, gradient_add, avg_norm, check_shape, seq_to_batch, batch_to_seq.
        • common/tf_layers.py: conv, linear, lstm, _ln, lnlstm, conv_to_fc, ortho_init.
        • a2c/a2c.py: discount_with_dones.
        • acer/acer_simple.py: get_by_index, EpisodeStats.
        • common/schedules.py: constant, linear_schedule, middle_drop, double_linear_con, double_middle_drop, SCHEDULES, Scheduler.
      • trpo_mpi/utils.py functions moved (traj_segment_generator moved to common/runners.py, flatten_lists to common/misc_util.py).

      • ppo2/ppo2.py functions moved (safe_mean to common/math_util.py, constfn and get_schedule_fn to common/schedules.py).

      • sac/policies.py function mlp moved to common/tf_layers.py.

      • sac/sac.py function get_vars removed (replaced with tf.util.get_trainable_vars).

      • deepq/replay_buffer.py renamed to common/buffers.py.

    New Features:

    • Parallelized updating and sampling from the replay buffer in DQN. (@flodorner)
    • Docker build script, scripts/build_docker.sh, can push images automatically.
    • Added callback collection
    • Added unwrap_vec_normalize and sync_envs_normalization in the vec_env module to synchronize two VecNormalize environment
    • Added a seeding method for vectorized environments. (@NeoExtended)
    • Added extend method to store batches of experience in ReplayBuffer. (@solliet)

    Bug Fixes:

    • Fixed Docker images via scripts/build_docker.sh and Dockerfile: GPU image now contains tensorflow-gpu, and both images have stable_baselines installed in developer mode at correct directory for mounting.
    • Fixed Docker GPU run script, scripts/run_docker_gpu.sh, to work with new NVidia Container Toolkit.
    • Repeated calls to RLModel.learn() now preserve internal counters for some episode logging statistics that used to be zeroed at the start of every call.
    • Fix DummyVecEnv.render for num_envs > 1. This used to print a warning and then not render at all. (@shwang)
    • Fixed a bug in PPO2, ACER, A2C, and ACKTR where repeated calls to learn(total_timesteps) reset the environment on every call, potentially biasing samples toward early episode timesteps. (@shwang)
    • Fixed by adding lazy property ActorCriticRLModel.runner. Subclasses now use lazily-generated self.runner instead of reinitializing a new Runner every time learn() is called.
    • Fixed a bug in check_env where it would fail on high dimensional action spaces
    • Fixed Monitor.close() that was not calling the parent method
    • Fixed a bug in BaseRLModel when seeding vectorized environments. (@NeoExtended)
    • Fixed num_timesteps computation to be consistent between algorithms (updated after env.step()). Only TRPO and PPO1 update it differently (after synchronization) because they rely on MPI
    • Fixed bug in TRPO with NaN standardized advantages (@richardwu)
    • Fixed partial minibatch computation in ExpertDataset (@richardwu)
    • Fixed normalization (with VecNormalize) for off-policy algorithms
    • Fixed sync_envs_normalization to sync the reward normalization too
    • Bump minimum Gym version (>=0.11)

    Others:

    • Removed redundant return value from a2c.utils::total_episode_reward_logger. (@shwang)
    • Cleanup and refactoring in common/identity_env.py (@shwang)
    • Added a Makefile to simplify common development tasks (build the doc, type check, run the tests)

    Documentation:

    • Add dedicated page for callbacks
    • Fixed example for creating a GIF (@KuKuXia)
    • Change Colab links in the README to point to the notebooks repo
    • Fix typo in Reinforcement Learning Tips and Tricks page. (@mmcenta)
    Source code(tar.gz)
    Source code(zip)
  • v2.9.0(Dec 19, 2019)

    Breaking Changes:

    • The seed argument has been moved from learn() method to model constructor in order to have reproducible results
    • allow_early_resets of the Monitor wrapper now default to True
    • make_atari_env now returns a DummyVecEnv by default (instead of a SubprocVecEnv) this usually improves performance.
    • Fix inconsistency of sample type, so that mode/sample function returns tensor of tf.int64 in CategoricalProbabilityDistribution/MultiCategoricalProbabilityDistribution (@seheevic)

    New Features:

    • Add n_cpu_tf_sess to model constructor to choose the number of threads used by Tensorflow

    • Environments are automatically wrapped in a DummyVecEnv if needed when passing them to the model constructor

    • Added stable_baselines.common.make_vec_env helper to simplify VecEnv creation

    • Added stable_baselines.common.evaluation.evaluate_policy helper to simplify model evaluation

    • VecNormalize changes:

      • Now supports being pickled and unpickled (@AdamGleave).
      • New methods .normalize_obs(obs) and normalize_reward(rews) apply normalization to arbitrary observation or rewards without updating statistics (@shwang)
      • .get_original_reward() returns the unnormalized rewards from the most recent timestep
      • .reset() now collects observation statistics (used to only apply normalization)
    • Add parameter exploration_initial_eps to DQN. (@jdossgollin)

    • Add type checking and PEP 561 compliance. Note: most functions are still not annotated, this will be a gradual process.

    • DDPG, TD3 and SAC accept non-symmetric action spaces. (@Antymon)

    • Add check_env util to check if a custom environment follows the gym interface (@araffin and @justinkterry)

    Bug Fixes:

    • Fix seeding, so it is now possible to have deterministic results on cpu
    • Fix a bug in DDPG where predict method with deterministic=False would fail
    • Fix a bug in TRPO: mean_losses was not initialized causing the logger to crash when there was no gradients (@MarvineGothic)
    • Fix a bug in cmd_util from API change in recent Gym versions
    • Fix a bug in DDPG, TD3 and SAC where warmup and random exploration actions would end up scaled in the replay buffer (@Antymon)

    Deprecations:

    • nprocs (ACKTR) and num_procs (ACER) are deprecated in favor of n_cpu_tf_sess which is now common to all algorithms
    • VecNormalize: load_running_average and save_running_average are deprecated in favour of using pickle.

    Others:

    • Add upper bound for Tensorflow version (<2.0.0).
    • Refactored test to remove duplicated code
    • Add pull request template
    • Replaced redundant code in load_results (@jbulow)
    • Minor PEP8 fixes in dqn.py (@justinkterry)
    • Add a message to the assert in PPO2
    • Update replay buffer doctring
    • Fix VecEnv docstrings

    Documentation:

    • Add plotting to the Monitor example (@rusu24edward)
    • Add Snake Game AI project (@pedrohbtp)
    • Add note on the support Tensorflow versions.
    • Remove unnecessary steps required for Windows installation.
    • Remove DummyVecEnv creation when not needed
    • Added make_vec_env to the examples to simplify VecEnv creation
    • Add QuaRL project (@srivatsankrishnan)
    • Add Pwnagotchi project (@evilsocket)
    • Fix multiprocessing example (@rusu24edward)
    • Fix result_plotter example
    • Add JNRR19 tutorial (by @edbeeching, @hill-a and @araffin)
    • Updated notebooks link
    • Fix typo in algos.rst, "containes" to "contains" (@SyllogismRXS)
    • Fix outdated source documentation for load_results
    • Add PPO_CPP project (@Antymon)
    • Add section on C++ portability of Tensorflow models (@Antymon)
    • Update custom env documentation to reflect new gym API for the close() method (@justinkterry)
    • Update custom env documentation to clarify what step and reset return (@justinkterry)
    • Add RL tips and tricks for doing RL experiments
    • Corrected lots of typos
    • Add spell check to documentation if available
    Source code(tar.gz)
    Source code(zip)
  • v2.8.0(Sep 29, 2019)

    Breaking Changes:

    • OpenMPI-dependent algorithms (PPO1, TRPO, GAIL, DDPG) are disabled in the default installation of stable_baselines. mpi4py is now installed as an extra. When mpi4py is not available, stable-baselines skips imports of OpenMPI-dependent algorithms. See installation notes <openmpi> and Issue #430.
    • SubprocVecEnv now defaults to a thread-safe start method, forkserver when available and otherwise spawn. This may require application code be wrapped in if __name__ == '__main__'. You can restore previous behavior by explicitly setting start_method = 'fork'. See PR #428.
    • Updated dependencies: tensorflow v1.8.0 is now required
    • Removed checkpoint_path and checkpoint_freq argument from DQN that were not used
    • Removed bench/benchmark.py that was not used
    • Removed several functions from common/tf_util.py that were not used
    • Removed ppo1/run_humanoid.py

    New Features:

    • important change Switch to using zip-archived JSON and Numpy savez for storing models for better support across library/Python versions. (@Miffyli)
    • ACKTR now supports continuous actions
    • Add double_q argument to DQN constructor

    Bug Fixes:

    • Skip automatic imports of OpenMPI-dependent algorithms to avoid an issue where OpenMPI would cause stable-baselines to hang on Ubuntu installs. See installation notes <openmpi> and Issue #430.
    • Fix a bug when calling logger.configure() with MPI enabled (@keshaviyengar)
    • set allow_pickle=True for numpy>=1.17.0 when loading expert dataset
    • Fix a bug when using VecCheckNan with numpy ndarray as state. Issue #489. (@ruifeng96150)

    Deprecations:

    • Models saved with cloudpickle format (stable-baselines<=2.7.0) are now deprecated in favor of zip-archive format for better support across Python/Tensorflow versions. (@Miffyli)

    Others:

    • Implementations of noise classes (AdaptiveParamNoiseSpec, NormalActionNoise, OrnsteinUhlenbeckActionNoise) were moved from stable_baselines.ddpg.noise to stable_baselines.common.noise. The API remains backward-compatible; for example from stable_baselines.ddpg.noise import NormalActionNoise is still okay. (@shwang)
    • Docker images were updated
    • Cleaned up files in common/ folder and in acktr/ folder that were only used by old ACKTR version (e.g. filter.py)
    • Renamed acktr_disc.py to acktr.py

    Documentation:

    • Add WaveRL project (@jaberkow)
    • Add Fenics-DRL project (@DonsetPG)
    • Fix and rename custom policy names (@eavelardev)
    • Add documentation on exporting models.
    • Update maintainers list (Welcome to @Miffyli)
    Source code(tar.gz)
    Source code(zip)
  • v2.7.0(Jul 31, 2019)

    New Features

    • added Twin Delayed DDPG (TD3) algorithm, with HER support
    • added support for continuous action spaces to action_probability, computing the PDF of a Gaussian policy in addition to the existing support for categorical stochastic policies.
    • added flag to action_probability to return log-probabilities.
    • added support for python lists and numpy arrays in logger.writekvs. (@dwiel)
    • the info dict returned by VecEnvs now include a terminal_observation key providing access to the last observation in a trajectory. (@qxcv)

    Bug Fixes

    • fixed a bug in traj_segment_generator where the episode_starts was wrongly recorded, resulting in wrong calculation of Generalized Advantage Estimation (GAE), this affects TRPO, PPO1 and GAIL (thanks to @miguelrass for spotting the bug)
    • added missing property n_batch in BasePolicy.

    Others

    • renamed some keys in traj_segment_generator to be more meaningful
    • retrieve unnormalized rewards when using the Monitor wrapper with TRPO, PPO1 and GAIL, to display them in the logs (mean episode reward)
    • clean up DDPG code (renamed variables)

    Documentation

    • doc fix for the hyperparameter tuning command in the rl zoo
    • added an example on how to log additional variable with tensorboard and a callback
  • v2.6.0(Jun 13, 2019)

    Breaking Changes:

    • breaking change removed stable_baselines.ddpg.memory in favor of stable_baselines.deepq.replay_buffer (see fix below)

    Breaking Change: The DDPG replay buffer was unified with the DQN/SAC replay buffer. As a result, loading a DDPG model trained with stable_baselines<2.6.0 raises an import error. You can fix that using:

    import sys
    import pkg_resources

    import stable_baselines

    # Fix for the breaking change to the DDPG buffer in v2.6.0:
    # alias the removed module to the unified replay buffer.
    # parse_version gives a proper version comparison (a plain string
    # comparison would mis-handle versions such as 2.10.0).
    installed = pkg_resources.get_distribution("stable_baselines").version
    if pkg_resources.parse_version(installed) >= pkg_resources.parse_version("2.6.0"):
        sys.modules['stable_baselines.ddpg.memory'] = stable_baselines.deepq.replay_buffer
        stable_baselines.deepq.replay_buffer.Memory = stable_baselines.deepq.replay_buffer.ReplayBuffer

    We recommend saving the model again afterwards, so the fix won't be needed the next time the trained agent is loaded.

    New Features:

    • revamped HER implementation: clean re-implementation from scratch, now supports DQN, SAC and DDPG
    • add action_noise param for SAC; it helps exploration for problems with deceptive rewards
    • The parameter filter_size of the function conv in A2C utils now supports passing a list/tuple of two integers (height and width), in order to have non-square kernel matrices. (@yutingsz)
    • add random_exploration parameter for DDPG and SAC; it may be useful when using HER + DDPG/SAC. This hack was present in the original OpenAI Baselines DDPG + HER implementation.
    • added load_parameters and get_parameters to the base RL class. With these methods, users can load and get parameters to/from an existing model without touching tensorflow (see the sketch after this list). (@Miffyli)
    • added specific hyperparameter for PPO2 to clip the value function (cliprange_vf)
    • added VecCheckNan wrapper
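
    For example, a minimal sketch of the new parameter access methods; PPO2 and CartPole-v1 are arbitrary choices:

    from stable_baselines import PPO2

    model = PPO2('MlpPolicy', 'CartPole-v1')

    # Ordered dict mapping tensorflow variable names to numpy arrays
    params = model.get_parameters()

    # ... inspect or modify the arrays here ...

    # Load the (possibly modified) parameters back, without touching tensorflow
    model.load_parameters(params)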

    Bug Fixes:

    • bugfix for VecEnvWrapper.__getattr__ which enables access to class attributes inherited from parent classes.
    • fixed path splitting in TensorboardWriter._get_latest_run_id() on Windows machines (@PatrickWalter214)
    • fixed a bug where initial learning rate is logged instead of its placeholder in A2C.setup_model (@sc420)
    • fixed a bug where number of timesteps is incorrectly updated and logged in A2C.learn and A2C._train_step (@sc420)
    • fixed num_timesteps (total_timesteps) variable in PPO2 that was wrongly computed.
    • fixed a bug in DDPG/DQN/SAC that occurred when the number of samples in the replay buffer was lower than the batch size (thanks to @dwiel for spotting the bug)
    • removed a2c.utils.find_trainable_params; please use common.tf_util.get_trainable_vars instead. find_trainable_params was returning all trainable variables, discarding the scope argument. This bug was causing the model to save duplicated parameters (for DDPG and SAC) but did not affect the performance.

    Deprecations:

    • deprecated memory_limit and memory_policy in DDPG; please use buffer_size instead. (will be removed in v3.x.x)

    Others:

    • important change switched to using dictionaries rather than lists when storing parameters, with tensorflow Variable names being the keys. (@Miffyli)
    • removed unused dependencies (tqdm, dill, progressbar2, seaborn, glob2, click)
    • removed get_available_gpus function which hadn't been used anywhere (@Pastafarianist)

    Documentation:

    • added guide for managing NaN and inf
    • updated vec_env doc
    • misc doc updates
  • v2.5.1(May 4, 2019)

    Warning: breaking change when using custom policies

    • doc update (fix example of result plotter + improve doc)
    • fixed logger issues when stdout lacks a read function
    • fixed a bug in common.dataset.Dataset where shuffling was not disabled properly (it affects only PPO1 with recurrent policies)
    • fixed output layer name for DDPG q function, used in pop-art normalization and l2 regularization of the critic
    • added support for multi env recording to generate_expert_traj (@XMaster96)
    • added support for LSTM model recording to generate_expert_traj (@XMaster96)
    • GAIL: remove mandatory matplotlib dependency and refactor as subclass of TRPO (@kantneel and @AdamGleave)
    • added get_attr(), env_method() and set_attr() methods for all VecEnv. Those methods now all accept an indices keyword to select a subset of envs; set_attr now returns None rather than a list of None (see the sketch after this list). (@kantneel)
    • GAIL: gail.dataset.ExpertDataset supports loading from memory rather than file, and gail.dataset.record_expert supports returning in-memory rather than saving to file.
    • added support in VecEnvWrapper for accessing attributes of arbitrarily deeply nested instances of VecEnvWrapper and VecEnv. This is allowed as long as the attribute belongs to exactly one of the nested instances i.e. it must be unambiguous. (@kantneel)
    • fixed bug where result plotter would crash on very short runs (@Pastafarianist)
    • added option to not trim output of result plotter by number of timesteps (@Pastafarianist)
    • clarified the public interface of BasePolicy and ActorCriticPolicy. Breaking change when using custom policies: masks_ph is now called dones_ph.
    • support for custom stateful policies.
    • fixed episode length recording in trpo_mpi.utils.traj_segment_generator (@GerardMaggiolino)
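
    An illustrative sketch of the VecEnv helpers mentioned above; DummyVecEnv and CartPole-v1 are arbitrary choices:

    import gym

    from stable_baselines.common.vec_env import DummyVecEnv

    env = DummyVecEnv([lambda: gym.make('CartPole-v1') for _ in range(4)])

    # Read an attribute from a subset of the environments
    max_steps = env.get_attr('_max_episode_steps', indices=[0, 1])
    # Call a method on every environment
    seeds = env.env_method('seed', 0)
    # Set an attribute on some environments (returns None)
    env.set_attr('_max_episode_steps', 200, indices=[2, 3])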
  • v2.5.0(Mar 28, 2019)

    • fixed various bugs in GAIL
    • added scripts to generate dataset for gail
    • added tests for GAIL + data for Pendulum-v0
    • removed unused utils file in DQN folder
    • fixed a bug in A2C where actions were cast to int32 even in the continuous case
    • added additional logging to A2C when the Monitor wrapper is used
    • changed logging for PPO2: do not display NaN when reward info is not present
    • changed the default value of the A2C lr schedule
    • removed behavior cloning script
    • added pretrain method to the base class, in order to use behavior cloning on all models (see the sketch after this list)
    • fixed close() method for DummyVecEnv.
    • added support for Dict spaces in DummyVecEnv and SubprocVecEnv. (@AdamGleave)
    • added support for arbitrary multiprocessing start methods and added a warning that SubprocVecEnv is not thread-safe by default. (@AdamGleave)
    • added support for Discrete actions for GAIL
    • fixed deprecation warning for tf: replaces tf.to_float() by tf.cast()
    • fixed bug in saving and loading ddpg model when using normalization of obs or returns (@tperol)
    • changed DDPG default buffer size from 100 to 50000.
    • fixed a bug in ddpg.py in combined_stats for eval. Computed mean on eval_episode_rewards and eval_qs (@keshaviyengar)
    • fixed a bug in setup.py that would error on non-GPU systems without TensorFlow installed
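
    For the pretrain item above, a hedged sketch of behavior cloning before RL training; the expert file name is a placeholder for a dataset recorded with the new generation scripts, and PPO2/CartPole-v1 are arbitrary choices:

    from stable_baselines import PPO2
    from stable_baselines.gail import ExpertDataset

    # 'expert_cartpole.npz' is a placeholder dataset path
    dataset = ExpertDataset(expert_path='expert_cartpole.npz', batch_size=64)

    model = PPO2('MlpPolicy', 'CartPole-v1')
    # Behavior cloning on the expert data, then regular RL training
    model.pretrain(dataset, n_epochs=10)
    model.learn(total_timesteps=10000)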

    Welcome to @AdamGleave, who joins the maintainer team.

  • v2.4.1(Feb 11, 2019)

    • fixed computation of training metrics in TRPO and PPO1
    • added reset_num_timesteps keyword when calling train() to continue tensorboard learning curves (see the sketch after this list)
    • reduced the size taken by tensorboard logs (added a full_tensorboard_log flag to enable full logging, which was the previous behavior)
    • fixed image detection for tensorboard logging
    • fixed ACKTR for recurrent policies
    • fixed gym breaking changes
    • fixed custom policy examples in the doc for DQN and DDPG
    • remove gym spaces patch for equality functions
    • fixed tensorflow dependency: the CPU version was installed, overwriting tensorflow-gpu when present.
    • fixed a bug in traj_segment_generator (used in ppo1 and trpo) where new was not updated. (spotted by @junhyeokahn)
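
    A short sketch of continuing training while keeping a single tensorboard curve; in current releases the keyword is exposed through learn(), and PPO2/CartPole-v1 are arbitrary choices:

    from stable_baselines import PPO2

    model = PPO2('MlpPolicy', 'CartPole-v1', tensorboard_log='./ppo2_tensorboard/')
    model.learn(total_timesteps=10000)
    # Continue training without resetting the timestep counter,
    # so the tensorboard learning curve stays continuous
    model.learn(total_timesteps=10000, reset_num_timesteps=False)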
  • v2.4.0(Jan 17, 2019)

    • added Soft Actor-Critic (SAC) model
    • fixed a bug in DQN where prioritized_replay_beta_iters param was not used
    • fixed DDPG that did not save target network parameters
    • fixed bug related to shape of true_reward (@abhiskk)
    • fixed example code in documentation of tf_util:Function (@JohannesAck)
    • added learning rate schedule for SAC
    • fixed action probability for continuous actions with actor-critic models
    • added optional parameter to action_probability for computing the likelihood of a given action being taken.
    • added more flexible custom LSTM policies
    • added auto entropy coefficient optimization for SAC
    • clip continuous actions at test time too for all algorithms (except SAC/DDPG where it is not needed)
    • added a means to pass kwargs to the policy when creating a model (those kwargs are also saved); see the sketch after this list
    • fixed DQN examples in DQN folder
    • added possibility to pass activation function for DDPG, DQN and SAC
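
    As an illustration of the policy kwargs and activation-function items above, a hedged sketch passing policy keyword arguments when creating a model; SAC, Pendulum-v0 and the chosen layer sizes are arbitrary:

    import tensorflow as tf

    from stable_baselines import SAC

    # policy_kwargs are forwarded to the policy constructor and saved with the model
    model = SAC('MlpPolicy', 'Pendulum-v0',
                policy_kwargs=dict(act_fun=tf.nn.tanh, layers=[64, 64]))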

    We would like to thank our contributors (in random order): @abhiskk @JohannesAck @EliasHasle @mrakgr @Bleyddyn, and to welcome a new maintainer: @erniejunior

  • v2.3.0(Dec 5, 2018)

    • added support for storing model in file like object. (thanks to @erniejunior)
    • fixed wrong image detection when using tensorboard logging with DQN
    • fixed bug in ppo2 when passing non callable lr after loading
    • fixed tensorboard logging in ppo2 when nminibatches=1
    • added early stopping via callback return value (see the sketch after this list) (@erniejunior)
    • added more flexible custom mlp policies (@erniejunior)
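
    A minimal sketch of early stopping through the callback return value; the stopping condition and the PPO2/CartPole-v1 choices are placeholders:

    from stable_baselines import PPO2

    n_calls = 0

    def stop_callback(locals_, globals_):
        # Returning False stops training early
        global n_calls
        n_calls += 1
        # Placeholder condition: stop after 100 callback calls
        return n_calls < 100

    model = PPO2('MlpPolicy', 'CartPole-v1')
    model.learn(total_timesteps=100000, callback=stop_callback)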
  • v2.2.1(Nov 18, 2018)

  • v2.2.0(Nov 7, 2018)

    • Hotfix for ppo2, the wrong placeholder was used for the value function

    Note: this bug has been present since v1.0, so we recommend updating to the latest version of stable-baselines

  • v2.1.2(Nov 6, 2018)

    • added async_eigen_decomp parameter for ACKTR and set it to False by default (removes deprecation warnings)
    • added methods for calling env methods/setting attributes inside a VecEnv (thanks to @bjmuld)
    • updated gym minimum version

    Contributors (since v2.0.0):

    Thanks to @bjmuld @iambenzo @iandanforth @r7vme @brendenpetersen @huvar

  • v2.1.1(Oct 20, 2018)

    • fixed MpiAdam synchronization issue in PPO1 (thanks to @brendenpetersen) issue #50
    • fixed dependency issues (new mujoco-py requires a mujoco licence + gym broke MultiDiscrete space shape)
  • v2.1.0(Oct 2, 2018)

    WARNING: This version contains breaking changes; please read the full details

    • added a patch fix for the equality function of gym.spaces.MultiDiscrete and gym.spaces.MultiBinary
    • fixes for DQN action_probability
    • re-added double DQN + refactored DQN policies breaking changes
    • replaced async with async_eigen_decomp in ACKTR/KFAC for python 3.7 compatibility
    • removed action clipping for prediction of continuous actions (see issue #36)
    • fixed NaN issue due to clipping the continuous action in the wrong place (issue #36)
  • v2.0.0(Sep 18, 2018)

    WARNING: This version contains breaking changes; please read the full details

    • Renamed DeepQ to DQN breaking changes
    • Renamed DeepQPolicy to DQNPolicy breaking changes
    • fixed DDPG behavior breaking changes
    • changed default policies for DDPG, so that DDPG now works correctly breaking changes
    • added more documentation (some modules from common).
    • added doc about using custom env
    • added Tensorboard support for A2C, ACER, ACKTR, DDPG, DeepQ, PPO1, PPO2 and TRPO
    • added episode reward to Tensorboard
    • added documentation for Tensorboard usage
    • added Identity for Box action space
    • fixed render function ignoring parameters when using wrapped environments
    • fixed PPO1 and TRPO done values for recurrent policies
    • fixed image normalization not occurring when using images
    • updated VecEnv objects for the new Gym version
    • added test for DDPG
    • refactored DQN policies
    • added registry for policies, can be passed as string to the agent
    • added documentation for custom policies + policy registration
    • fixed numpy warning when using DDPG Memory
    • fixed DummyVecEnv not copying the observation array when stepping and resetting
    • added pre-built docker images + installation instructions
    • added deterministic argument to the predict function (see the sketch after this list)
    • added assert in PPO2 for recurrent policies
    • fixed predict function to handle both vectorized and unwrapped environment
    • added input check to the predict function
    • refactored ActorCritic models to reduce code duplication
    • refactored Off Policy models (to begin HER and replay_buffer refactoring)
    • added tests for auto vectorization detection
    • fixed render function, to handle positional arguments
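
    For the prediction-related items above, a brief sketch of the predict API with the deterministic argument; A2C and CartPole-v1 are arbitrary choices:

    from stable_baselines import A2C

    model = A2C('MlpPolicy', 'CartPole-v1')
    model.learn(total_timesteps=5000)

    obs = model.env.reset()
    # deterministic=True returns the most likely action instead of sampling
    action, _states = model.predict(obs, deterministic=True)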
  • v1.0.7(Aug 29, 2018)

    • added html documentation using sphinx + integration with read the docs
    • cleaned up README + typos
    • fixed normalization for DQN with images
    • fixed DQN identity test
  • v1.0.1(Aug 20, 2018)

    • refactored A2C, ACER, ACKTR, DDPG, DeepQ, GAIL, TRPO, PPO1 and PPO2 under a single consistent class
    • added callback to refactored algorithm training
    • added saving and loading to refactored algorithms
    • refactored ACER, DDPG, GAIL, PPO1 and TRPO to fit with A2C, PPO2 and ACKTR policies
    • added new policies for most algorithms (Mlp, MlpLstm, MlpLnLstm, Cnn, CnnLstm and CnnLnLstm)
    • added dynamic environment switching (so continual RL learning is now feasible)
    • added prediction from observation and action probability from observation for all the algorithms
    • fixed graph issues, so models won't collide in names
    • fixed behavior_clone weight loading for GAIL
    • fixed Tensorflow using all the GPU VRAM
    • fixed models so that they are all compatible with vectorized environments
    • fixed set_global_seed to update gym.spaces's random seed
    • fixed PPO1 and TRPO performance issues when learning identity function
    • added new tests for loading, saving, continuous actions and learning the identity function
    • fixed DQN wrapping for atari
    • added saving and loading for the VecNormalize wrapper
    • added automatic detection of action space (for the policy network)
    • fixed ACER buffer with constant values assuming n_stack=4
    • fixed some RL algorithms not clipping the action to be in the action_space, when using gym.spaces.Box
    • refactored algorithms can take either a gym.Environment or a str (if the environment name is registered)
    • Hotfix in ACER (compared to v1.0.0)

    Future Work:

    • Finish refactoring HER
    • Refactor ACKTR and ACER for continuous implementation
  • v1.0.0(Aug 20, 2018)

  • v0.1.6(Aug 14, 2018)

    • Fixed tf.Session().__enter__() being used, rather than creating sess = tf.Session() and passing the session to the objects
    • Fixed uneven scoping of TensorFlow Sessions throughout the code
    • Fixed rolling vecwrapper to handle observations that are not only grayscale images
    • Fixed deepq saving the environment when trying to save itself
    • Fixed ValueError: Cannot take the length of Shape with unknown rank. in acktr, when running run_atari.py script.
    • Fixed an issue where calling baselines algorithms sequentially created graph conflicts
    • Fixed mean on empty array warning with deepq
    • Fixed kfac eigen decomposition not cast to float64, when the parameter use_float64 is set to True
    • Fixed Dataset data loader, not correctly resetting id position if shuffling is disabled
    • Fixed EOFError when reading from connection in the worker in subproc_vec_env.py
    • Fixed behavior_clone weight loading and saving for GAIL
    • Avoid taking root square of negative number in trpo_mpi.py
    • Removed some duplicated code (a2cpolicy, trpo_mpi)
    • Removed unused, undocumented and crashing function reset_task in subproc_vec_env.py
    • Reformatted code to PEP8 style
    • Documented all the codebase
    • Added atari tests
    • Added logger tests

    Missing: tests for acktr continuous (+ HER, gail but they rely on mujoco...)
