PyTorch version of Stable Baselines, reliable implementations of reinforcement learning algorithms.

Overview

Stable Baselines3

Stable Baselines3 (SB3) is a set of reliable implementations of reinforcement learning algorithms in PyTorch. It is the next major version of Stable Baselines.

You can read a detailed presentation of Stable Baselines3 in the v1.0 blog post.

These algorithms will make it easier for the research community and industry to replicate, refine, and identify new ideas, and will create good baselines to build projects on top of. We expect these tools will be used as a base around which new ideas can be added, and as a tool for comparing a new approach against existing ones. We also hope that the simplicity of these tools will allow beginners to experiment with a more advanced toolset, without being buried in implementation details.

Note: Despite its simplicity of use, Stable Baselines3 (SB3) assumes you have some knowledge about Reinforcement Learning (RL). You should not use this library without some prior practice. To that end, we provide good resources in the documentation to get started with RL.

Main Features

The performance of each algorithm was tested (see the Results section on its respective documentation page); take a look at issues #48 and #49 for more details.

Features                    | Stable-Baselines3
--------------------------- | -----------------
State of the art RL methods | ✔️
Documentation               | ✔️
Custom environments         | ✔️
Custom policies             | ✔️
Common interface            | ✔️
Ipython / Notebook friendly | ✔️
Tensorboard support         | ✔️
PEP8 code style             | ✔️
Custom callback             | ✔️
High code coverage          | ✔️
Type hints                  | ✔️

Planned features

Please take a look at the Roadmap and Milestones.

Migration guide: from Stable-Baselines (SB2) to Stable-Baselines3 (SB3)

A migration guide from SB2 to SB3 can be found in the documentation.

Documentation

Documentation is available online: https://stable-baselines3.readthedocs.io/

RL Baselines3 Zoo: A Training Framework for Stable Baselines3 Reinforcement Learning Agents

RL Baselines3 Zoo is a training framework for Reinforcement Learning (RL).

It provides scripts for training, evaluating agents, tuning hyperparameters, plotting results and recording videos.

In addition, it includes a collection of tuned hyperparameters for common environments and RL algorithms, and agents trained with those settings.

Goals of this repository:

  1. Provide a simple interface to train and enjoy RL agents
  2. Benchmark the different Reinforcement Learning algorithms
  3. Provide tuned hyperparameters for each environment and RL algorithm
  4. Have fun with the trained agents!

Github repo: https://github.com/DLR-RM/rl-baselines3-zoo

Documentation: https://stable-baselines3.readthedocs.io/en/master/guide/rl_zoo.html

SB3-Contrib: Experimental RL Features

We implement experimental features in a separate contrib repository: SB3-Contrib

This allows SB3 to maintain a stable and compact core, while still providing the latest features, like Truncated Quantile Critics (TQC) or Quantile Regression DQN (QR-DQN).

Documentation is available online: https://sb3-contrib.readthedocs.io/
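
Algorithms from SB3-Contrib follow the same API as those in SB3. For instance (a minimal sketch, assuming sb3-contrib is installed via pip install sb3-contrib; the environment name is only an example):

from sb3_contrib import TQC

model = TQC("MlpPolicy", "Pendulum-v0", verbose=1)
model.learn(total_timesteps=10000)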

Installation

Note: Stable-Baselines3 supports PyTorch 1.4+.

Prerequisites

Stable Baselines3 requires Python 3.6+.

Windows 10

To install Stable Baselines3 on Windows, please look at the documentation.

Install using pip

Install the Stable Baselines3 package:

pip install stable-baselines3[extra]

This includes optional dependencies like Tensorboard, OpenCV or atari-py to train on Atari games. If you do not need those, you can use:

pip install stable-baselines3

Please read the documentation for more details and alternatives (from source, using docker).

Example

Most of the library tries to follow a sklearn-like syntax for the Reinforcement Learning algorithms.

Here is a quick example of how to train and run PPO on a cartpole environment:

import gym

from stable_baselines3 import PPO

env = gym.make("CartPole-v1")

model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=10000)

obs = env.reset()
for i in range(1000):
    action, _states = model.predict(obs, deterministic=True)
    obs, reward, done, info = env.step(action)
    env.render()
    if done:
      obs = env.reset()

env.close()

Or just train a model with a one-liner if the environment is registered in Gym and if the policy is registered:

from stable_baselines3 import PPO

model = PPO('MlpPolicy', 'CartPole-v1').learn(10000)
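
Trained models can also be saved to disk and loaded back later; a minimal sketch (the file name is just an example):

model.save("ppo_cartpole")
del model  # delete the in-memory model to demonstrate loading
model = PPO.load("ppo_cartpole")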

Please read the documentation for more examples.

Try it online with Colab Notebooks!

All the examples can be executed online using Google Colab notebooks.

Implemented Algorithms

Name | Recurrent | Box | Discrete | MultiDiscrete | MultiBinary | Multi Processing
---- | --------- | --- | -------- | ------------- | ----------- | ----------------
A2C  |           | ✔️  | ✔️       | ✔️            | ✔️          | ✔️
DDPG |           | ✔️  |          |               |             |
DQN  |           |     | ✔️       |               |             |
HER  |           | ✔️  | ✔️       |               |             |
PPO  |           | ✔️  | ✔️       | ✔️            | ✔️          | ✔️
SAC  |           | ✔️  |          |               |             |
TD3  |           | ✔️  |          |               |             |

Actions gym.spaces (see the short sketch after this list):

  • Box: An N-dimensional box that contains every point in the action space.
  • Discrete: A list of possible actions, where only one action can be used per timestep.
  • MultiDiscrete: A list of possible actions, where only one action from each discrete set can be used per timestep.
  • MultiBinary: A list of possible actions, where any of the actions can be used in any combination at each timestep.
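
For reference, a short sketch of how each action space is constructed with gym.spaces (the shapes and sizes below are only illustrative):

from gym import spaces

spaces.Box(low=-1.0, high=1.0, shape=(3,))  # 3-dimensional continuous actions
spaces.Discrete(4)                          # one of 4 possible actions per timestep
spaces.MultiDiscrete([3, 2])                # one action from each discrete set per timestep
spaces.MultiBinary(5)                       # any combination of 5 binary actions per timestep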

Testing the installation

All unit tests in Stable Baselines3 can be run using the pytest runner:

pip install pytest pytest-cov
make pytest

You can also do a static type check using pytype:

pip install pytype
make type

Codestyle check with flake8:

pip install flake8
make lint

Projects Using Stable-Baselines3

We try to maintain a list of projects using stable-baselines3 in the documentation. Please tell us if you want your project to appear on this page ;)

Citing the Project

To cite this repository in publications:

@misc{stable-baselines3,
  author = {Raffin, Antonin and Hill, Ashley and Ernestus, Maximilian and Gleave, Adam and Kanervisto, Anssi and Dormann, Noah},
  title = {Stable Baselines3},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/DLR-RM/stable-baselines3}},
}

Maintainers

Stable-Baselines3 is currently maintained by Ashley Hill (aka @hill-a), Antonin Raffin (aka @araffin), Maximilian Ernestus (aka @ernestum), Adam Gleave (@AdamGleave) and Anssi Kanervisto (@Miffyli).

Important Note: We do not provide technical support or consulting, and we do not answer personal questions by email. Please post your question on the RL Discord, Reddit or Stack Overflow instead.

How To Contribute

To anyone interested in making the baselines better: there is still some documentation that needs to be done. If you want to contribute, please read the CONTRIBUTING.md guide first.

Acknowledgments

The initial work to develop Stable Baselines3 was partially funded by the project Reduced Complexity Models from the Helmholtz-Gemeinschaft Deutscher Forschungszentren.

The original version, Stable Baselines, was created in the robotics lab U2IS (INRIA Flowers team) at ENSTA ParisTech.

Logo credits: L.M. Tenkes

Issues
  • Support for MultiBinary / MultiDiscrete spaces

    Description

    • Added support for MultiDiscrete and MultiBinary observation / action spaces for PPO and A2C
    • Added MultiCategorical and Bernoulli distributions
    • Added tests for MultiCategorical and Bernoulli distributions and actions spaces

    Motivation and Context

    • [x] I have raised an issue to propose this change (required for new features and bug fixes)

    closes #5 closes #4

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    • [ ] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [ ] My change requires a change to the documentation.
    • [x] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [ ] I have updated the documentation accordingly.
    • [x] I have checked the codestyle using make lint
    • [x] I have ensured pytest and pytype both pass.
    opened by rolandgvc 54
  • Roadmap to Stable-Baselines3 V1.0

    This issue is meant to be updated as the list of changes is not exhaustive

    Dear all,

    Stable-Baselines3 beta is now out :tada: ! This issue is meant to reference what is implemented and what is missing before a first major version.

    As mentioned in the README, before v1.0, breaking changes may occur. I would like to encourage contributors (especially the maintainers) to make comments on how to improve the library before v1.0 (and maybe make some internal changes).

    I will try to review the features mentioned in https://github.com/hill-a/stable-baselines/issues/576 (and https://github.com/hill-a/stable-baselines/issues/733) and I will create issues soon to reference what is missing.

    What is implemented?

    • [x] basic features (training/saving/loading/predict)
    • [x] basic set of algorithms (A2C/PPO/SAC/TD3)
    • [x] basic pre-processing (Box and Discrete observation/action spaces are handled)
    • [x] callback support
    • [x] complete benchmark for the continuous action case
    • [x] basic rl zoo for training/evaluating/plotting (https://github.com/DLR-RM/rl-baselines3-zoo)
    • [x] consistent api
    • [x] basic tests and most type hints
    • [x] continuous integration (I'm in discussion with the organization admins for that)
    • [x] handle more observation/action spaces #4 and #5 (thanks @rolandgvc)
    • [x] tensorboard integration #9 (thanks @rolandgvc)
    • [x] basic documentation and notebooks
    • [x] automatic build of the documentation
    • [x] Vanilla DQN #6 (thanks @Artemis-Skade)
    • [x] Refactor off-policy critics to reduce code duplication #3 (see #78 )
    • [x] DDPG #3
    • [x] do a complete benchmark for the discrete case #49 (thanks @Miffyli !)
    • [x] performance check for continuous actions #48 (even better than gSDE paper)
    • [x] get/set parameters for the base class (#138 )
    • [x] clean up type-hints in docs #10 (cumbersome to read)
    • [x] documenting the migration between SB and SB3 #11
    • [x] finish typing some methods #175
    • [x] HER #8 (thanks @megan-klaiber)
    • [x] finishing to update and clean the doc #166 (help is wanted)
    • [x] finishing to update the notebooks and the tutorial #7 (I will do that, only HER notebook missing)

    What are the new features?

    • [x] much cleaner base code (and no more warnings =D )
    • [x] independent saving/loading/predict for policies
    • [x] State-Dependent Exploration (SDE) for using RL directly on real robots (this is a unique feature, it was the starting point of SB3, I published a paper on that: https://arxiv.org/abs/2005.05719)
    • [x] proper evaluation (using separate env) is included in the base class (using EvalCallback)
    • [x] all environments are VecEnv
    • [x] better saving/loading (now can include the replay buffer and the optimizers)
    • [x] any number of critics are allowed for SAC/TD3
    • [x] custom actor/critic net arch for off-policy algos (#113 )
    • [x] QR-DQN in SB3-Contrib
    • [x] Truncated Quantile Critics (TQC) (see #83 ) in SB3-Contrib
    • @Miffyli suggested a "contrib" repo for experimental features (it is here)

    What is missing?

    • [x] syncing some files with Stable-Baselines to remain consistent (we may be good now, but need to be checked)
    • [x] finish code-review of existing code #17

    Checklist for v1.0 release

    • [x] Update Readme
    • [x] Prepare blog post
    • [x] Update doc: add links to the stable-baselines3 contrib
    • [x] Update docker image to use newer Ubuntu version
    • [x] Populate RL zoo

    What is next? (for V1.1+)

    • basic dict/tuple support for observations (#243 )
    • simple recurrent policies?
    • DQN extensions (double, PER, IQN)
    • Implement TRPO
    • multi-worker training for all algorithms (#179 )
    • n-step returns for off-policy algorithms #47 (@PartiallyTyped )
    • SAC discrete #157 (need to be discussed, benefit vs DQN+extensions?)
    • Energy Based Prioritisation? (@RyanRizzo96)
    • implement action_proba in the base class?
    • test the doc snippets #14 (help is welcomed)
    • noisy networks (https://arxiv.org/abs/1706.10295) @PartiallyTyped ? exploration in parameter space?
    • Munchausen Reinforcement Learning (MDQN) (probably in the contrib first, e.g. https://github.com/pfnet/pfrl/pull/74)

    side note: should we change the default start_method to fork? (now that we don't have tf anymore)

    enhancement 
    opened by araffin 46
  • Dictionary Observations

    In machine learning, input comes in the form of matrices. Typically, models take in one matrix at a time, as in image classification, where a matrix containing the input image is given to the model and the model classifies the image. However, there are many situations in which taking multiple inputs is necessary. One example is training a reinforcement learning agent whose observations come in the form of an image (e.g., camera, grid sensor, etc.) and a vector describing the agent's state (e.g., current position, health, etc.). In this situation, it is necessary to feed two inputs to the model. This PR addresses this.
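
    To make the setting concrete, here is a minimal sketch of such a multi-input observation space (the shapes are purely illustrative):

    import numpy as np
    from gym import spaces

    observation_space = spaces.Dict({
        "image": spaces.Box(low=0, high=255, shape=(64, 64, 3), dtype=np.uint8),
        "vector": spaces.Box(low=-np.inf, high=np.inf, shape=(5,), dtype=np.float32),
    })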

    Description

    • added example environments with multi-input observations
    • added DictReplayBuffer and DictRolloutBuffer to handle dictionaries
    • added CombinedExtractor feature extractor that handles generic dictionary data
    • added StackedObservations and StackedDictObservations to decouple data stacking from the VecFrameStack wrapper
    • added test_dict_env.py test
    • added an is_vectorized_env() method per observation space type in common/utils.py

    Motivation and Context

    • [x] I have raised an issue to propose this change (link)

    closes #216

    closes #287 (image support for HER) closes #284 (for off-policy algorithms)

    Types of changes

    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to change)
    • [x] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [x] My change requires a change to the documentation.
    • [x] I have updated the tests accordingly
    • [x] I have updated the documentation accordingly.
    • [x] I have reformatted the code using make format
    • [x] I have checked the codestyle using make check-codestyle and make lint
    • [x] I have ensured make pytest and make type both pass. (required)
    • [x] I have checked that the documentation builds using make doc (required)

    TODOs

    • [x] check that documentation is properly updated
    • [x] check that dict with vectors only is the same as mlp policy + vector flatten
    • [x] Update env checker
    • [x] (optional) refactor HER: https://github.com/DLR-RM/stable-baselines3/tree/feat/refactor-her
    • [x] test A2C/PPO/SAC alone with GoalEnv
    opened by J-Travnik 45
  • Match performance with stable-baselines (discrete case)

    This PR will be done when stable-baselines3 agent performance matches stable-baselines in discrete envs. Will be tested on discrete control tasks and Atari environments.

    Closes #49 Closes #105

    PS: Sorry about the confusing branch-name.

    Changes

    • Fix storing correct dones (#105, credits to AndyShih12)
    • Fix number of filters in NatureCNN
    • Add common.sb2_compat.RMSpropTFLike, which is a modification of RMSprop that matches TF version, and is required for matching performance in A2C.

    TODO

    • [x] Match performance of A2C and PPO.

    • [x] A2C Cartpole matches (mostly, see this. Averaged over 10 random seeds for both. Requires the TF-like RMSprop, and even still in the very end SB3 seems more unstable.)

    • [x] A2C Atari matches (mostly, see sb2 and sb3. Original sb3 result here. Three random seeds, each line separate run (ignore legend). Using TF-like RMSprop. Performance and stability mostly matches, except sb2 has sudden spike in performance in Q*Bert. Something to do with stability in distributions?)

    • [x] PPO Cartpole (using rl-zoo parameters, see learning curves, averaged over 20 random seeds)

    • [x] PPO Atari (mostly, see sb2 and sb3 results (shaded curves averaged over two seeds). Q*Bert still seems to have an edge on SB2 for unknown reasons)

    • [x] Check and match performance of DQN. Seems ok. See following learning curves, each curve is an average over three random seeds: atari_spaceinvaders.pdf atari_qbert.pdf atari_breakout.pdf atari_pong.pdf

    • [x] Check if "dones" fix can (and should) be moved to computing GAE side.

    • [x] ~~Write docs on how to match A2C and PPO settings to stable-baselines ("moving from stable-baselines"). There are some important quirks to note here.~~ Move this to migration guide PR #123 .

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    • [x] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [ ] I have updated the documentation accordingly.
    • [x] I have reformatted the code using make format (required)
    • [x] I have checked the codestyle using make check-codestyle and make lint (required)
    • [x] I have ensured make pytest and make type both pass. (required)
    opened by Miffyli 28
  • Tensorboard integration

    Description

    Adding support for logging to tensorboard.

    Missing:

    • [x] Documentation
    • [x] More tests
    • [x] check we don't make the same mistakes as SB2 (https://github.com/hill-a/stable-baselines/issues/855 https://github.com/hill-a/stable-baselines/issues/56 )

    Motivation and Context

    • [x] I have raised an issue to propose this change (required for new features and bug fixes)

    closes #9

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [x] Breaking change (fix or feature that would cause existing functionality to change)
    • [ ] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [x] My change requires a change to the documentation.
    • [x] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [x] I have updated the documentation accordingly.
    • [x] I have checked the codestyle using make lint
    • [x] I have ensured make pytest and make type both pass.
    opened by rolandgvc 27
  • Implement DQN

    Description

    Implementation of vanilla dqn

    closes #6 closes #37 closes #46

    Missing:

    • [x] Update examples to include DQN
    • [x] Add test for replay buffer truncation

    Motivation and Context

    • [x] I have raised an issue to propose this change (required for new features and bug fixes)

    Types of changes

    • [ ] Bug fix (non-breaking change which fixes an issue)
    • [x] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    • [ ] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [x] My change requires a change to the documentation.
    • [x] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [x] I have updated the documentation accordingly.
    • [x] I have checked the codestyle using make lint
    • [x] I have ensured make pytest and make type both pass.
    opened by Artemis-Skade 26
  • Memory allocation for buffers

    With the current implementation of buffers.py, one can request a buffer size that does not fit in the available memory: because of NumPy's implementation of np.zeros(), the memory is not allocated until it is actually used. But since the buffer is meant to be filled completely (otherwise one could just use a smaller buffer), the computer will eventually run out of memory and start to swap heavily. Because only smaller parts of the buffer (minibatches) are accessed at once, the system will just swap the necessary pages in and out of memory. At that point the progress of the run is most likely lost and one has to start a new run with a smaller buffer.

    I would recommend using np.ones instead, as it will allocate the buffer at the beginning and fail if the system does not provide enough memory. The only issue is that there is no clear error description when the system memory is exceeded: Python simply gets killed by the OS with a SIGKILL. Maybe one could catch that?
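
    An alternative to np.ones would be an explicit up-front check; a minimal sketch (assuming psutil is available, with an illustrative buffer shape and dtype):

    import numpy as np
    import psutil

    buffer_size, obs_dim = 1_000_000, 84 * 84 * 4
    total_bytes = buffer_size * obs_dim * np.dtype(np.float32).itemsize
    if total_bytes > psutil.virtual_memory().available:
        raise MemoryError(f"Replay buffer needs ~{total_bytes / 1e9:.1f} GB, more than the available system memory")
    observations = np.zeros((buffer_size, obs_dim), dtype=np.float32)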

    bug enhancement 
    opened by Artemis-Skade 25
  • [feature request] Add total episodes parameter to model learn method

    Hi,

    TL;DR: I would like the option to pass either a total_episodes parameter or a total_timesteps to the model.learn() method.

    Now, for my reasoning. Currently, we can only define the total_timesteps when training an agent, as follows.

    model = A2C('MlpLstmPolicy', env, verbose=1, policy_kwargs=policy_kwargs)
    model.learn(total_timesteps=1000)
    

    However, for some scenarios (e.g., stock trading), it is quite common to have a fixed number of timesteps per episode, given by the available time-series data points. Also, it can be quite valuable to scan all data points thoroughly an equal number of times and to control the number of passes, which is defined by the number of episodes.

    Thus, to train for a given number of episodes with a fixed number of timesteps per episode, I have to compute the total_timesteps value before passing it to model.learn(), as follows:

    desired_total_episodes = 100
    n_points = train_df.shape[0]  # get the number of data points
    total_timesteps = desired_total_episodes * n_points
    

    Even so, this answer on StackOverflow says that

    Where the episode length is known, set it to the desired number of episode you would like to train. However, it might be less because the agent might not (probably wont) reach max steps every time.

    I must admit I do not know how accurate this answer is, but it worries me that my model may not scan all the data equally.

    Another option, as discussed in this issue from the previous Stable Baselines repo, is to use a callback function. Still, for this callback approach, I would have to pass a total_timesteps value that is high enough to reach the desired number of episodes. Hence, this callback approach seems like an out-of-the-way workaround.

    In conclusion, I believe that including the option to pass a total_episodes would be a simple and effective addition that would broaden the range of use cases covered by this project.

    Thank you for your attention!
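
    As a possible workaround in the meantime, SB3 ships a StopTrainingOnMaxEpisodes callback; a minimal sketch (the environment and episode count are only examples), where total_timesteps merely acts as an upper bound:

    from stable_baselines3 import A2C
    from stable_baselines3.common.callbacks import StopTrainingOnMaxEpisodes

    # stop training after 100 episodes, regardless of how many timesteps that takes
    callback_max_episodes = StopTrainingOnMaxEpisodes(max_episodes=100, verbose=1)
    model = A2C("MlpPolicy", "CartPole-v1", verbose=1)
    model.learn(total_timesteps=int(1e10), callback=callback_max_episodes)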

    enhancement 
    opened by xicocaio 21
  • [Feature Request] Refactor `predict` method in `BasePolicy` class

    🚀 Feature

    At present, the predict method in the BasePolicy class contains quite a lot of logic that could be reused to provide similar functionality. In particular, the current logic of this method is as follows:

    1. Pre-process NumPy observation and convert it into a PyTorch Tensor.
    2. Generate the action(s) from the child policy class through the _predict method, with these actions in the form of a PyTorch Tensor.
    3. Post-process the actions, including converting the PyTorch Tensor into a NumPy array.

    My suggestion is that steps (1) and (3) be refactored into individual methods on the BasePolicy class, which are then called in the predict method.
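
    A rough sketch of the proposed split (the helper names below are only illustrative, not existing SB3 API):

    class BasePolicy:
        def obs_to_tensor(self, observation):
            # step (1): pre-process the NumPy observation and convert it to a PyTorch tensor
            ...

        def actions_to_numpy(self, actions):
            # step (3): post-process the actions and convert them back to a NumPy array
            ...

        def predict(self, observation, deterministic=False):
            obs_tensor = self.obs_to_tensor(observation)
            actions = self._predict(obs_tensor, deterministic=deterministic)  # step (2)
            return self.actions_to_numpy(actions)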

    Motivation

    I would like to introduce some policy classes for which I can calculate the action probabilities and not the actions themselves. (This is for some work on off-policy estimation that I am doing.)

    Let's call this functionality predict_probabilities; at present, its initial logic is identical to step (1) of the predict method. If the code is refactored as suggested, both approaches can use the same pre-processing functionality.

    Additionally, I think the refactor would generally make the code more readable and make it easier to extend parts of the functionality to other similar uses.

    Pitch

    I am happy to do a PR for the proposed refactor, so I would like to know whether or not you would be happy with the proposal.

    Alternatives

    None

    Additional context

    None

    Checklist

    • [x] I have checked that there is no similar issue in the repo (required)
    documentation enhancement 
    opened by tfurmston 18
  • Use python logger instead of print

    Currently, we use self.verbose and print for logging and debug info, but a cleaner way would be to use Python's logging package.
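
    For illustration, the kind of pattern being suggested (a sketch, not a final design):

    import logging

    logger = logging.getLogger(__name__)

    # instead of: if self.verbose > 0: print("Creating environment...")
    logger.info("Creating environment...")
    logger.debug("Rollout buffer size: %d", 2048)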

    enhancement 
    opened by araffin 17
  • [Question] How do I track reward?

    Describe the bug

    It's very curious to me that the verbose option does not print rewards and episode lengths. Am I missing something? For added info, I'm using a custom environment so the standard Monitor wrapper does not work for me.

    Any guidance is much appreciated.

    Code example

    There is no real bug. The wrappers simply expect dictionaries, whereas the custom environment, which I can't easily work on, returns a list. All I need to know is whether there is a simple way to monitor the rewards outside of using wrappers and callbacks.

    import gym
    
    from stable_baselines3 import A2C
    from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize, VecMonitor
    from stable_baselines3.common.monitor import Monitor
    
    env = gym.make( "custom_gym_ad")
    
    env = DummyVecEnv([lambda: env])
    env = VecNormalize(env)
    env.spec = lambda: None
    env.spec.id = "custom_gym_ad"
    env = Monitor(env, filename="a2c_tb_log/")
    
    model = A2C("MlpPolicy", env, verbose=1)
    model.learn(total_timesteps=10000, log_interval=1)
    
    env.close()
    
    Traceback (most recent call last):
      File "/home/user/code/prjct/test/new_test.py", line 16, in <module>
    	model.learn(total_timesteps=10000, log_interval=1)
      File "/home/user/anaconda3/envs/torch3.9/lib/python3.9/site-packages/stable_baselines3/a2c/a2c.py", line 192, in learn
    	return super(A2C, self).learn(
      File "/home/user/anaconda3/envs/torch3.9/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 237, in learn
    	continue_training = self.collect_rollouts(self.env, callback, self.rollout_buffer, n_rollout_steps=self.n_steps)
      File "/home/user/anaconda3/envs/torch3.9/lib/python3.9/site-packages/stable_baselines3/common/on_policy_algorithm.py", line 187, in collect_rollouts
    	self._update_info_buffer(infos)
      File "/home/user/anaconda3/envs/torch3.9/lib/python3.9/site-packages/stable_baselines3/common/base_class.py", line 452, in _update_info_buffer
    	maybe_ep_info = info.get("episode")
    AttributeError: 'list' object has no attribute 'get'
    

    System Info

    Describe the characteristics of your environment:

    • Everything conda installed, except for sb3
    • Python 3.9.7
    • stable-baselines3 1.2.0 (pip install)
    import stable_baselines3 as sb3
    sb3.get_system_info()
    
    AttributeError                            Traceback (most recent call last)
    <ipython-input-1-0ec52a8e58dd> in <module> 
          1 import stable_baselines3 as sb3 
    ----> 2 sb3.get_system_info()                                                                                
    AttributeError: module 'stable_baselines3' has no attribute 'get_system_info'
    
    question custom gym env RTFM 
    opened by biggzlar 7
  • [Question] Factor of 1.5 in KL Divergence-based PPO early stopping

    Question

    Is there a concrete reason for the factor of 1.5 in https://github.com/DLR-RM/stable-baselines3/blob/master/stable_baselines3/ppo/ppo.py#L251 ?

    I saw a similar code snippet in a bunch of implementations, and my best guess is that someone sometime did this for some reason, and then everyone copied it. Or is there some justification for it?

    To my understanding, the actual impact of this factor is minimal, since it amounts to a scaling of a hyperparameter. Still, it makes the code a bit less intuitive with respect to its actual behavior.
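
    For reference, the check in question looks roughly like this (paraphrased from ppo.py; names may differ slightly between versions):

    # inside the PPO update loop
    if self.target_kl is not None and approx_kl_div > 1.5 * self.target_kl:
        continue_training = False
        if self.verbose >= 1:
            print(f"Early stopping at step {epoch} due to reaching max kl: {approx_kl_div:.2f}")
        break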

    Checklist

    • [x] I have read the documentation (required)
    • [x] I have checked that there is no similar issue in the repo (required)
    question 
    opened by RedTachyon 1
  • [Feature Request] RAINBOW

    🚀 Feature

    Implement RAINBOW (DQN + all extensions: noisy-net, double dqn, dueling, c51, prioritized experience replay).

    closes #487 if it is implemented.

    Motivation

    This would allow having DQN with all its extensions while keeping the DQN code simple.

    • [x] I have checked that there is no similar issue in the repo (required)
    enhancement help wanted 
    opened by araffin 0
  • [Question] Why is HER using `achieved_goal` in training?

    Question

    I'm investigating the functionality of the HER method right now. The method requires the observations to be in the form

    observation = {"observation":Box, "desired_goal":Box, "achieved_goal":Box}
    

    This form makes sense to easily replace the desired goal in the sampled transitions from the replay buffer with actually achieved goals.

    However, I do not understand why the RL algorithm uses the achieved_goal as an observation for training. To my understanding, the agent should only get the observation and desired_goal as input and learn the policy from that. Giving it the achieved_goal does not make much sense to me, since this information is only available after the RL step. It would mean that we add information about the next state s_{t+1} to the current state s_t. Am I missing something here, or is there a flaw in the implementation?

    Thank you!

    Additional context

    original HER paper

    Checklist

    • [x] I have read the documentation (required)
    • [x] I have checked that there is no similar issue in the repo (required)
    question 
    opened by JakobThumm 1
  • [Feature Request] To change the way to deal with the logarithm of standard deviation for SAC

    🚀 Feature

    To change the way to deal with the logarithm of standard deviation for SAC.

    Motivation

    In sac/policies.py, lines 169-171, we use

            log_std = self.log_std(latent_pi)
            # Original Implementation to cap the standard deviation
            log_std = th.clamp(log_std, LOG_STD_MIN, LOG_STD_MAX)
    

    However, this may lead to zero gradients when log_std is out of range due to torch.clamp.

    Alternatives

    Replace code above with

            log_std = torch.tanh(log_std)
            log_std = LOG_STD_MIN + 0.5 * (
                LOG_STD_MAX  - LOG_STD_MIN 
            ) * (log_std + 1)
    

    as in rad line 81-84.

    enhancement 
    opened by lzhyu 2
  • Replace deepcopy with copy when returning info object

    Description

    Replacing the call to copy.deepcopy with a call to copy.copy in the step_wait method of DummyVecEnv (line 51):

    Motivation and Context

    closes #618

    • [x] I have raised an issue to propose this change (required for new features and bug fixes)

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    • [ ] Documentation (update in the documentation)

    Checklist:

    • [ ] I've read the CONTRIBUTION guide (required)
    • [x] I have updated the changelog accordingly (required).
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [ ] I have updated the documentation accordingly.
    • [x] I have reformatted the code using make format (required)
    • [x] I have checked the codestyle using make check-codestyle and make lint (required)
    • [x] I have ensured make pytest and make type both pass. (required)
    • [x] I have checked that the documentation builds using make doc (required)

    Note: You can run most of the checks using make commit-checks.

    Note: we are using a maximum length of 127 characters per line

    opened by mieldehabanero 0
  • [Feature Request] Shallow copying to increase performance 2x when using rl-baselines3-zoo's train.py

    🚀 Feature

    Replacing copy.deepcopy with a call to copy.copy when returning the self.buf_infos object in the step_wait method of DummyVecEnv:

       def step_wait(self) -> VecEnvStepReturn:
           for env_idx in range(self.num_envs):
               obs, self.buf_rews[env_idx], self.buf_dones[env_idx], self.buf_infos[env_idx] = self.envs[env_idx].step(
                   self.actions[env_idx]
               )
               if self.buf_dones[env_idx]:
                   # save final observation where user can get it, then reset
                   self.buf_infos[env_idx]["terminal_observation"] = obs
                   obs = self.envs[env_idx].reset()
               self._save_obs(env_idx, obs)
           return (self._obs_from_buf(), np.copy(self.buf_rews), np.copy(self.buf_dones), deepcopy(self.buf_infos))
    

    Motivation

    • rl-baselines3-zoo's train.py script uses by default DummyVecEnv.
    • step_wait is called on every step and spends 56% of its runtime in this call to deepcopy, with the remaining time spent in the actual call to the environment's step function.
    • All of the direct contents of the returned object (info) are replaced with new ones every time the function is called, so a deepcopy (although safer) is not necessary.
    • Reducing by half the time it takes to take a step translates into an almost 50% reduction of training time.
    • This change would affect most people using rl-baselines3-zoo's scripts and should not create any issues.

    Pitch

    Replacing the call to copy.deepcopy with a call to copy.copy in the step_wait method of DummyVecEnv (line 51):
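
    To illustrate the difference being discussed (a quick standalone sketch, unrelated to the SB3 codebase itself): copy.copy creates a new list but reuses the contained dict objects, while copy.deepcopy also clones every dict and everything inside it on each call:

    from copy import copy, deepcopy

    infos = [{"terminal_observation": [1, 2, 3]}]
    shallow = copy(infos)    # new list, same dict objects inside (cheap)
    deep = deepcopy(infos)   # new list and new dicts/contents (much more expensive per step)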

    Alternatives

    Returning the info object as is and not doing any copying at all. This could cause data corruption.

    Additional context

    Profiling two identical runs of 10,000 steps of a custom environment without (left) and with (right) the proposed change, one can clearly see the difference in performance.

    (Screenshots comparing the two profiling runs were attached to the original issue.)

    Checklist

    • [x] I have checked that there is no similar issue in the repo (required)
    enhancement 
    opened by mieldehabanero 1
  • Update to newest Gym version

    Description

    This should fix CI for Gym 0.20.0 with respect to the Atari ROMs.

    Types of changes

    • [x] Bug fix (non-breaking change which fixes an issue)
    • [ ] New feature (non-breaking change which adds functionality)
    • [ ] Breaking change (fix or feature that would cause existing functionality to change)
    • [ ] Documentation (update in the documentation)

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required)
    • [ ] I have updated the changelog accordingly (required).
    • [ ] My change requires a change to the documentation.
    • [ ] I have updated the tests accordingly (required for a bug fix or a new feature).
    • [ ] I have updated the documentation accordingly.
    • [ ] I have reformatted the code using make format (required)
    • [ ] I have checked the codestyle using make check-codestyle and make lint (required)
    • [ ] I have ensured make pytest and make type both pass. (required)
    • [ ] I have checked that the documentation builds using make doc (required)

    Note: You can run most of the checks using make commit-checks.

    Note: we are using a maximum length of 127 characters per line

    opened by jkterry1 4
  • [Question] Regarding implementation of multi env off-policy algorithm (DQN, Replaybuffer)

    Question

    Hi, I am amending the codebase to support multiple envs when training DQN networks, which are off-policy in nature. We need this feature for compliance with PettingZoo (you can check it here). I have started from (#439), and it now seems to run without problems in the CartPole environment.

    Since I want to contribute a fix for this issue, I have some questions regarding the design of the replay buffer.

    • I saw that the size of the replay buffer equals (replay buffer size param * num_envs), which is undoubtedly larger (if num_envs is larger than 1) than the user's intention. But I am not sure whether or not this is your design intention. If we want the size of the replay buffer to be exactly the same as (or close to) the parameter the user has given, I think we would have to set the size to (replay buffer size param / num_envs).
    • Also, as far as I know, the replay buffer has this shape: (buffer ID, env ID, obs dim). So I was wondering if it was your intention to store this buffer in 3 dimensions, keeping experience from different envs in different places. Maybe we could just store all the information without classifying it by env (then the shape would be just (buffer ID, obs dim)). This is just my curiosity; I think classifying them is more interpretable.

    And if you don't mind, maybe I could open a PR and continue the discussion there. Here is my working branch.

    Checklist

    • [x] I have read the documentation (required)
    • [x] I have checked that there is no similar issue in the repo (required)
    question 
    opened by SonSang 2
  • [Bug] get_obs_shape returns wrong shape for MultiBinary spaces

    🐛 Bug

    stable_baselines3.common.preprocessing.get_obs_shape returns the wrong shape when a MultiBinary space is multi-dimensional.

    To Reproduce

    from gym import spaces
    from stable_baselines3.common.preprocessing import get_obs_shape
    
    
    test_multi = spaces.MultiBinary([5, 4, 5])
    
    get_obs_shape(test_multi)
    
    
        151     elif isinstance(observation_space, spaces.MultiBinary):
        152         # Number of binary features
    --> 153         return (int(observation_space.n),)
        154     elif isinstance(observation_space, spaces.Dict):
        155         return {key: get_obs_shape(subspace) for (key, subspace) in observation_space.spaces.items()}
    
    TypeError: int() argument must be a string, a bytes-like object or a number, not 'list'
    
    

    Expected behavior

    get_obs_shape should return observation_space.shape, not int(observation_space.n).
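
    A quick demonstration of the expected value (the space dimensions are only an example):

    from gym import spaces

    test_multi = spaces.MultiBinary([5, 4, 5])
    print(test_multi.shape)  # (5, 4, 5) -- the shape get_obs_shape should return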

    System Info

    Describe the characteristics of your environment:

    • Describe how the library was installed (pip, docker, source, ...): pip
    • GPU models and configuration: none
    • Python version: 3.7.
    • PyTorch version: 1.9.0
    • Gym version: 0.18.3

    Checklist

    • [x] I have checked that there is no similar issue in the repo (required)
    • [x] I have read the documentation (required)
    • [x] I have provided a minimal working example to reproduce the bug (required)
    bug 
    opened by hjarraya 8
Releases(v1.3.0)
  • v1.3.0(Oct 23, 2021)

    WARNING: This version will be the last one supporting Python 3.6 (end of life in Dec 2021). We highly recommend you upgrade to Python >= 3.7.

    SB3-Contrib changelog: https://github.com/Stable-Baselines-Team/stable-baselines3-contrib/releases/tag/v1.3.0

    Breaking Changes:

    • sde_net_arch argument in policies is deprecated and will be removed in a future version.

    • _get_latent (ActorCriticPolicy) was removed

    • All logging keys now use underscores instead of spaces (@timokau). Concretely this changes:

      • time/total timesteps to time/total_timesteps for off-policy algorithms and the eval callback (on-policy algorithms such as PPO and A2C already used the underscored version),
      • rollout/exploration rate to rollout/exploration_rate and
      • rollout/success rate to rollout/success_rate.

    New Features:

    • Added methods get_distribution and predict_values for ActorCriticPolicy for A2C/PPO/TRPO (@cyprienc)
    • Added methods forward_actor and forward_critic for MlpExtractor
    • Added sb3.get_system_info() helper function to gather version information relevant to SB3 (e.g., Python and PyTorch version)
    • Saved models now store the system information on which the agent was trained, and load functions have a print_system_info parameter to help debug loading issues.

    Bug Fixes:

    • Fixed dtype of observations for SimpleMultiObsEnv
    • Allow VecNormalize to wrap discrete-observation environments to normalize reward when observation normalization is disabled.
    • Fixed a bug where DQN would throw an error when using Discrete observation and stochastic actions
    • Fixed a bug where sub-classed observation spaces could not be used
    • Added force_reset argument to load() and set_env() in order to be able to call learn(reset_num_timesteps=False) with a new environment

    Others:

    • Cap gym max version to 0.19 to avoid issues with atari-py and other breaking changes
    • Improved error message when using dict observation with the wrong policy
    • Improved error message when using EvalCallback with two envs not wrapped the same way.
    • Added additional info about supported Python versions for PyPI in setup.py

    Documentation:

    • Add Rocket League Gym to list of supported projects (@AechPro)
    • Added gym-electric-motor to project page (@wkirgsn)
    • Added policy-distillation-baselines to project page (@CUN-bjy)
    • Added ONNX export instructions (@batu)
    • Update read the doc env (fixed docutils issue)
    • Fix PPO environment name (@IljaAvadiev)
    • Fix custom env doc and add env registration example
    • Update algorithms from SB3 Contrib
    • Use underscores for numeric literals in examples to improve clarity
  • v1.2.0(Sep 8, 2021)

    Breaking Changes:

    • SB3 now requires PyTorch >= 1.8.1
    • VecNormalize ret attribute was renamed to returns

    Bug Fixes:

    • Hotfix for VecNormalize where the observation filter was not updated at reset (thanks @vwxyzjn)
    • Fixed model predictions when using batch normalization and dropout layers by calling train() and eval() (@davidblom603)
    • Fixed model training for DQN, TD3 and SAC so that their target nets always remain in evaluation mode (@ayeright)
    • Passing gradient_steps=0 to an off-policy algorithm will result in no gradient steps being taken (vs as many gradient steps as steps done in the environment during the rollout in previous versions)

    Others:

    • Enabled Python 3.9 in GitHub CI
    • Fixed type annotations
    • Refactored predict() by moving the preprocessing to obs_to_tensor() method

    Documentation:

    • Updated multiprocessing example
    • Added example of VecEnvWrapper
    • Added a note about logging to tensorboard more often
    • Added warning about simplicity of examples and link to RL zoo (@MihaiAnca13)
  • v1.1.0(Jul 2, 2021)

    Breaking Changes

    • All custom environments (e.g. the BitFlippingEnv or IdentityEnv) were moved to the stable_baselines3.common.envs folder
    • Refactored HER which is now the HerReplayBuffer class that can be passed to any off-policy algorithm
    • Handle timeout termination properly for off-policy algorithms (when using TimeLimit)
    • Renamed _last_dones and dones to _last_episode_starts and episode_starts in RolloutBuffer.
    • Removed ObsDictWrapper as Dict observation spaces are now supported
      her_kwargs = dict(n_sampled_goal=2, goal_selection_strategy="future", online_sampling=True)
      # SB3 < 1.1.0
      # model = HER("MlpPolicy", env, model_class=SAC, **her_kwargs)
      # SB3 >= 1.1.0:
      model = SAC("MultiInputPolicy", env, replay_buffer_class=HerReplayBuffer, replay_buffer_kwargs=her_kwargs)
    
    • Updated the KL Divergence estimator in the PPO algorithm to be positive definite and have lower variance (@09tangriro)
    • Updated the KL Divergence check in the PPO algorithm to be before the gradient update step rather than after end of epoch (@09tangriro)
    • Removed parameter channels_last from is_image_space as it can be inferred.
    • The logger object is now an attribute model.logger that can be set by the user using model.set_logger()
    • Changed the signature of logger.configure and utils.configure_logger, they now return a Logger object
    • Removed Logger.CURRENT and Logger.DEFAULT
    • Moved warn(), debug(), log(), info(), dump() methods to the Logger class
    • .learn() now throws an import error when the user tries to log to tensorboard but the package is not installed

    New Features

    • Added support for single-level Dict observation space (@JadenTravnik)
    • Added DictRolloutBuffer DictReplayBuffer to support dictionary observations (@JadenTravnik)
    • Added StackedObservations and StackedDictObservations that are used within VecFrameStack
    • Added simple 4x4 room Dict test environments
    • HerReplayBuffer now supports VecNormalize when online_sampling=False
    • Added VecMonitor and VecExtractDictObs wrappers to handle gym3-style vectorized environments (@vwxyzjn)
    • Ignored the terminal observation if it is not provided by the environment, such as in gym3-style vectorized environments. (@vwxyzjn)
    • Added policy_base as input to the OnPolicyAlgorithm for more flexibility (@09tangriro)
    • Added support for image observation when using HER
    • Added replay_buffer_class and replay_buffer_kwargs arguments to off-policy algorithms
    • Added kl_divergence helper for Distribution classes (@09tangriro)
    • Added support for vector environments with num_envs > 1 (@benblack769)
    • Added wrapper_kwargs argument to make_vec_env (@amy12xx)

    Bug Fixes

    • Fixed potential issue when calling off-policy algorithms with default arguments multiple times (the size of the replay buffer would be the same)
    • Fixed loading of ent_coef for SAC and TQC, it was not optimized anymore (thanks @Atlis)
    • Fixed saving of A2C and PPO policy when using gSDE (thanks @liusida)
    • Fixed a bug where no output would be shown even if verbose>=1 after passing verbose=0 once
    • Fixed observation buffers dtype in DictReplayBuffer (@c-rizz)
    • Fixed EvalCallback tensorboard logs being logged with the incorrect timestep. They are now written with the timestep at which they were recorded. (@skandermoalla)

    Others

    • Added flake8-bugbear to tests dependencies to find likely bugs
    • Updated env_checker to reflect support of dict observation spaces
    • Added Code of Conduct
    • Added tests for GAE and lambda return computation
    • Updated distribution entropy test (thanks @09tangriro)
    • Added sanity check batch_size > 1 in PPO to avoid NaN in advantage normalization

    Documentation:

    • Added gym pybullet drones project (@JacopoPan)
    • Added link to SuperSuit in projects (@justinkterry)
    • Fixed DQN example (thanks @ltbd78)
    • Clarified channel-first/channel-last recommendation
    • Update sphinx environment installation instructions (@tom-doerr)
    • Clarified pip installation in Zsh (@tom-doerr)
    • Clarified return computation for on-policy algorithms (TD(lambda) estimate was used)
    • Added example for using ProcgenEnv
    • Added note about advanced custom policy example for off-policy algorithms
    • Fixed DQN unicode checkmarks
    • Updated migration guide (@juancroldan)
    • Pinned docutils==0.16 to avoid issue with rtd theme
    • Clarified callback save_freq definition
    • Added doc on how to pass a custom logger
    • Remove recurrent policies from A2C docs (@bstee615)
  • v1.0(Mar 17, 2021)

    First Major Version

    Blog post: https://araffin.github.io/post/sb3/

    100+ pre-trained models in the zoo: https://github.com/DLR-RM/rl-baselines3-zoo

    Breaking Changes:

    • Removed stable_baselines3.common.cmd_util (already deprecated), please use env_util instead

    Warning

    A refactoring of the HER algorithm is planned together with support for dictionary observations (see PR #243 and #351). This will be a backward-incompatible change (models trained with a previous version of HER won't work with the new version).

    New Features:

    • Added support for custom_objects when loading models

    Bug Fixes:

    • Fixed a bug with DQN predict method when using deterministic=False with image space

    Documentation:

    • Fixed examples
    • Added new project using SB3: rl_reach (@PierreExeter)
    • Added note about slow-down when switching to PyTorch
    • Add a note on continual learning and resetting environment
    • Updated RL-Zoo to reflect the fact that it is more than a collection of trained agents
    • Added images to illustrate the training loop and custom policies (created with https://excalidraw.com/)
    • Updated the custom policy section
  • v1.0rc1(Mar 6, 2021)

  • v0.11.1(Feb 27, 2021)

    Breaking Changes:

    • evaluate_policy now returns rewards/episode lengths from a Monitor wrapper if one is present; this allows returning the unnormalized reward, for instance in the case of Atari games.
    • Renamed common.vec_env.is_wrapped to common.vec_env.is_vecenv_wrapped to avoid confusion with the new is_wrapped() helper
    • Renamed _get_data() to _get_constructor_parameters() for policies (this affects independent saving/loading of policies)
    • Removed n_episodes_rollout and merged it with train_freq, which now accepts a tuple (frequency, unit):
    • replay_buffer in collect_rollout is no more optional
    
      # SB3 < 0.11.0
      # model = SAC("MlpPolicy", env, n_episodes_rollout=1, train_freq=-1)
      # SB3 >= 0.11.0:
      model = SAC("MlpPolicy", env, train_freq=(1, "episode"))
    

    New Features:

    • Add support for VecFrameStack to stack on first or last observation dimension, along with automatic check for image spaces.
    • VecFrameStack now has a channels_order argument to tell if observations should be stacked on the first or last observation dimension (originally always stacked on last).
    • Added common.env_util.is_wrapped and common.env_util.unwrap_wrapper functions for checking/unwrapping an environment for specific wrapper.
    • Added env_is_wrapped() method for VecEnv to check if its environments are wrapped with given Gym wrappers.
    • Added monitor_kwargs parameter to make_vec_env and make_atari_env
    • Wrap the environments automatically with a Monitor wrapper when possible.
    • EvalCallback now logs the success rate when available (is_success must be present in the info dict)
    • Added new wrappers to log images and matplotlib figures to tensorboard. (@zampanteymedio)
    • Add support for text records to Logger. (@lorenz-h)

    Bug Fixes:

    • Fixed bug where code added VecTranspose on channel-first image environments (thanks @qxcv)
    • Fixed DQN predict method when using single gym.Env with deterministic=False
    • Fixed a bug where the argument order of explained_variance() in ppo.py and a2c.py was incorrect (@thisray)
    • Fixed bug where full HerReplayBuffer leads to an index error. (@megan-klaiber)
    • Fixed bug where replay buffer could not be saved if it was too big (> 4 Gb) for python<3.8 (thanks @hn2)
    • Added informative PPO construction error in edge-case scenario where n_steps * n_envs = 1 (size of rollout buffer), which otherwise causes downstream breaking errors in training (@decodyng)
    • Fixed discrete observation space support when using multiple envs with A2C/PPO (thanks @ardabbour)
    • Fixed a bug for TD3 delayed update (the update was off-by-one and not delayed when train_freq=1)
    • Fixed numpy warning (replaced np.bool with bool)
    • Fixed a bug where VecNormalize was not normalizing the terminal observation
    • Fixed a bug where VecTranspose was not transposing the terminal observation
    • Fixed a bug where the terminal observation stored in the replay buffer was not the right one for off-policy algorithms
    • Fixed a bug where action_noise was not used when using HER (thanks @ShangqunYu)
    • Fixed a bug where train_freq was not properly converted when loading a saved model

    Others:

    • Add more issue templates
    • Add signatures to callable type annotations (@ernestum)
    • Improve error message in NatureCNN
    • Added checks for supported action spaces to improve clarity of error messages for the user
    • Renamed variables in the train() method of SAC, TD3 and DQN to match SB3-Contrib.
    • Updated docker base image to Ubuntu 18.04
    • Set tensorboard min version to 2.2.0 (earlier versions are apparently not working with PyTorch)
    • Added warning for PPO when n_steps * n_envs is not a multiple of batch_size (last mini-batch truncated) (@decodyng)
    • Removed some warnings in the tests

    Documentation:

    • Updated algorithm table
    • Minor docstring improvements regarding rollout (@stheid)
    • Fix migration doc for A2C (epsilon parameter)
    • Fix clip_range docstring
    • Fix duplicated parameter in EvalCallback docstring (thanks @tfederico)
    • Added example of learning rate schedule
    • Added SUMO-RL as example project (@LucasAlegre)
    • Fix docstring of classes in atari_wrappers.py which were inside the constructor (@LucasAlegre)
    • Added SB3-Contrib page
    • Fix bug in the example code of DQN (@AptX395)
    • Add example on how to access the tensorboard summary writer directly. (@lorenz-h)
    • Updated migration guide
    • Updated custom policy doc (separate policy architecture recommended)
    • Added a note about OpenCV headless version
    • Corrected typo on documentation (@mschweizer)
    • Provide the environment when loading the model in the examples (@lorepieri8)
  • v0.10.0 (Oct 28, 2020)

    Breaking Changes

    • Warning: Renamed common.cmd_util to common.env_util for clarity (affects make_vec_env and make_atari_env functions)

    New Features

    • Allow custom actor/critic network architectures using net_arch=dict(qf=[400, 300], pi=[64, 64]) for off-policy algorithms (SAC, TD3, DDPG); see the sketch after this list
    • Added Hindsight Experience Replay (HER). (@megan-klaiber)
    • VecNormalize now supports gym.spaces.Dict observation spaces
    • Support logging videos to Tensorboard (@SwamyDev)
    • Added share_features_extractor argument to SAC and TD3 policies
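
    A minimal sketch of the per-network architectures and the share_features_extractor argument for an off-policy algorithm; Pendulum-v0 is only an assumed example environment:

      from stable_baselines3 import SAC

      # Separate layer sizes for the critic (qf) and the actor (pi),
      # and keep separate feature extractors for actor and critic
      model = SAC(
          "MlpPolicy",
          "Pendulum-v0",
          policy_kwargs=dict(
              net_arch=dict(qf=[400, 300], pi=[64, 64]),
              share_features_extractor=False,
          ),
          verbose=1,
      )
      model.learn(total_timesteps=1000)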

    Bug Fixes

    • Fix GAE computation for on-policy algorithms (off-by-one for the last value) (thanks @Wovchena)
    • Fixed potential issue when loading a different environment
    • Fixed the exclude parameter being ignored when recording logs using json, csv or log as the logging format (@SwamyDev)
    • Make make_vec_env support the env_kwargs argument when using an env ID str (@ManifoldFR)
    • Fix model creation initializing CUDA even when device="cpu" is provided
    • Fix check_env not checking if the env has a Dict action space before calling _check_nan (@wmmc88)
    • Update the check for spaces unsupported by Stable Baselines 3 to include checks on the action space (@wmmc88)
    • Fixed a feature extractor bug for the target network where the same net was shared instead of being separate. This bug affects SAC, DDPG and TD3 when using CnnPolicy (or a custom feature extractor)
    • Fixed a bug when passing an environment while loading a saved model with a CnnPolicy: the passed env was not wrapped properly (the bug was introduced when implementing HER, so it should not be present in previous versions)

    Others

    • Improved typing coverage
    • Improved error messages for unsupported spaces
    • Added .vscode to the gitignore

    Documentation

    • Added first draft of migration guide
    • Added intro to imitation library (@shwang)
    • Enabled doc for CnnPolicies
    • Added advanced saving and loading example
    • Added base doc for exporting models
    • Added example for getting and setting model parameters
  • v0.9.0 (Oct 4, 2020)

    Breaking Changes:

    • Removed device keyword argument of policies; use policy.to(device) instead. (@qxcv)
    • Rename BaseClass.get_torch_variables -> BaseClass._get_torch_save_params and BaseClass.excluded_save_params -> BaseClass._excluded_save_params
    • Renamed saved items tensors to pytorch_variables for clarity
    • make_atari_env, make_vec_env and set_random_seed must now be imported from their submodules (and no longer directly from stable_baselines3.common):
    from stable_baselines3.common.cmd_util import make_atari_env, make_vec_env
    from stable_baselines3.common.utils import set_random_seed
    

    New Features:

    • Added unwrap_vec_wrapper() to common.vec_env to extract VecEnvWrapper if needed
    • Added StopTrainingOnMaxEpisodes to callback collection (@xicocaio)
    • Added device keyword argument to BaseAlgorithm.load() (@liorcohen5)
    • Callbacks have access to rollout collection locals as in SB2. (@PartiallyTyped)
    • Added get_parameters and set_parameters for accessing/setting parameters of the agent (see the sketch after this list)
    • Added actor/critic loss logging for TD3. (@mloo3)
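
    A minimal sketch of get_parameters / set_parameters and the new device keyword of load; the environment and the save file name are assumed for illustration:

      from stable_baselines3 import A2C

      model = A2C("MlpPolicy", "CartPole-v1")
      params = model.get_parameters()   # dict of state dicts (policy, optimizers, ...)
      model.set_parameters(params)      # load them back into the agent

      model.save("a2c_cartpole")
      # Load the saved model directly onto a chosen device
      loaded_model = A2C.load("a2c_cartpole", device="cpu")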

    Bug Fixes:

    • Fixed a bug where the environment was reset twice when using evaluate_policy
    • Fix logging of clip_fraction in PPO (@diditforlulz273)
    • Fixed a bug where cuda support was wrongly checked when passing the GPU index, e.g., device="cuda:0" (@liorcohen5)
    • Fixed a bug when the random seed was not properly set on cuda when passing the GPU index

    Others:

    • Improve typing coverage of the VecEnv
    • Fix type annotation of make_vec_env (@ManifoldFR)
    • Removed AlreadySteppingError and NotSteppingError that were not used
    • Fixed typos in SAC and TD3
    • Reorganized functions for clarity in BaseClass (save/load functions close to each other, private functions at top)
    • Clarified docstrings on what is saved and loaded to/from files
    • Simplified save_to_zip_file function by removing duplicate code
    • Store library version along with the saved models
    • DQN loss is now logged

    Documentation:

    • Added StopTrainingOnMaxEpisodes details and example (@xicocaio)
    • Updated custom policy section (added custom feature extractor example)
    • Re-enable sphinx_autodoc_typehints
    • Updated doc style for type hints and remove duplicated type hints
  • v0.8.0 (Aug 3, 2020)

    Breaking Changes:

    • AtariWrapper and other Atari wrappers were updated to match SB2 ones
    • save_replay_buffer now receives as argument the file path instead of the folder path (@tirafesi)
    • Refactored the Critic class for TD3 and SAC; it is now called ContinuousCritic and has an additional parameter n_critics
    • SAC and TD3 now accept an arbitrary number of critics (e.g. policy_kwargs=dict(n_critics=3)) instead of only two previously, as shown in the sketch below
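
    A minimal sketch with three critics; Pendulum-v0 is only an assumed example environment:

      from stable_baselines3 import TD3

      # TD3 (and SAC) now accept an arbitrary number of critics via policy_kwargs
      model = TD3("MlpPolicy", "Pendulum-v0", policy_kwargs=dict(n_critics=3), verbose=1)
      model.learn(total_timesteps=1000)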

    New Features:

    • Added DQN algorithm (@Artemis-Skade); see the sketch after this list
    • Buffer dtype is now set according to action and observation spaces for ReplayBuffer
    • Added a warning (when psutil is available) when the allocation of a buffer may exceed the available memory of the system
    • Saving models now automatically creates the necessary folders and raises appropriate warnings (@PartiallyTyped)
    • Refactored opening paths for saving and loading to use strings, pathlib or io.BufferedIOBase (@PartiallyTyped)
    • Added DDPG algorithm as a special case of TD3.
    • Introduced BaseModel abstract parent for BasePolicy, which critics inherit from.
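
    A minimal sketch of the newly added DQN (CartPole-v1 is only an assumed example; DDPG follows the same interface on continuous-action environments):

      from stable_baselines3 import DQN

      # DQN works on discrete action spaces
      model = DQN("MlpPolicy", "CartPole-v1", verbose=1)
      model.learn(total_timesteps=1000)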

    Bug Fixes:

    • Fixed a bug in the close() method of SubprocVecEnv, causing wrappers further down in the wrapper stack to not be closed. (@NeoExtended)
    • Fix target for updating q values in SAC: the entropy term was not conditioned by terminal states
    • Use cloudpickle.load instead of pickle.load in CloudpickleWrapper. (@shwang)
    • Fixed a bug with orthogonal initialization when bias=False in custom policy (@rk37)
    • Fixed approximate entropy calculation in PPO and A2C. (@andyshih12)
    • Fixed DQN target network sharing feature extractor with the main network.
    • Fixed storing correct dones in on-policy algorithm rollout collection. (@andyshih12)
    • Fixed number of filters in final convolutional layer in NatureCNN to match original implementation.

    Others:

    • Refactored off-policy algorithms to share the same .learn() method
    • Split the collect_rollout() method for off-policy algorithms
    • Added _on_step() for off-policy base class
    • Optimized replay buffer size by removing the need for the next_observations numpy array
    • Optimized polyak updates (1.5-1.95 speedup) through inplace operations (@PartiallyTyped)
    • Switch to black codestyle and added make format, make check-codestyle and commit-checks
    • Ignored errors from newer pytype version
    • Added a check when using gSDE
    • Removed codacy dependency from Dockerfile
    • Added common.sb2_compat.RMSpropTFLike optimizer, which corresponds more closely to the implementation of RMSprop in TensorFlow.

    Documentation:

    • Updated notebook links
    • Fixed a typo in the section of Enjoy a Trained Agent, in RL Baselines3 Zoo README. (@blurLake)
    • Added Unity reacher to the projects page (@koulakis)
    • Added PyBullet colab notebook
    • Fixed typo in PPO example code (@joeljosephjin)
    • Fixed typo in custom policy doc (@RaphaelWag)
  • v0.7.0 (Jun 10, 2020)

    Breaking Changes:

    • render() method of VecEnvs now only accepts one argument: mode

    • Created new file common/torch_layers.py, similar to SB refactoring

      • Contains all PyTorch network layer definitions and feature extractors: MlpExtractor, create_mlp, NatureCNN
    • Renamed BaseRLModel to BaseAlgorithm (along with offpolicy and onpolicy variants)

    • Moved on-policy and off-policy base algorithms to common/on_policy_algorithm.py and common/off_policy_algorithm.py, respectively.

    • Moved PPOPolicy to ActorCriticPolicy in common/policies.py

    • Moved PPO (algorithm class) into OnPolicyAlgorithm (common/on_policy_algorithm.py), to be shared with A2C

    • Moved following functions from BaseAlgorithm:

      • _load_from_file to load_from_zip_file (save_util.py)
      • _save_to_file_zip to save_to_zip_file (save_util.py)
      • safe_mean to safe_mean (utils.py)
      • check_env to check_for_correct_spaces (utils.py; renamed to avoid confusion with the environment checker tools)
    • Moved static function _is_vectorized_observation from common/policies.py to common/utils.py under name is_vectorized_observation.

    • Removed {save,load}_running_average functions of VecNormalize in favor of load/save (see the sketch after this list).

    • Removed use_gae parameter from RolloutBuffer.compute_returns_and_advantage.
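
    A minimal sketch of the save/load replacement for the removed {save,load}_running_average functions; the environment and the file name are assumed for illustration:

      import gym

      from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

      venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
      vec_env = VecNormalize(venv)
      # ... train an agent on vec_env here ...
      vec_env.save("vec_normalize.pkl")

      # Later: restore the normalization statistics onto a fresh VecEnv
      venv = DummyVecEnv([lambda: gym.make("CartPole-v1")])
      vec_env = VecNormalize.load("vec_normalize.pkl", venv)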

    Bug Fixes:

    • Fixed render() method for VecEnvs
    • Fixed seed() method for SubprocVecEnv
    • Fixed loading on GPU for testing when using gSDE and deterministic=False
    • Fixed register_policy to allow re-registering same policy for same sub-class (i.e. assign same value to same key).
    • Fixed a bug where the gradient was passed when using gSDE with PPO/A2C (this does not affect SAC)

    Others:

    • Re-enable unsafe fork start method in the tests (was causing a deadlock with tensorflow)
    • Added a test for seeding SubprocVecEnv and rendering
    • Fixed reference in NatureCNN (pointed to older version with different network architecture)
    • Fixed comments saying "CxWxH" instead of "CxHxW" (same style as in torch docs / commonly used)
    • Added further comments on registering/getting policies ("MlpPolicy", "CnnPolicy").
    • Renamed progress (value going from 1 at the start of training to 0 at the end) to progress_remaining.
    • Added policies.py files for A2C/PPO, which define MlpPolicy/CnnPolicy (renamed ActorCriticPolicies).
    • Added some missing tests for VecNormalize, VecCheckNan and PPO.

    Documentation:

    • Added a paragraph on "MlpPolicy"/"CnnPolicy" and policy naming scheme under "Developer Guide"
    • Fixed second-level listing in changelog
  • v0.6.0 (Jun 1, 2020)

    Breaking Changes:

    • Remove State-Dependent Exploration (SDE) support for TD3
    • Methods were renamed in the logger (see the sketch after this list):
      • logkv -> record, writekvs -> write, writeseq -> write_sequence
      • logkvs -> record_dict, dumpkvs -> dump
      • getkvs -> get_log_dict, logkv_mean -> record_mean
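
    A minimal sketch of the renamed logger API, shown inside a custom callback (the callback name is hypothetical; only record and dump mirror the renames above):

      from stable_baselines3.common.callbacks import BaseCallback

      class LogCustomValue(BaseCallback):  # hypothetical callback, for illustration only
          def _on_step(self) -> bool:
              # Old API: logger.logkv(...) then logger.dumpkvs()
              self.logger.record("custom/num_timesteps", self.num_timesteps)
              if self.num_timesteps % 1000 == 0:
                  self.logger.dump(self.num_timesteps)
              return True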

    New Features:

    • Added env checker (Sync with Stable Baselines); see the sketch after this list
    • Added VecCheckNan and VecVideoRecorder (Sync with Stable Baselines)
    • Added determinism tests
    • Added cmd_util and atari_wrappers
    • Added support for MultiDiscrete and MultiBinary observation spaces (@rolandgvc)
    • Added MultiCategorical and Bernoulli distributions for PPO/A2C (@rolandgvc)
    • Added support for logging to tensorboard (@rolandgvc)
    • Added VectorizedActionNoise for continuous vectorized environments (@PartiallyTyped)
    • Log evaluation in the EvalCallback using the logger
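
    A minimal sketch of the environment checker; it is intended for custom environments, with a standard Gym env used here only as an assumed stand-in:

      import gym

      from stable_baselines3.common.env_checker import check_env

      env = gym.make("CartPole-v1")
      # Warns (or raises) if the environment does not follow the Gym API
      check_env(env, warn=True)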

    Bug Fixes:

    • Fixed a bug that prevented a model trained on CPU from being loaded on GPU
    • Fixed the version number that had a newline included
    • Fixed weird seg fault in docker image due to FakeImageEnv by reducing screen size
    • Fixed sde_sample_freq that was not taken into account for SAC
    • Pass the logger module to BaseCallback, otherwise callbacks cannot write to the one used by the algorithms

    Others:

    • Renamed to Stable-Baselines3
    • Added Dockerfile
    • Sync VecEnvs with Stable-Baselines
    • Update requirement: gym>=0.17
    • Added .readthedoc.yml file
    • Added flake8 and make lint command
    • Added Github workflow
    • Added warning when passing both train_freq and n_episodes_rollout to Off-Policy Algorithms

    Documentation:

    • Added most documentation (adapted from Stable-Baselines)
    • Added link to CONTRIBUTING.md in the README (@kinalmehta)
    • Added gSDE project and update docstrings accordingly
    • Fix TD3 example code block
Owner: DLR-RM - German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RM) - open source projects