High-quality single-file implementations of Deep Reinforcement Learning algorithms with research-friendly features

Overview

CleanRL (Clean Implementation of RL Algorithms)

Meeting Recordings: cleanrl

CleanRL is a Deep Reinforcement Learning library that provides high-quality single-file implementations with research-friendly features. The implementation is clean and simple, yet it can scale to run thousands of experiments using AWS Batch. The highlight features of CleanRL are:

  • 📜 Single-file implementation
    • Every detail about an algorithm is put into the algorithm's own file, making it easier to fully understand the algorithm and do research with it.
  • 📊 Benchmarked Implementation (7+ algorithms and 34+ games at https://benchmark.cleanrl.dev)
  • 📈 Tensorboard Logging
  • 🪛 Local Reproducibility via Seeding
  • 🎮 Videos of Gameplay Capturing
  • 🧫 Experiment Management with Weights and Biases
  • 💸 Cloud Integration with docker and AWS

Good luck, have fun 🚀

Get started

Prerequisites: Python and Poetry.

To run experiments locally, give the following a try:

git clone https://github.com/vwxyzjn/cleanrl.git && cd cleanrl
poetry install

# alternatively, you could use `poetry shell` and do
# `python cleanrl/ppo.py`
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --gym-id CartPole-v0 \
    --total-timesteps 50000

# open another terminal and enter `cd cleanrl/cleanrl`
tensorboard --logdir runs

To use experiment tracking with wandb, run

wandb login # only required for the first time
poetry run python cleanrl/ppo.py \
    --seed 1 \
    --gym-id CartPole-v0 \
    --total-timesteps 50000 \
    --track \
    --wandb-project-name cleanrltest

To run training scripts in other games:

poetry shell

# classic control
python cleanrl/dqn.py --gym-id CartPole-v1
python cleanrl/ppo.py --gym-id CartPole-v1
python cleanrl/c51.py --gym-id CartPole-v1

# atari
poetry install -E atari
python cleanrl/dqn_atari.py --gym-id BreakoutNoFrameskip-v4
python cleanrl/c51_atari.py --gym-id BreakoutNoFrameskip-v4
python cleanrl/ppo_atari.py --gym-id BreakoutNoFrameskip-v4
python cleanrl/apex_dqn_atari.py --gym-id BreakoutNoFrameskip-v4

# pybullet
poetry install -E pybullet
python cleanrl/td3_continuous_action.py --gym-id MinitaurBulletDuckEnv-v0
python cleanrl/ddpg_continuous_action.py --gym-id MinitaurBulletDuckEnv-v0
python cleanrl/sac_continuous_action.py --gym-id MinitaurBulletDuckEnv-v0

# procgen
poetry install -E procgen
python cleanrl/ppo_procgen.py --gym-id starpilot
python cleanrl/ppo_procgen_impala_cnn.py --gym-id starpilot
python cleanrl/ppg_procgen.py --gym-id starpilot
python cleanrl/ppg_procgen_impala_cnn.py --gym-id starpilot

Algorithms Implemented

  • Deep Q-Learning (DQN)
    • dqn.py
      • For discrete action space.
    • dqn_atari.py
      • For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
  • Categorical DQN (C51)
    • c51.py
      • For discrete action space.
    • c51_atari.py
      • For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.
    • c51_atari_visual.py
      • Adds return and q-values visualization for c51_atari.py.
  • Proximal Policy Gradient (PPO)
  • Soft Actor Critic (SAC)
  • Deep Deterministic Policy Gradient (DDPG)
  • Twin Delayed Deep Deterministic Policy Gradient (TD3)
  • Apex Deep Q-Learning (Apex-DQN)
    • apex_dqn_atari_visual.py
      • For playing Atari games. It uses convolutional layers and common atari-based pre-processing techniques.

Open RL Benchmark

Open RL Benchmark by CleanRL is a comprehensive, interactive and reproducible benchmark of deep Reinforcement Learning (RL) algorithms. It uses Weights and Biases to keep track of the experiment data of popular deep RL algorithms (e.g. DQN, PPO, DDPG, TD3) in a variety of games (e.g. Atari, Mujoco, PyBullet, Procgen, Griddly, MicroRTS). The experiment data includes:

Open RL Benchmark has over 1,000 experiments, including runs from other projects, which is overwhelming to present in a single report. Instead, we present the results in separate reports; please click the links below to access them.

We hope it brings a new level of transparency, openness, and reproducibility. Our plan is to benchmark as many algorithms and games as possible. If you are interested, please join us and contribute more algorithms and games. To get started, check out our contribution guide and our roadmap for the Open RL Benchmark.

Cloud integration

Check out the documentation here

Support and get involved

We have a Discord community for support. Feel free to ask questions. Posting in GitHub Issues and PRs is also welcome. Our past video recordings are also available on YouTube.

Contribution

We have a short contribution guide here: https://github.com/vwxyzjn/cleanrl/blob/master/CONTRIBUTING.md. Consider adding new algorithms or testing new games on the Open RL Benchmark (https://benchmark.cleanrl.dev).

Big thanks to all the contributors of CleanRL!

References

I have been heavily inspired by many repos and blog posts. Below is an incomplete list of them.

The following ones helped me a lot with the continuous action space handling:

Citing CleanRL

If you use CleanRL in your work, please cite our technical paper:

@article{huang2021cleanrl,
    title={CleanRL: High-quality Single-file Implementations of Deep Reinforcement Learning Algorithms}, 
    author={Shengyi Huang and Rousslan Fernand Julien Dossa and Chang Ye and Jeff Braga},
    year={2021},
    journal={arXiv preprint arXiv:2111.08819},
    url={https://arxiv.org/abs/2111.08819}
}
Comments
  • Add RPO to CleanRL

    Description

    Types of changes

    • [ ] Bug fix
    • [ ] New feature
    • [x] New algorithm
    • [ ] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [ ] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [x] I have added additional documentation and previewed the changes via mkdocs serve.
      • [x] I have explained note-worthy implementation details.
      • [x] I have explained the logged metrics.
      • [x] I have added links to the original paper and related papers (if applicable).
      • [x] I have added links to the PR related to the algorithm variant.
      • [x] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [x] I have added the learning curves (in PNG format).
      • [x] I have added links to the tracked experiments.
      • [ ] I have updated the overview sections at the docs and the repo
    • [ ] I have updated the tests accordingly (if applicable).
    opened by masud99r 24
  • Issues with applying PPO Impala on Retro Env with regard to running multiple environments

    So what I essentially need is to have something like "venv = ProcgenEnv(num_envs=" ... but for retro.make(). Running multiple retro environments is causing issues for me, and retrowrapper isn't helping. Thank you!

    opened by hlsafin 20
  • Implement Gymnasium-compliant PPO script

    Description

    Types of changes

    • [ ] Bug fix
    • [x] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [x] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [x] I have added additional documentation and previewed the changes via mkdocs serve.
      • [x] I have explained note-worthy implementation details.
      • [x] I have explained the logged metrics.
      • [x] I have added links to the original paper and related papers (if applicable).
      • [x] I have added links to the PR related to the algorithm variant.
      • [x] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [x] I have added the learning curves (in PNG format).
      • [x] I have added links to the tracked experiments.
      • [x] I have updated the overview sections at the docs and the repo
    • [x] I have updated the tests accordingly (if applicable).
    opened by dtch1997 19
  • prototype jax with dqn

    Description

    JAX implementation for DQN Implementation for #220

    Types of changes

    • [ ] Bug fix
    • [ ] New feature
    • [x] New algorithm
    • [x] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • [x] I have updated the documentation and previewed the changes via mkdocs serve.
    • [x] I have updated the tests accordingly (if applicable).

    If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [x] I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [x] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [x] I have added additional documentation and previewed the changes via mkdocs serve.
      • [x] I have explained note-worthy implementation details.
      • [x] I have explained the logged metrics.
      • [x] I have added links to the original paper and related papers (if applicable).
      • [x] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [x] I have added the learning curves (in PNG format with width=500 and height=300).
      • [x] I have added links to the tracked experiments.
    • [x] I have updated the tests accordingly (if applicable).
    opened by kinalmehta 19
  • DDPG/TD3 target_actor output clip

    Problem Description

    Hi! It seems that the output of target_actor in DDPG/TD3 is directly clipped to fit the action range boundaries, without multiplying by max_action. But in Fujimoto's DDPG/TD3 code [1] and some other implementations, max_action is multiplied into the actor's last tanh layer, so they don't use clip. Have you ever tried the second implementation?

    if global_step > args.learning_starts:
        data = rb.sample(args.batch_size)
        with torch.no_grad():
            next_state_actions = (target_actor(data.next_observations)).clamp(
                envs.single_action_space.low[0], envs.single_action_space.high[0]
            )
            qf1_next_target = qf1_target(data.next_observations, next_state_actions)
            next_q_value = data.rewards.flatten() + (1 - data.dones.flatten()) * args.gamma * (qf1_next_target).view(-1)

    [1] https://github.com/sfujim/TD3
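
    For reference, here is a minimal sketch of the second implementation the question describes (an illustrative assumption, not CleanRL's actual code): the actor squashes its output with tanh and multiplies by max_action, so the target action stays in range without an explicit clamp.

    import torch
    import torch.nn as nn

    class TanhScaledActor(nn.Module):
        """Illustrative actor that bounds actions via max_action * tanh instead of clipping."""

        def __init__(self, obs_dim: int, act_dim: int, max_action: float):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(obs_dim, 256), nn.ReLU(),
                nn.Linear(256, 256), nn.ReLU(),
                nn.Linear(256, act_dim),
            )
            self.max_action = max_action

        def forward(self, obs: torch.Tensor) -> torch.Tensor:
            # tanh keeps the raw output in [-1, 1]; scaling maps it to [-max_action, max_action]
            return self.max_action * torch.tanh(self.net(obs))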

    opened by huxiao09 19
  • Using jax scan for PPO + atari + envpool XLA

    Description

    Modifying the code to use jax.lax.scan for fast compile time and small speed improvement.

    The loss metrics of this pull request (blue) are consistent with the original version (green).

    The performance is similar to the original with a slight speed improvement.

    The commands used are python cleanrl/ppo_atari_envpool_xla_jax_scan.py --env-id Breakout-v5 --total-timesteps 10000000 --num-envs 32 --seed 111 (blue) and python cleanrl/ppo_atari_envpool_xla_jax.py --env-id Breakout-v5 --total-timesteps 10000000 --num-envs 32 --seed 111 (green).
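
    For context, the pattern being adopted looks roughly like the sketch below (illustrative only, not the PR's actual code): the per-step body is traced once and the loop runs inside a single compiled jax.lax.scan call, which is what shortens compile time relative to an unrolled Python loop.

    import jax
    import jax.numpy as jnp

    def step(carry, _):
        # carry threads loop state through the scan; here a running return and an RNG key
        total, key = carry
        key, subkey = jax.random.split(key)
        reward = jax.random.uniform(subkey)   # stand-in for one environment/update step
        return (total + reward, key), reward  # (new carry, per-step output)

    (final_total, _), rewards = jax.lax.scan(
        step, (jnp.float32(0.0), jax.random.PRNGKey(0)), None, length=128
    )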

    Types of changes

    • [ ] Bug fix
    • [x] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • [x] I have updated the documentation and previewed the changes via mkdocs serve.
    • [x] I have updated the tests accordingly (if applicable).

    If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [x] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [x] I have added additional documentation and previewed the changes via mkdocs serve.
      • [x] I have explained note-worthy implementation details.
      • [x] I have explained the logged metrics.
      • [x] I have added links to the original paper and related papers (if applicable).
      • [x] I have added links to the PR related to the algorithm variant.
      • [x] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [x] I have added the learning curves (in PNG format).
      • [x] I have added links to the tracked experiments.
      • [ ] I have updated the overview sections at the docs and the repo
    • [x] I have updated the tests accordingly (if applicable).
    opened by 51616 17
  • ppo with timeout handling

    Description

    Closes #198

    Types of changes

    • [ ] Bug fix
    • [x] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [ ] I have contacted @vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [ ] I have added additional documentation and previewed the changes via mkdocs serve.
      • [ ] I have explained note-worthy implementation details.
      • [ ] I have explained the logged metrics.
      • [ ] I have added links to the original paper and related papers (if applicable).
      • [ ] I have added links to the PR related to the algorithm.
      • [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [ ] I have added the learning curves (in PNG format with width=500 and height=300).
      • [ ] I have added links to the tracked experiments.
    • [ ] I have updated the tests accordingly (if applicable).
    opened by Howuhh 13
  • Documentation Site

    Problem Description

    Although CleanRL generally has a simple implementation, it would be desirable to have a documentation site for some situations. For example, I'm not sure where to put instructions on how to start and resume training with CleanRL's scripts. See #33, #14.

    opened by vwxyzjn 13
  • Add PPO Atari LSTM example

    opened by vwxyzjn 12
  • TD3: fixed dimension of clipped_noise for target actions, added noise …

    Description

    Closes #279.

    • td3_continuous_action.py: noise sampled to compute the Q network update target matches action dimensions in the buffer
    • td3_continuous_action.py: aforementioned noise is also scaled to match the scaling range of the actions.

    Types of changes

    • [x] Bug fix
    • [ ] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • ~~[ ] I have updated the documentation and previewed the changes via mkdocs serve.~~
    • [x] I have updated the tests accordingly (if applicable).

    If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [x] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [x] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [x] I have added additional documentation and previewed the changes via mkdocs serve.
      • [x] I have explained note-worthy implementation details.
      • [x] I have explained the logged metrics.
      • [x] I have added links to the original paper and related papers (if applicable).
      • [x] I have added links to the PR related to the algorithm.
      • [x] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [x] I have added the learning curves (in PNG format with width=500 and height=300).
      • [x] I have added links to the tracked experiments.
      • [x] I have updated the overview sections at the docs and the repo
    • [x] I have updated the tests accordingly (if applicable).
    opened by dosssman 11
  • Hyperparameter optimization

    Description

    This PR adds a first pass of hyperparameter optimization.

    The API design roughly looks like

    import optuna
    from cleanrl_utils.tuner import Tuner
        
    tuner = Tuner(
        script="cleanrl/ppo.py",
        metric="charts/episodic_return",
        metric_last_n_average_window=50,
        direction="maximize",
        target_scores={
            "CartPole-v1": [0, 500],
            "Acrobot-v1": [-500, 0],
        },
        params_fn=lambda trial: {
            "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
            "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
            "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4]),
            "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
            "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
            "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
            "total-timesteps": 10000,
            "num-envs": 16,
        },
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
        wandb_kwargs={"project": "cleanrl"},
    )
    tuner.tune(
        num_trials=10,
        num_seeds=3,
    )
    

    Preliminary docs are available at https://cleanrl-jlu83xh5n-vwxyzjn.vercel.app/advanced/hyperparameter-tuning/

    Types of changes

    • [ ] Bug fix
    • [x] New feature
    • [ ] New algorithm
    • [x] Documentation

    Checklist:

    • [x] I've read the CONTRIBUTION guide (required).
    • [x] I have ensured pre-commit run --all-files passes (required).
    • [x] I have updated the documentation and previewed the changes via mkdocs serve.
    • [x] I have updated the tests accordingly (if applicable).
    opened by vwxyzjn 11
  • Deprecate `ppo_procgen.py` in favor of EnvPool

    Problem Description

    Given the EnvPool==0.8.0 release by @YukunJ, @LeoGuo98, @Trinkle23897 (https://github.com/sail-sg/envpool/pull/197), we can go ahead and deprecate ppo_procgen.py in favor of #338, which should also work with procgen but gives us the benefit of JAX, EnvPool's Async API, and a more concise codebase.

    opened by vwxyzjn 1
  • PPO with machado preprocessing

    Description

    Types of changes

    • [ ] Bug fix
    • [ ] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [ ] I've read the CONTRIBUTION guide (required).
    • [ ] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [ ] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [ ] I have added additional documentation and previewed the changes via mkdocs serve.
      • [ ] I have explained note-worthy implementation details.
      • [ ] I have explained the logged metrics.
      • [ ] I have added links to the original paper and related papers (if applicable).
      • [ ] I have added links to the PR related to the algorithm variant.
      • [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [ ] I have added the learning curves (in PNG format).
      • [ ] I have added links to the tracked experiments.
      • [ ] I have updated the overview sections at the docs and the repo
    • [ ] I have updated the tests accordingly (if applicable).
    opened by vwxyzjn 3
  • What is the reason for returning mean in SAC get_action function if it's never used?

    Problem Description

    In the script sac_continuous_action.py, the get_action function in the Actor class returns action, log_prob, mean. action and log_prob are used, but mean is never used. Is there a reason to return that value when it's never used in the code? As a newcomer, it's a little confusing why that is needed.

    Checklist

    Current Behavior

    Works as expected

    Expected Behavior

    Works as expected

    Possible Solution

    Remove the mean returned in the get_action function in Actor class

    opened by sudonymously 0
  • Cleanrl for MARL

    Contribution to MARL

    I would like to contribute to the CleanRL repo by extending RL algorithms to Multi-Agent Systems (i.e., MARL). I have discussed this with @vwxyzjn, and he suggested starting an issue here. If anyone is interested in contributing to MARL, please respond here. Going forward, we can lay out the roadmap and share the responsibilities.

    Thank you.

    opened by vbaddam 8
  • Torchx integration

    Description

    Our current cloud integration is pretty hacky. I haven't seen anyone use it, and it has been a maintenance burden for us. Using a more managed utility to launch experiments in the cloud is desirable. There are two primary contenders; here are their pros and cons:

    • torchx
      • ✅ support for slurm
      • ✅ support for running tasks locally
      • ✅ the docker image is automatically pushed with a hash for AWS Batch
      • ❌ still need to spin up cloud resources (e.g., aws batch), which is complicated but can be mitigated by using terraform
    • skypilot
      • ✅ support for managing spot instances and auto resume them
      • compare pricing
      • ✅ debuggability via sky ssh mycluster
        • ✅ good for folks who don't always have a GPU machine
      • ❌ need to wait for the clusters to be spun up

    Both of them:

    • ✅ support for aws, gcp, azure

    This PR

    Better cloud integration utility by leveraging torchx. It should really be an elegant solution for us and has the following benefits:

    • we can deprecate our cloud utilities and release ourselves from their maintenance burden
    • support for slurm, kubernetes, aws batch, gcp (https://github.com/pytorch/torchx/issues/410#issuecomment-1301186265) and others

    Give it a try by running

    poetry run torchx run --scheduler local_docker utils.python --gpu 1 --script cleanrl/cleanrl.py
    poetry run torchx run --scheduler aws_batch --scheduler_args queue=c5a-large,image_repo=vwxyzjn/cleanrl  utils.python  --script cleanrl/ppo.py
    poetry run torchx status aws_batch://torchx/c5a-large:torchx_utils_python-pn9sx3wzq0qcwd
    

    Types of changes

    • [ ] Bug fix
    • [x] New feature
    • [ ] New algorithm
    • [ ] Documentation

    Checklist:

    • [ ] I've read the CONTRIBUTION guide (required).
    • [ ] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithm variants or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [ ] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [ ] I have added additional documentation and previewed the changes via mkdocs serve.
      • [ ] I have explained note-worthy implementation details.
      • [ ] I have explained the logged metrics.
      • [ ] I have added links to the original paper and related papers (if applicable).
      • [ ] I have added links to the PR related to the algorithm variant.
      • [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [ ] I have added the learning curves (in PNG format).
      • [ ] I have added links to the tracked experiments.
      • [ ] I have updated the overview sections at the docs and the repo
    • [ ] I have updated the tests accordingly (if applicable).
    opened by vwxyzjn 1
  • Brax + PPO integration

    Description

    Test out integration with brax. It seems to work out of the box without having to implement observation normalization — https://wandb.ai/costa-huang/cleanRL/runs/2aemjwey?workspace=user-costa-huang

    Compilation takes ~400 seconds, and getting 6000 rewards in Ant takes about 100 seconds with a GPU. In comparison, the official demo takes 30 seconds to compile and about 80 seconds to reach ~8000 rewards (using a TPU, I presume). Our compilation takes significantly longer, most likely because we didn't use lax.scan or jax.lax.fori_loop, but once compilation finishes the SPS is about 600k.

    CC @joaogui1

    Types of changes

    • [ ] Bug fix
    • [ ] New feature
    • [x] New algorithm
    • [ ] Documentation

    Checklist:

    • [ ] I've read the CONTRIBUTION guide (required).
    • [ ] I have ensured pre-commit run --all-files passes (required).
    • [ ] I have updated the documentation and previewed the changes via mkdocs serve.
    • [ ] I have updated the tests accordingly (if applicable).

    If you are adding new algorithms or your change could result in performance difference, you may need to (re-)run tracked experiments. See https://github.com/vwxyzjn/cleanrl/pull/137 as an example PR.

    • [ ] I have contacted vwxyzjn to obtain access to the openrlbenchmark W&B team (required).
    • [ ] I have tracked applicable experiments in openrlbenchmark/cleanrl with --capture-video flag toggled on (required).
    • [ ] I have added additional documentation and previewed the changes via mkdocs serve.
      • [ ] I have explained note-worthy implementation details.
      • [ ] I have explained the logged metrics.
      • [ ] I have added links to the original paper and related papers (if applicable).
      • [ ] I have added links to the PR related to the algorithm.
      • [ ] I have created a table comparing my results against those from reputable sources (i.e., the original paper or other reference implementation).
      • [ ] I have added the learning curves (in PNG format with width=500 and height=300).
      • [ ] I have added links to the tracked experiments.
      • [ ] I have updated the overview sections at the docs and the repo
    • [ ] I have updated the tests accordingly (if applicable).
    opened by vwxyzjn 1
Releases (v1.0.0)
  • v1.0.0 (Nov 14, 2022)

    🎉 We are thrilled to announce the CleanRL v1.0.0 release. Along with our CleanRL paper's recent publication in the Journal of Machine Learning Research, our v1.0.0 release includes reworked documentation, new algorithm variants, support for Google's ML framework JAX, hyperparameter tuning utilities, and more. CleanRL has come a long way in making high-quality deep reinforcement learning implementations easy to understand and reproducible. This release is a major milestone for the project and we are excited to share it with you. Over 90 PRs were merged to make this release possible, and we would like to thank all the contributors who made it happen.

    Reworked documentation

    One of the biggest changes of the v1 release is the added documentation at docs.cleanrl.dev. Having great documentation is important for building a reliable and reproducible project. We have reworked the documentation to make it easier to understand and use. For each implemented algorithm, we have documented as much as we can to promote transparency:

    Here is a list of the algorithm variants and their documentation:

    | Algorithm | Variants Implemented |
    | ----------- | ----------- |
    | ✅ Proximal Policy Gradient (PPO) | ppo.py, docs |
    | | ppo_atari.py, docs |
    | | ppo_continuous_action.py, docs |
    | | ppo_atari_lstm.py, docs |
    | | ppo_atari_envpool.py, docs |
    | | ppo_atari_envpool_xla_jax.py, docs |
    | | ppo_procgen.py, docs |
    | | ppo_atari_multigpu.py, docs |
    | | ppo_pettingzoo_ma_atari.py, docs |
    | | ppo_continuous_action_isaacgym.py, docs |
    | ✅ Deep Q-Learning (DQN) | dqn.py, docs |
    | | dqn_atari.py, docs |
    | | dqn_jax.py, docs |
    | | dqn_atari_jax.py, docs |
    | ✅ Categorical DQN (C51) | c51.py, docs |
    | | c51_atari.py, docs |
    | ✅ Soft Actor-Critic (SAC) | sac_continuous_action.py, docs |
    | ✅ Deep Deterministic Policy Gradient (DDPG) | ddpg_continuous_action.py, docs |
    | | ddpg_continuous_action_jax.py, docs |
    | ✅ Twin Delayed Deep Deterministic Policy Gradient (TD3) | td3_continuous_action.py, docs |
    | | td3_continuous_action_jax.py, docs |
    | ✅ Phasic Policy Gradient (PPG) | ppg_procgen.py, docs |
    | ✅ Random Network Distillation (RND) | ppo_rnd_envpool.py, docs |

    We also improved the contribution guide to make it easier for new contributors to get started. We are still working on improving the documentation. If you have any suggestions, please let us know in the GitHub Issues.

    New algorithm variants, support for JAX

    We now support JAX-based learning algorithm variants, which are usually faster than their torch equivalents! Here are the docs of the new JAX-based DQN, TD3, and DDPG implementations:

    For example, below is the benchmark of DDPG + JAX (see the docs here for further detail):

    Other new algorithm variants include multi-GPU PPO, a PPO prototype that works with Isaac Gym, multi-agent Atari PPO, and refactored PPG and PPO-RND implementations.

    Tooling improvements

    We love tools! The v1.0.0 release comes with a series of DevOps improvements, including pre-commit utilities and CI integration with GitHub to run end-to-end test cases. We also made available a new hyperparameter tuning tool and a new tool for running benchmark experiments.

    DevOps

    We added a pre-commit utility to help contributors format their code, check spelling, and remove unused variables and imports before submitting a pull request (see the Contribution guide for more detail).

    To ensure our single-file implementations can run without error, we also added a CI/CD pipeline, which now runs end-to-end test cases for all the algorithm variants. The pipeline also tests builds across different operating systems, such as Linux, macOS, and Windows (see here for an example). GitHub Actions are free for open source projects, and we are very happy to have this tool to help us maintain the project.

    Hyperparameter tuning utilities

    We now have preliminary support for hyperparameter tuning via optuna (see docs), which is designed to help researchers find a single set of hyperparameters that works well for a class of games. The current API looks like this:

    import optuna
    from cleanrl_utils.tuner import Tuner
    tuner = Tuner(
        script="cleanrl/ppo.py",
        metric="charts/episodic_return",
        metric_last_n_average_window=50,
        direction="maximize",
        aggregation_type="average",
        target_scores={
            "CartPole-v1": [0, 500],
            "Acrobot-v1": [-500, 0],
        },
        params_fn=lambda trial: {
            "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
            "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
            "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4, 8]),
            "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
            "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
            "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
            "total-timesteps": 100000,
            "num-envs": 16,
        },
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
        sampler=optuna.samplers.TPESampler(),
    )
    tuner.tune(
        num_trials=100,
        num_seeds=3,
    )
    

    Benchmarking utilities

    We also added a new tool for running benchmark experiments. The tool is designed to help researchers quickly run benchmark experiments across different algorithms and environments with multiple random seeds. The tool lives in the cleanrl_utils.benchmark module, and users can run commands such as:

    OMP_NUM_THREADS=1 xvfb-run -a python -m cleanrl_utils.benchmark \
        --env-ids CartPole-v1 Acrobot-v1 MountainCar-v0 \
        --command "poetry run python cleanrl/ppo.py --cuda False --track --capture-video" \
        --num-seeds 3 \
        --workers 5
    

    which will run the ppo.py script with the --cuda False --track --capture-video arguments across 3 random seeds for 3 environments. It uses multiprocessing to create a pool of 5 workers that run the experiments in parallel.
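
    To make the dispatch pattern concrete, here is a rough sketch of such a worker pool (an assumption for illustration, not the actual cleanrl_utils.benchmark code); the per-run flags --env-id and --seed are assumed to match the training scripts.

    import shlex
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    def run(cmd: str) -> int:
        # launch one training run and return its exit code
        return subprocess.run(shlex.split(cmd)).returncode

    env_ids = ["CartPole-v1", "Acrobot-v1", "MountainCar-v0"]
    base = "poetry run python cleanrl/ppo.py --cuda False --track --capture-video"
    commands = [f"{base} --env-id {e} --seed {s}" for e in env_ids for s in range(1, 4)]

    # a pool of 5 workers runs the 9 commands in parallel
    with ThreadPoolExecutor(max_workers=5) as pool:
        exit_codes = list(pool.map(run, commands))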

    What’s next?

    It is an exciting time and new improvements are coming to CleanRL. We plan to add more JAX-based implementations, Hugging Face integration, some RLops prototypes, and support for Gymnasium. CleanRL is a community-based project and we always welcome new contributors. If there is an algorithm or new feature you would like to contribute, feel free to chat with us on our Discord channel or raise a GitHub issue.

    More JAX implementations

    More JAX-based implementations are coming. Antonin Raffin, the core maintainer of Stable-Baselines3, SBX, and rl-baselines3-zoo, is contributing an optimized Soft Actor-Critic implementation in JAX (vwxyzjn/cleanrl#300) as well as TD3+TQC and DroQ (vwxyzjn/cleanrl#272). These are incredibly exciting new algorithms. For example, DroQ is extremely sample efficient and can obtain ~5000 return in HalfCheetah-v3 in just 100k steps (tracked sbx experiment).

    Huggingface integration

    Huggingface Hub 🤗 is a great platform for sharing and collaborating on models. We are working on a new integration with Huggingface Hub to make it easier for researchers to share their RL models and benchmark them against other models (vwxyzjn/cleanrl#292). Stay tuned! In the future, we will have a simple snippet for loading models like the one below:

    import random
    from typing import Callable
    
    import gym
    import numpy as np
    import torch
    
    
    def evaluate(
        model_path: str,
        make_env: Callable,
        env_id: str,
        eval_episodes: int,
        run_name: str,
        Model: torch.nn.Module,
        device: torch.device,
        epsilon: float = 0.05,
        capture_video: bool = True,
    ):
        envs = gym.vector.SyncVectorEnv([make_env(env_id, 0, 0, capture_video, run_name)])
        model = Model(envs).to(device)
        model.load_state_dict(torch.load(model_path))
        model.eval()
    
        obs = envs.reset()
        episodic_returns = []
        while len(episodic_returns) < eval_episodes:
            if random.random() < epsilon:
                actions = np.array([envs.single_action_space.sample() for _ in range(envs.num_envs)])
            else:
                q_values = model(torch.Tensor(obs).to(device))
                actions = torch.argmax(q_values, dim=1).cpu().numpy()
            next_obs, _, _, infos = envs.step(actions)
            for info in infos:
                if "episode" in info.keys():
                    print(f"eval_episode={len(episodic_returns)}, episodic_return={info['episode']['r']}")
                    episodic_returns += [info["episode"]["r"]]
            obs = next_obs
    
        return episodic_returns
    
    
    if __name__ == "__main__":
        from huggingface_hub import hf_hub_download
    
        from cleanrl.dqn import QNetwork, make_env
    
        model_path = hf_hub_download(repo_id="cleanrl/CartPole-v1-dqn-seed1", filename="q_network.pth")
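        # a plausible completion (assumed, not verbatim from the release notes): pass the
        # downloaded weights to the evaluate() helper defined above
        episodic_returns = evaluate(
            model_path,
            make_env,
            "CartPole-v1",
            eval_episodes=10,
            run_name="eval",
            Model=QNetwork,
            device="cpu",
            capture_video=False,
        )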
    

    RLops

    How do we know the effect of a new feature or bug fix? DRL is brittle and has a series of reproducibility issues; even bug fixes can sometimes introduce performance regressions (e.g., see how a bug fix of contact force in MuJoCo results in worse performance for PPO). Therefore, it is essential to understand how the proposed changes impact the performance of the algorithms.

    We are working on a prototype tool that allows us to compare the performance of the library across different versions of the tracked experiments (vwxyzjn/cleanrl#307). With this tool, we can confidently merge new features and bug fixes without worrying about introducing catastrophic regressions. Users can run commands such as:

    python -m cleanrl_utils.rlops --exp-name ddpg_continuous_action \
        --wandb-project-name cleanrl \
        --wandb-entity openrlbenchmark \
        --tags 'pr-299' 'rlops-pilot' \
        --env-ids HalfCheetah-v2 Walker2d-v2 Hopper-v2 InvertedPendulum-v2 Humanoid-v2 Pusher-v2 \
        --output-filename compare.png \
        --scan-history \
        --metric-last-n-average-window 100 \
        --report
    

    which generates a comparison image saved as compare.png, along with a report.

    Support for Gymnasium

    Farama-Foundation/Gymnasium is the next generation of openai/gym and will continue to be maintained and gain new features. Please see their announcement for further detail. We are migrating to Gymnasium, and the progress can be tracked in vwxyzjn/cleanrl#277.
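
    As a reminder of what the migration entails, the core API difference is sketched below (standard Gymnasium semantics, not CleanRL-specific code): reset accepts a seed and returns (obs, info), and step splits done into terminated and truncated.

    import gymnasium as gym

    env = gym.make("CartPole-v1")
    obs, info = env.reset(seed=1)  # reset now returns (obs, info) and accepts a seed
    done = False
    while not done:
        action = env.action_space.sample()
        # step returns a 5-tuple instead of gym's 4-tuple
        obs, reward, terminated, truncated, info = env.step(action)
        done = terminated or truncated  # an episode ends on either condition
    env.close()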

    Also, the Farama Foundation is working on a project called Shimmy, which offers conversion wrappers for deepmind/dm_env environments, such as dm_control and deepmind/lab. This is an exciting project that will allow us to support deepmind/dm_env in the future.

    Contributions

    CleanRL has benefited from the contributions of many awesome folks. I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc @kinalmehta @bragajj for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @cool-RR, @Howuhh, @jseppanen, @joaogui1, @ALPH2H, @ElliotMunro200, @WillDudley, and @sdpkjc.

    We always welcome new contributors to the project. If you are interested in contributing to CleanRL (e.g., new features, bug fixes, new algorithms), please check out our reworked contributing guide.

    New CleanRL Supported Publications

    • Md Masudur Rahman and Yexiang Xue. "Bootstrap Advantage Estimation for Policy Optimization in Reinforcement Learning." In Proceedings of the IEEE International Conference on Machine Learning and Applications (ICMLA), 2022. https://arxiv.org/pdf/2210.07312.pdf
    • Weng, Jiayi, Min Lin, Shengyi Huang, Bo Liu, Denys Makoviichuk, Viktor Makoviychuk, Zichen Liu et al. "Envpool: A highly parallel reinforcement learning environment execution engine." In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=BubxnHpuMbG
    • Huang, Shengyi, Rousslan Fernand Julien Dossa, Antonin Raffin, Anssi Kanervisto, and Weixun Wang. "The 37 Implementation Details of Proximal Policy Optimization." International Conference on Learning Representations 2022 Blog Post Track, https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/
    • Huang, Shengyi, and Santiago Ontañón. "A closer look at invalid action masking in policy gradient algorithms." The International FLAIRS Conference Proceedings, 35. https://journals.flvc.org/FLAIRS/article/view/130584
    • Schmidt, Dominik, and Thomas Schmied. "Fast and Data-Efficient Training of Rainbow: an Experimental Study on Atari." Deep Reinforcement Learning Workshop at the 35th Conference on Neural Information Processing Systems, https://arxiv.org/abs/2111.10247

    New Features PR

    • prototype jax with ddpg by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/187
    • Isaac Gym Envs PPO updates by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/233
    • JAX TD3 prototype by @joaogui1 in https://github.com/vwxyzjn/cleanrl/pull/225
    • prototype jax with dqn by @kinalmehta in https://github.com/vwxyzjn/cleanrl/pull/222
    • Poetry 1.2 by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/271
    • Add rnd_ppo.py documentation and refactor by @yooceii in https://github.com/vwxyzjn/cleanrl/pull/151
    • Hyperparameter optimization by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/228
    • Update the hyperparameter optimization example script by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/268
    • Export requirements.txt automatically by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/143
    • Auto-upgrade syntax via pyupgrade by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/158
    • Introduce benchmark utilities by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/165
    • Match PPG implementation by @Dipamc77 in https://github.com/vwxyzjn/cleanrl/pull/186
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppg/
    • Proper multi-gpu support with PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/178
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_multigpupy
    • Support Pettingzoo Multi-agent Atari envs with PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/188
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_pettingzoo_ma_ataripy

    Bug fix and refactoring PR

    • Let ppo_continuous_action.py only run 1M steps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/161
    • Change ppo.py's default timesteps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/164
    • Enable video recording for ppo_procgen.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/166
    • Refactor replay based scripts by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/173
    • Td3 ddpg action bound fix by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/211
    • added gamma to reward normalization wrappers by @Howuhh in https://github.com/vwxyzjn/cleanrl/pull/209
    • Seed envpool environment explicitly by @jseppanen in https://github.com/vwxyzjn/cleanrl/pull/238
    • Fix PPO + Isaac Gym Benchmark Script by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/243
    • Fix for noise sampling for the TD3 exploration by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/260

    Documentation PR

    • Add a note on PPG's performance by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/199
    • Clarify CleanRL is a non-modular library by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/200
    • Fix documentation link by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/213
    • JAX + DDPG docs fix by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/229
    • Fix links in docs for ppo_continuous_action_isaacgym.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/242
    • Fix docs (badge, TD3 + JAX, and DQN + JAX) by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/246
    • Fix typos by @ALPH2H in https://github.com/vwxyzjn/cleanrl/pull/282
    • Fix docs links in README.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/254
    • chore: remove unused parameters in jax implementations by @kinalmehta in https://github.com/vwxyzjn/cleanrl/pull/264
    • Add ddpg_continuous_action.py docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/137
    • Fix DDPG docs' description by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/139
    • Fix typo in DDPG docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/140
    • Fix incorrect links in the DDPG docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/142
    • DDPG documentation tweaks; added Q loss equations and light explanation by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/145
    • Add dqn_atari.py documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/124
    • Add documentation for td3_continuous_action.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/141
    • SAC Documentation - Benchmarks - Minor code tweaks by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/146
    • Add docs for c51.py and c51_atari.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/159
    • Add docs for dqn.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/157
    • Address stale documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/169
    • Documentation improvement - fix links and mkdocs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/181
    • Improve documentation and contribution guide by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/189
    • Fix documentation links in README.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/192
    • Fix the implemented variants section in PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/193

    Misc PR

    • Show correct exception cause by @cool-RR in https://github.com/vwxyzjn/cleanrl/pull/205
    • Remove pettingzoo's pistonball example by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/214
    • Leverage CI to speed up poetry lock by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/235
    • Ubuntu runner for poetry lock by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/236
    • Remove the github pages CI in favor of vercel by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/241
    • Clarify LICENSE info by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/253
    • Update published paper citation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/284
    • Refactor dqn word choice by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/257
    • Add Pull Request template by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/122
    • Amend license to give proper attribution by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/152
    • Introduce better contribution guide by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/154
    • Fix the default wandb project name in ppo_atari_envpool.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/160
    • Removes unmaintained scripts by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/170
    • Add PPO documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/163
    • Add docs header by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/174
    • Update README.md by @ElliotMunro200 in https://github.com/vwxyzjn/cleanrl/pull/177
    • Update issue_template.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/180
    • Temporarily Remove PPO-RND by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/190.

    Contributors

    • @ElliotMunro200 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/177
    • @dipamc made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/186
    • @cool-RR made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/205
    • @Howuhh made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/209
    • @jseppanen made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/238
    • @joaogui1 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/225
    • @kinalmehta made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/222
    • @ALPH2H made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/282
    • @WillDudley made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/302
    • @sdpkjc made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/299
    • @masud99r made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/316

    Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.6.0...v1.0.0

  • v1.0.0b2 (Oct 3, 2022)

    🎉 I am thrilled to announce the v1.0.0b2 CleanRL Beta Release. This new release comes with exciting new features. First, we now support JAX-based learning algorithms, which are usually faster than their torch equivalents! Here are the docs of the new JAX-based DQN, TD3, and DDPG implementations:

    Also, we now have preliminary support for hyperparameter tuning via optuna (see docs), which is designed to help researchers find a single set of hyperparameters that works well for a class of games. The current API looks like this:

    import optuna
    from cleanrl_utils.tuner import Tuner
    tuner = Tuner(
        script="cleanrl/ppo.py",
        metric="charts/episodic_return",
        metric_last_n_average_window=50,
        direction="maximize",
        aggregation_type="average",
        target_scores={
            "CartPole-v1": [0, 500],
            "Acrobot-v1": [-500, 0],
        },
        params_fn=lambda trial: {
            "learning-rate": trial.suggest_loguniform("learning-rate", 0.0003, 0.003),
            "num-minibatches": trial.suggest_categorical("num-minibatches", [1, 2, 4]),
            "update-epochs": trial.suggest_categorical("update-epochs", [1, 2, 4, 8]),
            "num-steps": trial.suggest_categorical("num-steps", [5, 16, 32, 64, 128]),
            "vf-coef": trial.suggest_uniform("vf-coef", 0, 5),
            "max-grad-norm": trial.suggest_uniform("max-grad-norm", 0, 5),
            "total-timesteps": 100000,
            "num-envs": 16,
        },
        pruner=optuna.pruners.MedianPruner(n_startup_trials=5),
        sampler=optuna.samplers.TPESampler(),
    )
    tuner.tune(
        num_trials=100,
        num_seeds=3,
    )
    

    Besides, we added support for new algorithms and environments.

    I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc @kinalmehta for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @cool-RR, @Howuhh, @jseppanen, @joaogui1, @kinalmehta, and @ALPH2H.

    New CleanRL Supported Publications

    Jiayi Weng, Min Lin, Shengyi Huang, Bo Liu, Denys Makoviichuk, Viktor Makoviychuk, Zichen Liu, Yufan Song, Ting Luo, Yukun Jiang, Zhongwen Xu, & Shuicheng YAN (2022). EnvPool: A Highly Parallel Reinforcement Learning Environment Execution Engine. In Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track. https://openreview.net/forum?id=BubxnHpuMbG

    New Features PR

    • prototype jax with ddpg by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/187
    • Isaac Gym Envs PPO updates by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/233
    • JAX TD3 prototype by @joaogui1 in https://github.com/vwxyzjn/cleanrl/pull/225
    • prototype jax with dqn by @kinalmehta in https://github.com/vwxyzjn/cleanrl/pull/222
    • Poetry 1.2 by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/271
    • Add rnd_ppo.py documentation and refactor by @yooceii in https://github.com/vwxyzjn/cleanrl/pull/151
    • Hyperparameter optimization by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/228
    • Update the hyperparameter optimization example script by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/268

    Bug Fixes PR

    • Td3 ddpg action bound fix by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/211
    • added gamma to reward normalization wrappers by @Howuhh in https://github.com/vwxyzjn/cleanrl/pull/209
    • Seed envpool environment explicitly by @jseppanen in https://github.com/vwxyzjn/cleanrl/pull/238
    • Fix PPO + Isaac Gym Benchmark Script by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/243
    • Fix for noise sampling for the TD3 exploration by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/260

    Documentation PR

    • Add a note on PPG's performance by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/199
    • Clarify CleanRL is a non-modular library by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/200
    • Fix documentation link by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/213
    • JAX + DDPG docs fix by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/229
    • Fix links in docs for ppo_continuous_action_isaacgym.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/242
    • Fix docs (badge, TD3 + JAX, and DQN + JAX) by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/246
    • Fix typos by @ALPH2H in https://github.com/vwxyzjn/cleanrl/pull/282
    • Fix docs links in README.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/254
    • chore: remove unused parameters in jax implementations by @kinalmehta in https://github.com/vwxyzjn/cleanrl/pull/264

    Misc PR

    • Show correct exception cause by @cool-RR in https://github.com/vwxyzjn/cleanrl/pull/205
    • Remove pettingzoo's pistonball example by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/214
    • Leverage CI to speed up poetry lock by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/235
    • Ubuntu runner for poetry lock by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/236
    • Remove the github pages CI in favor of vercel by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/241
    • Clarify LICENSE info by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/253
    • Update published paper citation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/284
    • Refactor dqn word choice by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/257

    New Contributors

    • @cool-RR made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/205
    • @Howuhh made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/209
    • @jseppanen made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/238
    • @joaogui1 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/225
    • @kinalmehta made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/222
    • @ALPH2H made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/282

    Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v1.0.0b1...v1.0.0b2

  • v1.0.0b1(Jun 7, 2022)

    🎉 I am thrilled to announce the v1.0.0b1 CleanRL Beta Release. CleanRL has come a long way in making high-quality deep reinforcement learning implementations easy to understand. In this release, we have put a huge effort into revamping our documentation site, making our implementations friendlier for new users.

    I would like to cordially thank the core dev members @dosssman @yooceii @Dipamc77 @bragajj for their efforts in helping maintain the CleanRL repository. I would also like to give a shout-out to our new contributors @ElliotMunro200 and @Dipamc77.

    New CleanRL supported publications

    New algorithm variants

    • Match PPG implementation by @Dipamc77 in https://github.com/vwxyzjn/cleanrl/pull/186
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppg/
    • Proper multi-gpu support with PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/178
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_atari_multigpupy
    • Support Pettingzoo Multi-agent Atari envs with PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/188
      • See the documentation here: https://docs.cleanrl.dev/rl-algorithms/ppo/#ppo_pettingzoo_ma_ataripy

    Refactoring changes

    • Let ppo_continuous_action.py only run 1M steps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/161
    • Change ppo.py's default timesteps by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/164
    • Enable video recording for ppo_procgen.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/166
    • Refactor replay based scripts by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/173

    Documentation changes

    A significant amount of documentation was changed (tracked by https://github.com/vwxyzjn/cleanrl/issues/121).

    See the overview documentation page here: https://docs.cleanrl.dev/rl-algorithms/overview/

    • Add ddpg_continuous_action.py docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/137
    • Fix DDPG docs' description by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/139
    • Fix typo in DDPG docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/140
    • Fix incorrect links in the DDPG docs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/142
    • DDPG documentation tweaks; added Q loss equations and light explanation by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/145
    • Add dqn_atari.py documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/124
    • Add documentation for td3_continuous_action.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/141
    • SAC Documentation - Benchmarks - Minor code tweaks by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/146
    • Add docs for c51.py and c51_atari.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/159
    • Add docs for dqn.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/157
    • Address stale documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/169
    • Documentation improvement - fix links and mkdocs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/181
    • Improve documentation and contribution guide by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/189
    • Fix documentation links in README.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/192
    • Fix the implemented variants section in PPO by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/193

    Miscellaneous changes

    • Add Pull Request template by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/122
    • Amend license to give proper attribution by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/152
    • Introduce better contribution guide by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/154
    • Fix the default wandb project name in ppo_atari_envpool.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/160
    • Removes unmaintained scripts by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/170
    • Add PPO documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/163
    • Add docs header by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/174
    • Update README.md by @ElliotMunro200 in https://github.com/vwxyzjn/cleanrl/pull/177
    • Update issue_template.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/180
    • Temporarily Remove PPO-RND by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/190

    Utility changes

    • Export requirements.txt automatically by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/143
    • Auto-upgrade syntax via pyupgrade by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/158
    • Introduce benchmark utilities by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/165

    New Contributors

    • @ElliotMunro200 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/177
    • @Dipamc77 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/186

    Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.6.0...v1.0.0b1

  • v0.6.0(Mar 16, 2022)

    What's Changed

    • Update paper citation entry by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/91
    • Clean up stale files by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/95
    • Refactor formats in parse_args by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/78
    • Add Gitpod support by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/94
    • Reorganize README.md by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/93
    • Downgrade setuptools by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/98
    • Fix readme links by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/104
    • Refactor value based methods by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/102
    • Introduce pre-commit pipelines by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/107
    • Refactor PPG and PPO for procgen by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/108
    • Update documentation on PPG and PPO Procgen by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/112
    • Add PPO Atari LSTM example by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/83
    • Prototype Envpool Support by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/100
    • Fix replay buffer compatibility with mujoco envs by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/113
    • Add the isort and black badges by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/119
    • Refactor parse_args() by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/118
    • Add ppo.py documentation by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/120
    • Replace episode_reward with episodic_return by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/125
    • Refactor ppo_pettingzoo.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/128
    • Update gym to 0.23.0 by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/129
    • Add SPS and q-values metrics for value-based methods by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/126
    • Make seed work again in value methods by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/134
    • Remove offline DQN scripts by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/135
    • Deprecate apex_dqn_atari.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/136
    • Update to gym==0.23.1 by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/138

    Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.5.0...v0.6.0

  • v0.5.0(Nov 12, 2021)

    What's Changed

    • Use Poetry as the package manager by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/50
    • Remove links to deleted code on README algorithms by @FelipeMartins96 in https://github.com/vwxyzjn/cleanrl/pull/54
    • Add paper plotting utilities by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/55
    • Reorganization of files. by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/56
    • Bump Gym's version to 0.21.0 by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/61
    • Automatically Download Atari Roms by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/62
    • Make Spyder Editor Optional by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/66
    • Support Python 3.7.1+ by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/67
    • ddpg_continuous: Added env argument to actor and target actor by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/69
    • Add pytest as an optional dependency by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/71
    • Remove SB3 dependency in ppo_continuous_action.py by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/72
    • Add e2e tests by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/70
    • Fix #74 SAC consistency in logging and training to match other scripts by @dosssman in https://github.com/vwxyzjn/cleanrl/pull/75
    • Add MuJoCo environments support. by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/76
    • Only run tests given changes to the cleanrl directory by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/77
    • Prototype Documentation Site by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/64
    • Cloud Utilities Improvement by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/65
    • Import built docker image to local registry by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/80
    • Remove docker dummy cache by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/81
    • Allow buildx to save to local and push by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/82
    • Roll back PyTorch version for better compatibility by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/84
    • Cloud utilities refactor by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/85
    • Prepare for 0.5.0 release by @vwxyzjn in https://github.com/vwxyzjn/cleanrl/pull/88

    New Contributors

    • @FelipeMartins96 made their first contribution in https://github.com/vwxyzjn/cleanrl/pull/54

    Full Changelog: https://github.com/vwxyzjn/cleanrl/compare/v0.4.8...v0.5.0

  • 0.4.0(Sep 24, 2020)

    What's new in the 0.4.0 release

    • Added a contribution guide at https://github.com/vwxyzjn/cleanrl/blob/master/CONTRIBUTING.md. We welcome contributions of new algorithms and new games to the Open RL Benchmark (http://benchmark.cleanrl.dev/)
    • Added tables of benchmark results with standard deviations, generated by https://github.com/vwxyzjn/cleanrl/blob/master/benchmark/plots.py

    Atari Results

    | gym_id                      | apex_dqn_atari_visual | c51_atari_visual | dqn_atari_visual | ppo_atari_visual  |
    |:----------------------------|:----------------------|:-----------------|:-----------------|:------------------|
    | BeamRiderNoFrameskip-v4     | 2936.93 ± 362.18      | 13380.67 ± 0.00  | 7139.11 ± 479.11 | 2053.08 ± 83.37   |
    | QbertNoFrameskip-v4         | 3565.00 ± 690.00      | 16286.11 ± 0.00  | 11586.11 ± 0.00  | 17919.44 ± 383.33 |
    | SpaceInvadersNoFrameskip-v4 | 1019.17 ± 356.94      | 1099.72 ± 14.72  | 935.40 ± 93.17   | 1089.44 ± 67.22   |
    | PongNoFrameskip-v4          | 19.06 ± 0.83          | 18.00 ± 0.00     | 19.78 ± 0.22     | 20.72 ± 0.28      |
    | BreakoutNoFrameskip-v4      | 364.97 ± 58.36        | 386.10 ± 21.77   | 353.39 ± 30.61   | 380.67 ± 35.29    |

    Mujoco Results

    | gym_id              | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
    |:--------------------|:-----------------------|:----------------------|:----------------------|
    | Reacher-v2          | -6.25 ± 0.54           | -6.65 ± 0.04          | -7.86 ± 1.47          |
    | Pusher-v2           | -44.84 ± 5.54          | -59.69 ± 3.84         | -44.10 ± 6.49         |
    | Thrower-v2          | -137.18 ± 47.98        | -80.75 ± 12.92        | -58.76 ± 1.42         |
    | Striker-v2          | -193.43 ± 27.22        | -269.63 ± 22.14       | -112.03 ± 9.43        |
    | InvertedPendulum-v2 | 1000.00 ± 0.00         | 443.33 ± 249.78       | 968.33 ± 31.67        |
    | HalfCheetah-v2      | 10386.46 ± 265.09      | 9265.25 ± 1290.73     | 1717.42 ± 20.25       |
    | Hopper-v2           | 1128.75 ± 9.61         | 3095.89 ± 590.92      | 2276.30 ± 418.94      |
    | Swimmer-v2          | 114.93 ± 29.09         | 103.89 ± 30.72        | 111.74 ± 7.06         |
    | Walker2d-v2         | 1946.23 ± 223.65       | 3059.69 ± 1014.05     | 3142.06 ± 1041.17     |
    | Ant-v2              | 243.25 ± 129.70        | 5586.91 ± 476.27      | 2785.98 ± 1265.03     |
    | Humanoid-v2         | 877.90 ± 3.46          | 6342.99 ± 247.26      | 786.83 ± 95.66        |

    Pybullet Results

    | gym_id                             | ddpg_continuous_action | td3_continuous_action | ppo_continuous_action |
    |:-----------------------------------|:-----------------------|:----------------------|:----------------------|
    | MinitaurBulletEnv-v0               | -0.17 ± 0.02           | 7.73 ± 5.13           | 23.20 ± 2.23          |
    | MinitaurBulletDuckEnv-v0           | -0.31 ± 0.03           | 0.88 ± 0.34           | 11.09 ± 1.50          |
    | InvertedPendulumBulletEnv-v0       | 742.22 ± 47.33         | 1000.00 ± 0.00        | 1000.00 ± 0.00        |
    | InvertedDoublePendulumBulletEnv-v0 | 5847.31 ± 843.53       | 5085.57 ± 4272.17     | 6970.72 ± 2386.46     |
    | Walker2DBulletEnv-v0               | 567.61 ± 15.01         | 2177.57 ± 65.49       | 1377.68 ± 51.96       |
    | HalfCheetahBulletEnv-v0            | 2847.63 ± 212.31       | 2537.34 ± 347.20      | 2347.64 ± 51.56       |
    | AntBulletEnv-v0                    | 2094.62 ± 952.21       | 3253.93 ± 106.96      | 1775.50 ± 50.19       |
    | HopperBulletEnv-v0                 | 1262.70 ± 424.95       | 2271.89 ± 24.26       | 2311.20 ± 45.28       |
    | HumanoidBulletEnv-v0               | -54.45 ± 13.99         | 937.37 ± 161.05       | 204.47 ± 1.00         |
    | BipedalWalker-v3                   | 66.01 ± 127.82         | 78.91 ± 232.51        | 272.08 ± 10.29        |
    | LunarLanderContinuous-v2           | 162.96 ± 65.60         | 281.88 ± 0.91         | 215.27 ± 10.17        |
    | Pendulum-v0                        | -238.65 ± 14.13        | -345.29 ± 47.40       | -1255.62 ± 28.37      |
    | MountainCarContinuous-v0           | -1.01 ± 0.01           | -1.12 ± 0.12          | 93.89 ± 0.06          |

    Other Results

    | gym_id         | ppo            | dqn             |
    |:---------------|:---------------|:----------------|
    | CartPole-v1    | 500.00 ± 0.00  | 182.93 ± 47.82  |
    | Acrobot-v1     | -80.10 ± 6.77  | -81.50 ± 4.72   |
    | MountainCar-v0 | -200.00 ± 0.00 | -142.56 ± 15.89 |
    | LunarLander-v2 | 46.18 ± 53.04  | 144.52 ± 1.75   |

    • Added experimental support for Apex-DQN, which is significantly faster than DQN. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/apex_dqn_atari_visual.py. In Breakout, Apex-DQN takes less than 4 hours to reach an episodic return of around 360, whereas DQN took 25 hours to reach the same level.
      • Our implementation differs slightly from the original. The PyTorch ecosystem lacks a well-maintained distributed prioritized experience replay buffer such as https://github.com/deepmind/reverb, so instead we split a single prioritized replay buffer of size 100,000 into two prioritized replay buffers of size 50,000, each owned by a data-processor sub-process that prepares data for the learner. This is a workaround, but in our benchmark it performs well and is fast enough; a simplified sketch of the split-buffer pattern is shown below.
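
    The following is a hypothetical, simplified sketch of that split-buffer pattern, not the actual apex_dqn_atari_visual.py: it uses uniform instead of prioritized sampling for brevity, and the data_processor helper and queue layout are illustrative assumptions. Each of the two sub-processes owns its own 50,000-transition buffer and pre-samples batches for the learner through a shared queue.

```python
# Hypothetical sketch of the split-buffer pattern (uniform sampling for brevity;
# the real script uses prioritized replay).
import multiprocessing as mp
import random
import time
from collections import deque

import numpy as np


def data_processor(transition_queue, batch_queue, buffer_size=50_000, batch_size=32):
    """Owns half of the replay data and keeps the learner fed with pre-sampled batches."""
    buffer = deque(maxlen=buffer_size)
    while True:
        # Drain new transitions produced by the actor(s).
        while not transition_queue.empty():
            buffer.append(transition_queue.get())
        # Pre-sample a batch for the learner once enough data is available.
        if len(buffer) >= batch_size and not batch_queue.full():
            batch = random.sample(buffer, batch_size)
            obs, actions, rewards, next_obs, dones = map(np.array, zip(*batch))
            batch_queue.put((obs, actions, rewards, next_obs, dones))
        time.sleep(0.001)


if __name__ == "__main__":
    # Two data processors, each with its own 50k buffer, feed one shared batch queue.
    transition_queues = [mp.Queue() for _ in range(2)]
    batch_queue = mp.Queue(maxsize=8)
    workers = [
        mp.Process(target=data_processor, args=(q, batch_queue), daemon=True)
        for q in transition_queues
    ]
    for w in workers:
        w.start()
    # Actors would round-robin transitions into transition_queues, and the learner
    # would consume pre-sampled batches from batch_queue.
```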

    Benchmarked learning curves (Atari): metrics, logs, and recorded videos are at cleanrl.benchmark/reports/Atari.

    • Added PPO support for CarRacing-v0 in the Experimental Domains. It is our first example with a pixel observation space and a continuous action space. See https://github.com/vwxyzjn/cleanrl/blob/master/cleanrl/experiments/ppo_car_racing.py.
      • During our experiments, we found that observation and reward normalization seem to have a huge impact on PPO's performance, probably due to the large range of rewards in CarRacing-v0 (e.g. the agent receives a -100 reward when it dies, and PPO is anecdotally sensitive to rewards of this magnitude); see the normalization sketch below.
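
    A minimal sketch of the kind of pre-processing referred to above, assuming a running-mean/std observation normalizer and reward clipping written as custom gym wrappers; these class names are illustrative and not necessarily what ppo_car_racing.py uses.

```python
# Illustrative observation normalization + reward clipping for CarRacing-v0.
import gym
import numpy as np


class RunningMeanStd:
    """Tracks a running mean and variance with parallel (Welford-style) updates."""

    def __init__(self, shape):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = 1e-4

    def update(self, x):
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean = self.mean + delta * batch_count / total
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total


class NormalizeObservation(gym.ObservationWrapper):
    """Normalizes pixel observations with running statistics."""

    def __init__(self, env):
        super().__init__(env)
        self.rms = RunningMeanStd(env.observation_space.shape)

    def observation(self, obs):
        self.rms.update(obs[None].astype(np.float64))
        return (obs - self.rms.mean) / np.sqrt(self.rms.var + 1e-8)


class ClipReward(gym.RewardWrapper):
    """Clips CarRacing-v0's occasional large negative rewards into [-1, 1]."""

    def reward(self, reward):
        return float(np.clip(reward, -1.0, 1.0))


# env = ClipReward(NormalizeObservation(gym.make("CarRacing-v0")))
```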

  • 0.3.0(Aug 1, 2020)

    See https://streamable.com/cq8e62 for a demo

    A significant amount of effort was put into making Open RL Benchmark (http://benchmark.cleanrl.dev/). It provides benchmarks of popular Deep Reinforcement Learning algorithms in 34+ games with an unprecedented level of transparency, openness, and reproducibility.

    In addition, the legacy common.py is deprecated in favor of single-file implementations.

  • 0.2.1(Jan 9, 2020)

    We've made the SAC algorithm work for both continuous and discrete action spaces, with primary references from the following papers:

    • https://arxiv.org/abs/1801.01290
    • https://arxiv.org/abs/1812.05905
    • https://arxiv.org/abs/1910.07207

    My personal thanks to everyone who participated in the monthly dev cycle and, in particular, to @dosssman, who implemented SAC with discrete action spaces.
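
    As a reference point, here is a minimal sketch (not the actual CleanRL script) of the discrete-action SAC policy objective from arXiv:1910.07207: with a categorical policy, the expectation over actions can be computed exactly instead of being estimated from samples. The function name and tensor layout are illustrative assumptions.

```python
# Discrete-action SAC policy loss: closed-form expectation over actions.
import torch
import torch.nn.functional as F


def discrete_sac_policy_loss(logits, q1, q2, alpha):
    """logits, q1, q2: tensors of shape (batch_size, num_actions); alpha: entropy temperature."""
    probs = F.softmax(logits, dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    min_q = torch.min(q1, q2)
    # E_{a ~ pi(.|s)}[ alpha * log pi(a|s) - min_i Q_i(s, a) ], summed over actions.
    return (probs * (alpha * log_probs - min_q)).sum(dim=-1).mean()


# Usage with hypothetical networks:
# loss = discrete_sac_policy_loss(actor(obs), q1_net(obs), q2_net(obs), alpha)
```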

    Additional improvements include support for gym.wrappers.Monitor to automatically record the agent's performance at certain episodes (by default episodes 1, 2, 9, 28, 65, ..., 1000, 2000, 3000) and integrate it with wandb (#4), and reuse of the same replay buffer from minimalRL for both DQN and SAC (#5); a minimal recording sketch follows the wandb link below.

    https://app.wandb.ai/cleanrl/cleanrl.benchmark
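
    Below is a minimal recording sketch, assuming an older gym version (pre-0.20) where gym.wrappers.Monitor is still available; later gym releases replaced it with gym.wrappers.RecordVideo. The wandb call in the trailing comment is only an example of how a recorded file could be logged.

```python
# Record gameplay videos under Monitor's default capped-cubic schedule:
# episodes 1, 2, 9, 28, 65, ... and then every 1000th episode.
import gym

env = gym.make("CartPole-v0")
env = gym.wrappers.Monitor(env, "videos", force=True)

obs = env.reset()
for _ in range(5000):
    obs, reward, done, info = env.step(env.action_space.sample())
    if done:
        obs = env.reset()
env.close()

# The .mp4 files written to `videos/` can then be uploaded to wandb, e.g.
# wandb.log({"gameplay": wandb.Video("videos/<recording>.mp4")}) after wandb.init().
```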

  • V0.1(Oct 7, 2019)

    This is the initial release 🙌🙌

    We are working on more algorithms and bug fixes for the 1.0 release :) Comments and PRs are more than welcome.
