An offline deep reinforcement learning library

Takuma Seno

Last update: Jan 2, 2023

Overview

d3rlpy: An offline deep reinforcement learning library

d3rlpy is an offline deep reinforcement learning library for practitioners and researchers.

import d3rlpy

dataset, env = d3rlpy.datasets.get_dataset("hopper-medium-v0")

# prepare algorithm
sac = d3rlpy.algos.SAC()

# train offline
sac.fit(dataset, n_steps=1000000)

# train online
sac.fit_online(env, n_steps=1000000)

# ready to control
actions = sac.predict(x)

Documentation: https://d3rlpy.readthedocs.io
Paper: https://arxiv.org/abs/2111.03788

key features

⚡ Most Practical RL Library Ever

offline RL: d3rlpy supports state-of-the-art offline RL algorithms. Offline RL is extremely powerful when online interaction is not feasible during training (e.g. robotics, medical applications).
online RL: d3rlpy also supports conventional state-of-the-art online training algorithms without compromise, which means that you can solve any kind of RL problem with d3rlpy alone.
advanced engineering: d3rlpy is designed for fast and efficient training. For example, you can train on Atari environments with 4x less memory while running as fast as the fastest RL library.

🔰 Easy-To-Use API

zero-knowledge of DL library: d3rlpy provides many state-of-the-art algorithms through intuitive APIs. You can become an RL engineer even without knowing how to use deep learning libraries.
scikit-learn compatibility: d3rlpy is not only easy to use, but also completely compatible with the scikit-learn API, which means that you can maximize your productivity with scikit-learn's useful utilities.

🚀 Beyond State-Of-The-Art

distributional Q function: d3rlpy is the first library that supports distributional Q functions in all algorithms. The distributional Q function is known as a very powerful method for achieving state-of-the-art performance.
many tweak options: d3rlpy is also the first to support N-step TD backup and ensemble value functions in all algorithms, which can take you to places no one has reached yet.
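As a rough illustration of the two options named above, the following sketch computes an N-step TD target and a conservative reduction over an ensemble of Q-value estimates. It is plain Python, not d3rlpy code: the function names are hypothetical, and taking the minimum over the ensemble is only one common choice, assumed here for illustration.

```python
from typing import Sequence


def n_step_td_target(
    rewards: Sequence[float],
    bootstrap_value: float,
    gamma: float = 0.99,
) -> float:
    """Return r_0 + gamma * r_1 + ... + gamma^(n-1) * r_(n-1) + gamma^n * V."""
    target = bootstrap_value
    # Fold from the last reward backwards so each step applies one discount.
    for reward in reversed(rewards):
        target = reward + gamma * target
    return target


def ensemble_min(q_values: Sequence[float]) -> float:
    """Conservative ensemble reduction (assumed): keep the smallest estimate."""
    return min(q_values)


# Two-step backup with gamma = 0.5 and a bootstrap value of 10.
two_step_target = n_step_td_target([1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

With a single reward the function reduces to the usual one-step target, so the N-step variant is a strict generalization.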

installation

d3rlpy supports Linux, macOS and Windows.

PyPI (recommended)

$ pip install d3rlpy

Anaconda

$ conda install -c conda-forge d3rlpy

Docker

$ docker run -it --gpus all --name d3rlpy takuseno/d3rlpy:latest bash

supported algorithms

algorithm	discrete control	continuous control	offline RL?
Behavior Cloning (supervised learning)	✅	✅
Deep Q-Network (DQN)	✅	⛔
Double DQN	✅	⛔
Deep Deterministic Policy Gradients (DDPG)	⛔	✅
Twin Delayed Deep Deterministic Policy Gradients (TD3)	⛔	✅
Soft Actor-Critic (SAC)	✅	✅
Batch Constrained Q-learning (BCQ)	✅	✅	✅
Bootstrapping Error Accumulation Reduction (BEAR)	⛔	✅	✅
Advantage-Weighted Regression (AWR)	✅	✅	✅
Conservative Q-Learning (CQL)	✅	✅	✅
Advantage Weighted Actor-Critic (AWAC)	⛔	✅	✅
Critic Regularized Regression (CRR)	⛔	✅	✅
Policy in Latent Action Space (PLAS)	⛔	✅	✅
TD3+BC	⛔	✅	✅

supported Q functions

other features

Basically, all features are available with every algorithm.

evaluation metrics in a scikit-learn scorer function style
export greedy-policy as TorchScript or ONNX
parallel cross validation with multiple GPUs
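To make "scikit-learn scorer function style" concrete: a scikit-learn scorer is a callable that takes a fitted estimator plus data and returns a single number. The sketch below assumes the analogous shape scorer(algo, episodes) -> float. StubAlgo and the tuple-based episodes are hypothetical stand-ins so the example runs without d3rlpy installed.

```python
from typing import List, Sequence, Tuple

# One step is (observation, action); one episode is a list of steps.
Step = Tuple[float, float]
Episode = List[Step]


class StubAlgo:
    """Hypothetical stand-in for a trained algorithm (not a d3rlpy class)."""

    def predict_value(self, observation: float, action: float) -> float:
        # Toy value estimate so the example runs without any model.
        return observation + action


def average_value_scorer(algo: StubAlgo, episodes: Sequence[Episode]) -> float:
    """Scorer-style metric: mean predicted value over all steps."""
    values = [
        algo.predict_value(observation, action)
        for episode in episodes
        for observation, action in episode
    ]
    if not values:
        raise ValueError("episodes must contain at least one step")
    return sum(values) / len(values)


episodes = [[(1.0, 1.0), (2.0, 0.0)], [(0.0, 3.0)]]
score = average_value_scorer(StubAlgo(), episodes)
```

Because a scorer of this shape is just a function, several of them can be collected in a dictionary and evaluated side by side, which is how the scorers argument is used in the examples further down.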

experimental features

Model-based Algorithms
- Model-based Offline Policy Optimization (MOPO)
- Conservative Offline Model-Based Policy Optimization (COMBO)
Q-functions
- Fully parametrized Quantile Function (experimental)

benchmark results

d3rlpy is benchmarked to ensure the implementation quality. The benchmark scripts are available in the reproductions directory. The benchmark results are available in the d3rlpy-benchmarks repository.

examples

MuJoCo

import d3rlpy

# prepare dataset
dataset, env = d3rlpy.datasets.get_d4rl('hopper-medium-v0')

# prepare algorithm
cql = d3rlpy.algos.CQL(use_gpu=True)

# train
cql.fit(dataset,
        eval_episodes=dataset,
        n_epochs=100,
        scorers={
            'environment': d3rlpy.metrics.evaluate_on_environment(env),
            'td_error': d3rlpy.metrics.td_error_scorer
        })

See more datasets at d4rl.

Atari 2600

import d3rlpy
from sklearn.model_selection import train_test_split

# prepare dataset
dataset, env = d3rlpy.datasets.get_atari('breakout-expert-v0')

# split dataset
train_episodes, test_episodes = train_test_split(dataset, test_size=0.1)

# prepare algorithm
cql = d3rlpy.algos.DiscreteCQL(n_frames=4, q_func_factory='qr', scaler='pixel', use_gpu=True)

# start training
cql.fit(train_episodes,
        eval_episodes=test_episodes,
        n_epochs=100,
        scorers={
            'environment': d3rlpy.metrics.evaluate_on_environment(env),
            'td_error': d3rlpy.metrics.td_error_scorer
        })

See more Atari datasets at d4rl-atari.

PyBullet

import d3rlpy

# prepare dataset
dataset, env = d3rlpy.datasets.get_pybullet('hopper-bullet-mixed-v0')

# prepare algorithm
cql = d3rlpy.algos.CQL(use_gpu=True)

# start training
cql.fit(dataset,
        eval_episodes=dataset,
        n_epochs=100,
        scorers={
            'environment': d3rlpy.metrics.evaluate_on_environment(env),
            'td_error': d3rlpy.metrics.td_error_scorer
        })

See more PyBullet datasets at d4rl-pybullet.

Online Training

import d3rlpy
import gym

# prepare environment
env = gym.make('HopperBulletEnv-v0')
eval_env = gym.make('HopperBulletEnv-v0')

# prepare algorithm
sac = d3rlpy.algos.SAC(use_gpu=True)

# prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=1000000, env=env)

# start training
sac.fit_online(env, buffer, n_steps=1000000, eval_env=eval_env)

tutorials

Try a cartpole example on Google Colaboratory!

offline RL tutorial:
online RL tutorial:

contributions

Any kind of contribution to d3rlpy would be highly appreciated! Please check the contribution guide.

The release planning can be checked at milestones.

community

Channel	Link
Chat	Gitter
Issues	GitHub Issues

family projects

Project	Description
d4rl-pybullet	Offline RL datasets of PyBullet tasks
d4rl-atari	A d4rl-style library of Google's Atari 2600 datasets
MINERVA	An out-of-the-box GUI tool for offline RL

roadmap

The roadmap for future releases is available in ROADMAP.md.

citation

The paper is available here.

@InProceedings{seno2021d3rlpy,
  author = {Takuma Seno and Michita Imai},
  title = {d3rlpy: An Offline Deep Reinforcement Library},
  booktitle = {NeurIPS 2021 Offline Reinforcement Learning Workshop},
  month = {December},
  year = {2021}
}

acknowledgement

This work is supported by Information-technology Promotion Agency, Japan (IPA), Exploratory IT Human Resources Project (MITOU Program) in the fiscal year 2020.

Comments

Problem with loading trained model

I am trying to load a trained model with CQL.load_model(..full model [path). I first got "fname is missing", so I tried fname=..full_model_path. I then got "self is missing", so I added self. It still doesn't load the model: no attribute 'impl' ...
bug

opened by hn2 21
Question regarding plotting Cumulative Reward graph on Tensorboard

I really enjoyed working with this repo. Thank you very much for your great work! I was just wondering how to have the cumulative reward plots on Tensorboard for deep Q network algorithm.

Thank you again!
enhancement

opened by ajam74001 14
[BUG] gaussian likelihood computation

======== dynamics.py ===========

def _gaussian_likelihood(
    x: torch.Tensor, mu: torch.Tensor, logstd: torch.Tensor
) -> torch.Tensor:
    inv_std = torch.exp(-logstd)
    return (((mu - x) ** 2) * inv_std).mean(dim=1, keepdim=True)

======= I think It should be... =============

def _gaussian_likelihood(
    x: torch.Tensor, mu: torch.Tensor, logstd: torch.Tensor
) -> torch.Tensor:
    inv_std = torch.exp(-logstd)
    return 0.5 * (((mu - x) ** 2) * (inv_std ** 2)).sum(dim=1, keepdim=True)

bug

opened by tominku 14
d4rlpy MDPDataset

Hi @takuseno, firstly thanks a lot for such a high quality repo for offline RL. I have a question about the method get_d4rl(): why are the rewards all shifted by one step?

while cursor < dataset_size:
    # collect data for step=t
    observation = dataset["observations"][cursor]
    action = dataset["actions"][cursor]
    if episode_step == 0:
        reward = 0.0
    else:
        reward = dataset["rewards"][cursor - 1]

Looking forward to your feedback.

opened by cclvr 14
[BUG] Final observation not stored
Hello,

Describe the bug it seems that the final observation is not stored in the Episode object.

Looking at the code, if an episode is only one step long, the Episode object should store:

initial observation

action, reward

final observation

But it seems that the observations array has the same length as the actions or rewards one which probably means that the final observation is not stored.

Note: this would probably require some changes later on in the code as no action is taken after the final observation.

Additional context The way it is handled in SB3 for instance is to have a separate array that store the next observation. A special treatment is also needed when using multiple envs at the same time that may reset automatically.

See https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/off_policy_algorithm.py#L488 and https://github.com/DLR-RM/stable-baselines3/blob/503425932f5dc59880f854c4f0db3255a3aa8c1e/stable_baselines3/common/buffers.py#L267 (when using only one array)

cc @megan-klaiber
bug
opened by araffin 12
ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group

I get this error when loading a trained model. What does it mean?

ValueError: loaded state dict contains a parameter group that doesn't match the size of optimizer's group
bug

opened by hn2 11
[REQUEST] Save model less frequently than metrics

Hello, when running fit_online I'd like to be able to save the metrics regularly (eg, once every episode, which is 200 timesteps for the pendulum environment) without having to save the model .pt files at the same high frequency (because the model files are quite large).

Put another way, I'd like to be able to write data to the evaluation.csv file without having to write a model_?????.pt file every time.

I can't see how this is possible in the current code. If it's not possible, I'd like to request it as a feature. Thanks!
enhancement

opened by pstansell 11
How to switch batch size during training?

@takuseno , firstly thanks a lot for your clear and complete code base for offline RL. Recently I have been trying to build new algorithms on top of this code base, and I want to switch the batch size during the training process, but I don't know how to do it with the smallest changes. Could you give me some clues? Looking forward to your reply.

opened by cclvr 10
[REQUEST] Run time benchmarks,

Hello dear @takuseno, Thank you very much for sharing this amazing library. I am training CQL and DQN models for breakout Atari on V100 GPU. However, the training is so slow (it takes a day to run 50 episodes). I was wondering if you have a benchmark for run times?
enhancement

opened by ajam74001 9

NaN in Predictions while online finetune

Hi @takuseno , First of all thanks again for your awesome work, I was able to train my agent in a custom environment with your help and already increased the performance significantly! Nevertheless, I wanted to fine-tune the agent in an online environment. Unfortunately, this worked for only somewhere between 500 and 1000 steps (not fixed, seems arbitrary) until I get an AssertionError because NaN values are predicted. I get the following trace. Any idea where I could look into / fix this?

Exception has occurred: ValueError       (note: full exception trace is shown but execution is paused at: _run_module_as_main)
Expected parameter loc (Tensor of shape (1, 4)) of distribution Normal(loc: torch.Size([1, 4]), scale: torch.Size([1, 4])) to satisfy the constraint Real(), but found invalid values:
tensor([[nan, nan, nan, nan]])
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/torch/distributions/distribution.py", line 55, in __init__
    raise ValueError(
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/torch/distributions/normal.py", line 54, in __init__
    super(Normal, self).__init__(batch_shape, validate_args=validate_args)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/models/torch/distributions.py", line 99, in __init__
    self._dist = Normal(self._mean, self._std)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/models/torch/policies.py", line 175, in dist
    return SquashedGaussianDistribution(mu, clipped_logstd.exp())
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/models/torch/policies.py", line 189, in forward
    dist = self.dist(x)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/models/torch/policies.py", line 245, in best_action
    action = self.forward(x, deterministic=True, with_log_prob=False)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/algos/torch/ddpg_impl.py", line 195, in _predict_best_action
    return self._policy.best_action(x)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/algos/torch/base.py", line 58, in predict_best_action
    action = self._predict_best_action(x)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/torch_utility.py", line 295, in wrapper
    return f(self, *tensors, **kwargs)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/torch_utility.py", line 305, in wrapper
    return f(self, *args, **kwargs)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/algos/base.py", line 127, in predict
    return self._impl.predict_best_action(x)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/online/explorers.py", line 50, in sample
    greedy_actions = algo.predict(x)
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/online/iterators.py", line 212, in train_single_env
    action = explorer.sample(algo, x, total_step)[0]
  File "/home/user/ws/d3/.venv/lib/python3.10/site-packages/d3rlpy/algos/base.py", line 251, in fit_online
    train_single_env(
  File "/home/user/ws/d3/simulation/examples/tune_d3rlpy.py", line 78, in <module>
    cql.fit_online(env, buffer, explorer, n_steps=1000)
  File "/usr/lib/python3.10/runpy.py", line 86, in _run_code
    exec(code, run_globals)
  File "/usr/lib/python3.10/runpy.py", line 196, in _run_module_as_main (Current frame)
    return _run_code(code, main_globals, None,

I used following script to initiate fine-tuning:

cql = d3rlpy.algos.CQL(use_gpu=False, action_scaler=action_scaler, scaler=scaler)
cql.build_with_env(env)
cql.load_model("model_43596.pt")

buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)
explorer = d3rlpy.online.explorers.ConstantEpsilonGreedy(0.1)
cql.fit_online(env, buffer, explorer, n_steps=1000)

opened by lettersfromfelix 9

Create a Generator version of fit as fitter

This is just to start studying the change and discuss about it

This provides many benefits such as monitoring, live changes to algo params etc

This will also alleviate the need for doing complicated hierarchies of Callbacks mechanisms that are easier to solve with iterators and generators.

At least for me it is very useful to have direct access to metrics, to have direct access to the algo object to change and query things every epoch and adjust things interactively instead of a programmatic callback way.

opened by jamartinh 9

loss=nan

Hello, I'm trying to run offline RL where the state is formed by 75 or 100 variables (sampled from a bayesian network). The collected samples are in a data frame called "data", and I run the following.


observations_dwh=data[['disease','weight','heartattack']].to_numpy()

rewards = data['variable74']

m=len(actions)

terminals = np.repeat(1,m)

dataset_dwh = MDPDataset(observations_dwh, actions, rewards, terminals)

train_episodes_dwh, test_episodes_dwh = train_test_split(dataset_dwh)

q_func_dwh=d3rlpy.algos.DQN()

q_func_dwh.fit(train_episodes_dwh,
               test_episodes_dwh,
               scorers={
                   'advantage': discounted_sum_of_advantage_scorer,
                   'td_error': td_error_scorer,  # smaller is better
                   'value_scale': average_value_estimation_scorer
               })


And it runs quite good, except that the loss=nan from the first step,
any idea why?

Thanks.

bug

opened by MauricioGS99 0

NameNotFound: Environment BreakoutNoFrameskip doesn't exist

Hello,

I am running the example code on the welcome page of Github for Atari 2600 and Online Training. Both of the two pieces of code raise the error that the environment cannot be found. Please see below.

For Atari 2600, I just copy the code and paste in PyCharm on Windows 11.

import d3rlpy
from sklearn.model_selection import train_test_split

# prepare dataset
dataset, env = d3rlpy.datasets.get_atari('breakout-expert-v0')

# split dataset
train_episodes, test_episodes = train_test_split(dataset, test_size=0.1)

# prepare algorithm
cql = d3rlpy.algos.DiscreteCQL(
    n_frames=4,
    q_func_factory='qr',
    scaler='pixel',
    use_gpu=True,
)

# start training
cql.fit(
    train_episodes,
    eval_episodes=test_episodes,
    n_epochs=100,
    scorers={
        'environment': d3rlpy.metrics.evaluate_on_environment(env),
        'td_error': d3rlpy.metrics.td_error_scorer,
    },
)

And it says

Same for Online Training, I just copy and paste the code to PyCharm on Windows 11.

import d3rlpy
import gym

# prepare environment
env = gym.make('HopperBulletEnv-v0')
eval_env = gym.make('HopperBulletEnv-v0')

# prepare algorithm
sac = d3rlpy.algos.SAC(use_gpu=True)

# prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=1000000, env=env)

# start training
sac.fit_online(env, buffer, n_steps=1000000, eval_env=eval_env)

And it says

Thank you!

bug

opened by Zebin-Li 3

[REQUEST] Support Mildly Conservative Q-Learning (MCQ)

Hi

Thank you for providing excellent code. I am using CQL for offline reinforcement learning. CQL is very useful with its attention span, but we need to compensate for its weaknesses.

So I found the following paper, would it be a valuable additional implementation for this repository? https://arxiv.org/abs/2206.04745

Unfortunately I don't have the power to implement this, so I will add it here as an issue. Thank you.
enhancement

opened by bakud 0
[REQUEST] Enable observation dictionary input.

Is your feature request related to a problem? Please describe. Currently, your MDPDataset class asserts that the observation is an ndarray object. However, in the field of autonomous driving, the MDP observation cannot be represented by a simple ndarray object. Typically, the observation space is composed of a BEV image and a speed profile, which is not supported by your MDPDataset yet.

Describe the solution you'd like I believe it will make the repo stronger to enable observation dictionary storage and training like {"BEV": ndarray(C, W, H), "speed": (1,)} in the MDPDataset (as well as Episode and Transition class).
enhancement

opened by Emiyalzn 0
[BUG] Pytorch module hooks are not executed
Describe the bug I'm trying to debug some issues during online training (using fit_online) using pytorch hooks, but these hooks are not being executed. Looking at the code, policies are explicitly calling self.forward() like this. Directly calling self.forward() doesn't execute any hooks (see this post), so __call__() should be used instead. So self.forward() should be replaced with self().

To Reproduce

Register a hook with the policy module, e.g. algo._impl.policy.register_module_forward_pre_hook(hook)

Train with algo.fit_online(...)

Observe that the hook is never invoked

Expected behavior The registered hooks should be executed.

Additional context N/A.
bug
opened by abhaybd 0
TransitionMiniBatch object is NOT writable

To validate an idea, I want to modify rewards in a TransitionMiniBatch dynamically. However, it threw an exception: TransitionMiniBatch object is NOT writable. I checked the source code and found that TransitionMiniBatch is implemented in C. I wonder whether there is a way to modify a TransitionMiniBatch object. Thanks!
enhancement

opened by XiudingCai 1

Releases(v1.1.1)

v1.1.1(Jun 24, 2022)
Benchmark

The benchmark results of IQL and NFQ have been added to d3rlpy-benchmarks. In addition, results for more random seeds (up to 10) have been added for all algorithms, so the benchmark results are now more reliable.

Documentation

More descriptions have been added to Finetuning tutorial page.

Offline Policy Selection tutorial page has been added

Enhancements

cloudpickle and GPUUtil dependencies have been removed.

the Gaussian likelihood computation for MOPO is now mathematically more accurate (thanks @tominku )
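For reference, the standard per-dimension Gaussian negative log-likelihood, written with mean \mu and log standard deviation \ell = \log\sigma, is shown below. This is the textbook identity only; which of its terms the library keeps after the fix is not stated in these notes.

```latex
-\log \mathcal{N}(x \mid \mu, \sigma^2)
  = \frac{(x - \mu)^2}{2\sigma^2} + \log\sigma + \tfrac{1}{2}\log 2\pi
  = \tfrac{1}{2}\,(x - \mu)^2\, e^{-2\ell} + \ell + \tfrac{1}{2}\log 2\pi
```

The quadratic term is scaled by the inverse variance e^{-2\ell} and by a factor of 1/2, rather than by the inverse standard deviation e^{-\ell}.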

v1.1.0(Apr 27, 2022)
MDPDataset

The timestep alignment is now exactly the same as D4RL:

# observations = [o_1, o_2, ..., o_n]
observations = np.random.random((1000, 10))

# actions = [a_1, a_2, ..., a_n]
actions = np.random.random((1000, 10))

# rewards = [r(o_1, a_1), r(o_2, a_2), ...]
rewards = np.random.random(1000)

# terminals = [t(o_1, a_1), t(o_2, a_2), ...]
terminals = ...

where r(o, a) is the reward function and t(o, a) is the terminal function.

The reason for this change is that many users were confused by the difference between d3rlpy and D4RL. Now both are aligned in the same way. This change might break your dataset.
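Under this alignment, index t of every array describes the same step, so a terminal flag at index t closes the episode that contains step t. The sketch below shows how flat arrays could be cut into episodes on that assumption. It is plain Python for illustration; split_episodes is a hypothetical helper, not a d3rlpy function, and keeping a trailing unterminated episode is a choice made for this example.

```python
from typing import List, Sequence, Tuple

# (observation, action, reward) for one step; scalars keep the example small.
Transition = Tuple[float, float, float]


def split_episodes(
    observations: Sequence[float],
    actions: Sequence[float],
    rewards: Sequence[float],
    terminals: Sequence[int],
) -> List[List[Transition]]:
    """Cut flat, index-aligned arrays into episodes at each terminal flag."""
    if not (len(observations) == len(actions) == len(rewards) == len(terminals)):
        raise ValueError("all arrays must have the same length")
    episodes: List[List[Transition]] = []
    current: List[Transition] = []
    for obs, act, rew, done in zip(observations, actions, rewards, terminals):
        current.append((obs, act, rew))
        if done:
            # The flag at index t belongs to step t, so the episode ends here.
            episodes.append(current)
            current = []
    if current:
        # Keep a trailing episode that never reached a terminal flag.
        episodes.append(current)
    return episodes


example_episodes = split_episodes(
    observations=[0.0, 1.0, 2.0, 3.0, 4.0],
    actions=[0.1, 0.1, 0.1, 0.1, 0.1],
    rewards=[1.0, 1.0, 1.0, 1.0, 1.0],
    terminals=[0, 0, 1, 0, 1],
)
```

One consequence worth noting: if every terminal flag is set, each step becomes its own one-step episode.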

Algorithms

Neural Fitted Q-iteration (NFQ)

https://link.springer.com/chapter/10.1007/11564096_32

Enhancements

AWAC, CRR and IQL use a non-squashed gaussian policy function.

The more tutorial pages have been added to the documentation.

The software design page has been added to the documentation.

The reproduction script for IQL has been added.

The progress bar in online training is visually improved in Jupyter Notebook #161 (thanks, @aiueola )

The nan checks have been added to MDPDataset.

The target_reduction_type and bootstrap options have been removed.

Bugfix

The unnecessary test conditions have been removed

Typo in dataset.pyx has been fixed #167 (thanks, @zbzhu99 )

The details of IQL implementation have been fixed.

v1.0.0(Dec 18, 2021)
We are proud to announce that v1.0.0 has finally been released! The first version was released in Aug 2020 with the support of the IPA MITOU program. At the first release, d3rlpy only supported a few algorithms and did not even support online training. After months of constructive feedback and insights from users and the community, d3rlpy has been established as the first offline deep RL library with support for many online and offline algorithms and unique features. The next chapter, towards the ambitious v2.0.0, also starts today. Please stay tuned for the next announcement!

NeurIPS 2021 Offline RL Workshop

The workshop paper about d3rlpy has been presented at the NeurIPS 2021 Offline RL Workshop. URL: https://arxiv.org/abs/2111.03788

Benchmarks

The full benchmark results are finally available at d3rlpy-benchmarks.

Algorithms

Implicit Q-Learning (IQL)

https://arxiv.org/abs/2110.06169

Enhancements

deterministic option is added to collect method

rollout_return metrics is added to online training

random_steps is added to fit_online method

--save option is added to d3rlpy CLI commands (thanks, @pstansell )

multiplier option is added to reward normalizers

many reproduction scripts are added

policy_type option is added to BC

get_atari_transition function is added for the Atari 2600 offline benchmark procedure

Bugfix

document fix (thanks, @araffin )

Fix TD3+BC's actor loss function

Fix gaussian noise for TD3 exploration

Roadmap towards v2.0.0

Sophisticated config system using dataclasses

Dump configuration and model parameters in a single file

Change MDPDataset format to align with D4RL datasets

Support large dataset

Support tuple observation

Support large-scale data-parallel offline training

Support large-scale distributed online training

Support Transformer architecture (e.g. Decision Transformer)

Speed up training with torch.jit.script and CUDA Graphs

Change library name to represent the unification of offline and online

v0.91(Jul 25, 2021)
Algorithm

TD3+BC

https://arxiv.org/abs/2106.06860

RewardScaler

From this version, the preprocessors are available for the rewards, which allow you to normalize, standardize and clip the reward values.

import d3rlpy

# normalize
cql = d3rlpy.algos.CQL(reward_scaler="min_max")

# standardize
cql = d3rlpy.algos.CQL(reward_scaler="standardize")

# clip (you can't use string alias)
cql = d3rlpy.algos.CQL(reward_scaler=d3rlpy.preprocessing.ClipRewardScaler(-1.0, 1.0))
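The three reward transformations can be sketched in plain Python as follows. These are the textbook definitions and are assumptions for illustration: the function names are hypothetical, and the exact statistics or epsilon that d3rlpy uses internally may differ.

```python
import statistics
from typing import List, Sequence


def min_max_scale(rewards: Sequence[float]) -> List[float]:
    """Normalize: map the smallest reward to 0 and the largest to 1."""
    low, high = min(rewards), max(rewards)
    if high == low:
        raise ValueError("rewards are constant; min-max scaling is undefined")
    return [(r - low) / (high - low) for r in rewards]


def standardize(rewards: Sequence[float], eps: float = 1e-3) -> List[float]:
    """Standardize: subtract the mean and divide by the standard deviation."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    # eps (an assumed safeguard) avoids division by zero for constant rewards.
    return [(r - mean) / (std + eps) for r in rewards]


def clip_rewards(rewards: Sequence[float], low: float, high: float) -> List[float]:
    """Clip: bound every reward to the interval [low, high]."""
    return [max(low, min(high, r)) for r in rewards]


normalized = min_max_scale([0.0, 5.0, 10.0])
clipped = clip_rewards([-3.0, 0.2, 7.0], -1.0, 1.0)
standardized = standardize([0.0, 5.0, 10.0])
```

Clipping needs explicit bounds, which fits the note above that it cannot be selected through a string alias.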

copy_policy_from and copy_q_function_from methods

In the scenario of finetuning, you might want to initialize SAC's policy function with the pretrained CQL's policy function to boost the initial performance. From this version, you can do that as follows:

import d3rlpy

# pretrain with static dataset
cql = d3rlpy.algos.CQL()
cql.fit(...)

# transfer the policy function
sac = d3rlpy.algos.SAC()
sac.copy_policy_from(cql)

# you can also transfer the Q-function
sac.copy_q_function_from(cql)

# finetuning with online algorithm
sac.fit_online(...)

Enhancements

show messages for skipping model builds

add alpha parameter option to DiscreteCQL

keep counting the number of gradient steps

allow expanding MDPDataset with the larger discrete actions (thanks, @jamartinh )

callback function is called every gradient step (previously, it's called every epoch)
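The difference between a per-epoch and a per-gradient-step callback can be sketched with a toy training loop. Everything here is hypothetical: run_training stands in for the real training routine, and the callback argument order (algo, epoch, total_step) is an assumption for illustration.

```python
from typing import Callable, List, Tuple

# Assumed callback shape: (algo, epoch, total_step) -> None.
Callback = Callable[[object, int, int], None]


def run_training(n_epochs: int, n_steps_per_epoch: int, callback: Callback) -> int:
    """Toy loop that invokes the callback after every gradient step."""
    algo = object()  # placeholder for the algorithm being trained
    total_step = 0
    for epoch in range(1, n_epochs + 1):
        for _ in range(n_steps_per_epoch):
            total_step += 1
            # A real implementation would perform one gradient update here.
            callback(algo, epoch, total_step)
    return total_step


log: List[Tuple[int, int]] = []


def record_step(algo: object, epoch: int, total_step: int) -> None:
    """Example callback that records when it was called."""
    log.append((epoch, total_step))


steps_run = run_training(n_epochs=2, n_steps_per_epoch=3, callback=record_step)
```

A callback that should act only occasionally can check total_step itself, for example by returning early unless total_step is a multiple of some interval.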

Bugfix

FQE's loss function has been fixed (thanks for the report, @guyk1971)

fix documentation build (thanks, @astrojuanlu)

fix d4rl dataset conversion for MDPDataset (this will have a significant impact on the performance for d4rl dataset)

v0.90(May 28, 2021)
Algorithm

Conservative Offline Model-Based Policy Optimization (COMBO)

https://arxiv.org/abs/2102.08363

Drop data augmentation feature

From this version, the data augmentation feature has been dropped because it introduced a lot of code complexity. Dropping it lets d3rlpy support many algorithms while staying as simple as possible. In its place, TorchMiniBatch was introduced internally, and all algorithms became simpler.

collect method

In offline RL experiments, data collection plays an important role, especially when you try new tasks. From this version, the collect method is finally available.

import d3rlpy
import gym

# prepare environment
env = gym.make('Pendulum-v0')

# prepare algorithm
sac = d3rlpy.algos.SAC()

# prepare replay buffer
buffer = d3rlpy.online.buffers.ReplayBuffer(maxlen=100000, env=env)

# start data collection without updates
sac.collect(env, buffer)

# export to MDPDataset
dataset = buffer.to_mdp_dataset()

# save as file
dataset.dump('pendulum.h5')

Along with this change, random policies are also introduced. These are useful for collecting a dataset with a random policy.

# continuous action-space
policy = d3rlpy.algos.RandomPolicy()

# discrete action-space
policy = d3rlpy.algos.DiscreteRandomPolicy()

Enhancements

CQL and BEAR become closer to the official implementations

callback argument has been added to algorithms

random datasets have been added for the cartpole and pendulum tasks

you can select them via dataset_type='random' in the get_cartpole and get_pendulum functions

Bugfix

fix action normalization at predict_value method (thanks, @navidmdn )

fix seed settings at reproduction codes

What's missing before v1.00?

Currently, I'm benchmarking all algorithms with the d4rl datasets. Through the experiments, I realized that it's very difficult to reproduce the tables reported in the papers because they don't reveal the full hyper-parameters, which are tuned for each dataset. So I gave up reproducing the tables and started producing numbers with the official code to see whether d3rlpy's results match.
v0.80(Apr 24, 2021)
Algorithms

New algorithms are introduced in this version.

Critic Regularized Regression (CRR)

https://arxiv.org/abs/2006.15134

Model-based Offline Policy Optimization (MOPO)

https://arxiv.org/abs/2005.13239

Model-based RL

Model-based RL was already supported in previous versions, with the model-based logic implemented on the dynamics side. This approach enabled us to combine model-based algorithms with arbitrary model-free algorithms. However, it requires complex designs to implement recent model-based RL algorithms. So the dynamics interface was refactored, and MOPO is the first algorithm to show how d3rlpy now supports model-based RL algorithms.

# train dynamics model
from d3rlpy.datasets import get_pendulum
from d3rlpy.dynamics import ProbabilisticEnsembleDynamics
from d3rlpy.metrics.scorer import dynamics_observation_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_reward_prediction_error_scorer
from d3rlpy.metrics.scorer import dynamics_prediction_variance_scorer
from sklearn.model_selection import train_test_split

dataset, _ = get_pendulum()

train_episodes, test_episodes = train_test_split(dataset)

# use the class imported above (the module name d3rlpy itself is not imported)
dynamics = ProbabilisticEnsembleDynamics(learning_rate=1e-4, use_gpu=True)
dynamics.fit(train_episodes,
             eval_episodes=test_episodes,
             n_epochs=100,
             scorers={
                 'observation_error': dynamics_observation_prediction_error_scorer,
                 'reward_error': dynamics_reward_prediction_error_scorer,
                 'variance': dynamics_prediction_variance_scorer,
             })

# train Model-based RL algorithm
from d3rlpy.algos import MOPO

# pass the trained dynamics model to MOPO
mopo = MOPO(dynamics=dynamics)
mopo.fit(dataset, n_steps=100000)

enhancements

fitter method has been implemented (thanks @jamartinh )

tensorboard_dir replaces the tensorboard flag in the fit method (thanks @navidmdn )

show warning messages when the unused arguments are passed

show comprehensive error messages when action-space is not compatible

fit method accepts MDPDataset object

dropout option has been implemented in encoders

add appropriate __repr__ methods to show pretty outputs when print(algo)

metrics collection is refactored

bugfix

fix core dumped errors by fixing numpy version

fix CQL backup

v0.70(Feb 18, 2021)
Command Line Interface

New commands are added in this version.

record

You can record the video of the evaluation episodes without coding anything.

$ d3rlpy record d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0

# record wrapped environment
$ d3rlpy record d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
    --env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'

play

You can run the evaluation episodes with rendering images.

# play simple environment
$ d3rlpy play d3rlpy_logs/CQL_20201224224314/model_100.pt --env-id HopperBulletEnv-v0

# play wrapped environment
$ d3rlpy play d3rlpy_logs/Discrete_CQL_20201224224314/model_100.pt \
    --env-header 'import gym; env = d3rlpy.envs.Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'

data-point mask for bootstrapping

Ensemble training of Q-functions has been shown to be a powerful way to make training more robust. The bootstrap option was already available for algorithms, but the mask applied to the Q-function loss was created randomly each time a batch was sampled.

In this version, the create_mask option is available for MDPDataset and ReplayBuffer; it creates a unique mask for each data point.

# offline training
dataset = d3rlpy.dataset.MDPDataset(observations, actions, rewards, terminals,
                                    create_mask=True, mask_size=5)
cql = d3rlpy.algos.CQL(n_critics=5, bootstrap=True, target_reduction_type='none')
cql.fit(dataset)

# online training
buffer = d3rlpy.online.buffers.ReplayBuffer(1000000, create_mask=True, mask_size=5)
sac = d3rlpy.algos.SAC(n_critics=5, bootstrap=True, target_reduction_type='none')
sac.fit_online(env, buffer)

As shown above, target_reduction_type is newly introduced to specify how target Q-values are aggregated. Standard Soft Actor-Critic corresponds to target_reduction_type='min'. If you choose 'none', each Q-function in the ensemble uses its own target value, which is similar to what Bootstrapped DQN does.
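The interaction between a per-data-point mask and the target reduction can be sketched in plain Python. This is an illustration of the idea under stated assumptions, not d3rlpy's code: the helper names `make_mask`, `reduce_targets`, and `masked_td_loss` are hypothetical, and the keep probability of 0.5 is an arbitrary choice.

```python
import random


def make_mask(n_critics, rng, keep_prob=0.5):
    """Draw one fixed 0/1 mask per data point (one entry per critic)."""
    return [1.0 if rng.random() < keep_prob else 0.0 for _ in range(n_critics)]


def reduce_targets(target_qs, reduction):
    """Aggregate per-critic target Q-values for a single transition."""
    if reduction == "min":
        # every critic regresses toward the same pessimistic target
        return [min(target_qs)] * len(target_qs)
    if reduction == "none":
        # each critic keeps its own target, as in Bootstrapped DQN
        return list(target_qs)
    raise ValueError(f"unknown reduction: {reduction}")


def masked_td_loss(qs, targets, mask):
    """Squared TD error per critic, zeroed where the mask is 0."""
    return sum(m * (q - t) ** 2 for q, t, m in zip(qs, targets, mask))


rng = random.Random(0)
# the mask is drawn once when the transition is stored, then reused
# every time that transition is sampled
mask = make_mask(n_critics=5, rng=rng)
```

Because the mask is stored with the data point, each critic consistently sees the same subset of the data, which is what keeps the ensemble members diverse.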

better module access

From this version, all modules are accessible through the top-level d3rlpy package.

# previously
from d3rlpy.datasets import get_cartpole
dataset, env = get_cartpole()

# v0.70
import d3rlpy
dataset, env = d3rlpy.datasets.get_cartpole()

new logger style

From this version, structlog is used internally to print information instead of the raw print function, which allows more structured output. You can also control what is shown and what is saved to the file by overriding the logger configuration.

enhancements

soft_q_backup option is added to CQL.

Paper Reproduction page has been added to the documentation in order to show the performance with the paper configuration.

commit method at D3RLPyLogger returns metrics (thanks, @jamartinh )

bugfix

fix epoch count in offline training.

fix total_step count in online training.

fix typos at documentation (thanks, @pstansell )

v0.61(Jan 31, 2021)
CLI

record command is newly introduced in this version. You can record videos of evaluation episodes with the saved model.

$ d3rlpy record d3rlpy_logs/CQL_20210131144357/model_100.pt --env-id Hopper-v2

You can also use the wrapped environment.

$ d3rlpy record d3rlpy_logs/DQN_online_20210130170041/model_1000.pt \
    --env-header 'import gym; from d3rlpy.envs import Atari; env = Atari(gym.make("BreakoutNoFrameskip-v4"), is_eval=True)'

bugfix

fix saving models every step in fit_online method

fix Atari wrapper to reproduce the paper result

fix CQL and BEAR algorithms

v0.60(Jan 27, 2021)
logo

New logo images are made for d3rlpy 🎉

(Logo images: standard and inverted versions.)

ActionScaler

ActionScaler provides action scaling as pre- and post-processing for continuous control algorithms. Previously, actions had to lie in the range [-1.0, 1.0]. From now on, you don't need to care about the range of actions.

from d3rlpy.algos import CQL

cql = CQL(action_scaler='min_max')  # just pass action_scaler argument
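As a rough picture of what min-max action scaling involves, the sketch below maps each action dimension from its observed range to [-1, 1] before training and back again after prediction. It assumes a linear mapping and uses hypothetical helper names (`min_max_scale`, `min_max_unscale`); it is not taken from d3rlpy's source.

```python
def min_max_scale(action, low, high):
    """Map each action dimension from [low, high] to [-1, 1]."""
    return [
        2.0 * (a - lo) / (hi - lo) - 1.0
        for a, lo, hi in zip(action, low, high)
    ]


def min_max_unscale(scaled, low, high):
    """Inverse mapping from [-1, 1] back to the original range."""
    return [
        (s + 1.0) / 2.0 * (hi - lo) + lo
        for s, lo, hi in zip(scaled, low, high)
    ]


# per-dimension bounds, e.g. the minimum and maximum found in a dataset
low, high = [0.0, 0.0], [4.0, 10.0]
scaled = min_max_scale([1.0, 5.0], low, high)
restored = min_max_unscale(scaled, low, high)
```

The algorithm then only ever sees actions in [-1, 1], while the environment receives actions in its native range.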

handling timeout episodes

Episodes that end because of a timeout should still bootstrap from the value of the next state; only true environment terminations should cut off bootstrapping. From this version, you can specify episode boundaries separately from the terminal flags.

from d3rlpy.dataset import MDPDataset

observations = ...
actions = ...
rewards = ...
terminals = ...  # this indicates the environmental termination
episode_terminals = ...  # this indicates episode boundaries

datasets = MDPDataset(observations, actions, rewards, terminals, episode_terminals)

# if episode_terminals are omitted, terminals will be used to specify episode boundaries
# datasets = MDPDataset(observations, actions, rewards, terminals)

In online training, you can specify this option via timelimit_aware flag.

import gym

from d3rlpy.algos import SAC

# make sure the environment is wrapped by gym.wrappers.TimeLimit
env = gym.make('Hopper-v2')

sac = SAC()
sac.fit_online(env, timelimit_aware=True)  # this flag is True by default

reference: https://arxiv.org/abs/1712.00378
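The distinction between a terminal flag and an episode boundary matters in the TD target. The sketch below shows a one-step target in which only a true termination stops bootstrapping. The function name `td_target` and the discount value are illustrative assumptions.

```python
def td_target(reward, next_value, terminal, gamma=0.99):
    """One-step TD target; bootstrapping stops only on true termination."""
    return reward + gamma * (1.0 - terminal) * next_value


# the environment ended the episode (e.g. the agent fell): no bootstrapping
failed = td_target(reward=1.0, next_value=10.0, terminal=1.0)

# the episode was cut by a time limit: terminal stays 0.0, so the target
# still includes the discounted value of the next state
timed_out = td_target(reward=1.0, next_value=10.0, terminal=0.0)
```

Treating a timeout as a termination would make the value function underestimate states that merely happen to occur late in an episode.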

batch online training

When training with computationally expensive environments such as robotics simulators or rich 3D games, it will take a long time to finish due to the slow environment steps. To solve this, d3rlpy supports batch online training.

import gym

from d3rlpy.algos import SAC
from d3rlpy.envs import AsyncBatchEnv

if __name__ == '__main__':  # this is necessary if you use AsyncBatchEnv
    # distributing 10 environments in different processes
    env = AsyncBatchEnv([lambda: gym.make('Hopper-v2') for _ in range(10)])

    sac = SAC(use_gpu=True)
    sac.fit_batch_online(env)  # train with 10 environments concurrently

docker image

A pre-built d3rlpy Docker image is available on Docker Hub.

$ docker run -it --gpus all --name d3rlpy takuseno/d3rlpy:latest bash

enhancements

BEAR algorithm is updated based on the official implementation

new mmd_kernel option is available

to_mdp_dataset method is added to ReplayBuffer

ConstantEpsilonGreedy explorer is added

d3rlpy.envs.ChannelFirst wrapper is added (thanks for reporting, @feyza-droid )

new dataset utility function d3rlpy.datasets.get_d4rl is added

this function handles timeouts internally

offline RL paper reproduction codes are added

smoothed moving average plot at d3rlpy plot CLI function (thanks, @pstansell )

user-friendly messages for assertion errors

better memory consumption

save_interval argument is added to fit_online

bugfix

core dumps are fixed in Google Colaboratory tutorials

typos in the documentation are fixed (thanks for reporting, @pstansell )

v0.51(Jan 10, 2021)
minor fix

add typing-extensions dependency

update MANIFEST.in

v0.50(Jan 9, 2021)
typing

d3rlpy is now fully type-annotated, which improves both the experience of using the library and the experience of contributing to it.

mypy and pylint check type consistency and code quality.

because adding type annotations required many changes, there may be regressions that the linters do not detect.

CLI

v0.50 introduces a new command-line interface, the d3rlpy command, which lets you do more with little effort. For now, d3rlpy provides the following commands.

# plot CSV data
$ d3rlpy plot d3rlpy_logs/XXX/YYY.csv

# plot all CSV data in a log directory
$ d3rlpy plot-all d3rlpy_logs/XXX

# export the saved model in an inference format (e.g. ONNX, TorchScript)
$ d3rlpy export d3rlpy_logs/XXX/model_YYY.pt

enhancements

faster CPU to GPU transfer

this change makes online training x2 faster

make IQN Q function more precise based on the paper

documentation

Add doc about SB3 integration ( thanks, @araffin )

v0.41(Dec 20, 2020)
Algorithm

Policy in Latent Action Space (PLAS)

https://arxiv.org/abs/2011.07213

Off-Policy Evaluation

Off-policy evaluation (OPE) is a method to evaluate policy performance only with the offline dataset.

# train policy
from d3rlpy.algos import CQL
from d3rlpy.datasets import get_pybullet

dataset, env = get_pybullet('hopper-bullet-mixed-v0')

cql = CQL()
cql.fit(dataset.episodes)

# Off-Policy Evaluation
from d3rlpy.ope import FQE
from d3rlpy.metrics.scorer import soft_opc_scorer
from d3rlpy.metrics.scorer import initial_state_value_estimation_scorer

fqe = FQE(algo=cql)
fqe.fit(dataset.episodes,
        eval_episodes=dataset.episodes,
        scorers={
            'soft_opc': soft_opc_scorer(1000),
            'init_value': initial_state_value_estimation_scorer
        })

Fitted Q-Evaluation

https://arxiv.org/abs/2007.09055

Q Function Factory

d3rlpy provides flexible controls over Q functions through Q function factory. Following this change, the previous q_func_type argument was renamed to q_func_factory.

from d3rlpy.algos import DQN
from d3rlpy.q_functions import QRQFunctionFactory

# initialize Q function factory
q_func_factory = QRQFunctionFactory(n_quantiles=32)

# give it to algorithm object
dqn = DQN(q_func_factory=q_func_factory)

You can pass Q function name as string too.

dqn = DQN(q_func_factory='qr')

You can also make your own Q function factory. Currently, these are the supported Q function factories.

MeanQFunctionFactory

QRQFunctionFactory

IQNQFunctionFactory

FQFQFunctionFactory

EncoderFactory

DenseNet architecture (only for vector observation)

https://arxiv.org/abs/2010.09163

from d3rlpy.algos import DQN

dqn = DQN(encoder_factory='dense')

N-step TD calculation

d3rlpy supports N-step TD calculation for ALL algorithms. You can pass the n_steps argument to configure this parameter.

from d3rlpy.algos import DQN

dqn = DQN(n_steps=5)  # n_steps=1 by default
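For readers unfamiliar with N-step TD, the target sums N discounted rewards and then bootstraps from the value estimate N steps ahead. The following is a minimal sketch of that calculation; the function name `n_step_target` and its arguments are illustrative and not part of d3rlpy.

```python
def n_step_target(rewards, bootstrap_value, terminal, gamma=0.99):
    """Discounted sum of N rewards plus a discounted bootstrap value.

    rewards:         the N rewards r_t, ..., r_{t+N-1}
    bootstrap_value: value estimate at step t+N
    terminal:        1.0 if the episode terminated within the N steps, else 0.0
    """
    n = len(rewards)
    discounted_sum = sum(gamma ** k * r for k, r in enumerate(rewards))
    return discounted_sum + (gamma ** n) * (1.0 - terminal) * bootstrap_value


# N=1 reduces to the ordinary one-step TD target
one_step = n_step_target([1.0], bootstrap_value=4.0, terminal=0.0, gamma=0.5)
# N=2 uses two real rewards before bootstrapping
two_step = n_step_target([1.0, 1.0], bootstrap_value=4.0, terminal=0.0, gamma=0.5)
```

Larger N relies more on observed rewards and less on the current value estimate, which trades lower bias for higher variance.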

Paper reproduction scripts

d3rlpy supports many algorithms across both the online and offline paradigms. d3rlpy was originally designed for industrial practitioners, but academic research remains important for pushing deep reinforcement learning forward. Currently, reproduction code is available for the following online DQN variants.

DQN

Double DQN

QR-DQN

IQN

FQF

The evaluation results will also be available soon.

enhancements

build_with_dataset and build_with_env methods are added to algorithm objects

shuffle flag is added to fit method (thanks, @jamartinh )

v0.40(Nov 26, 2020)
Algorithms

Support the discrete version of Soft Actor-Critic

https://arxiv.org/abs/1910.07207

fit_online has n_steps argument instead of n_epochs for the complete reproduction of the papers.

OptimizerFactory

d3rlpy provides more flexible controls for optimizer configuration via OptimizerFactory.

from d3rlpy.optimizers import AdamFactory
from d3rlpy.algos import DQN

dqn = DQN(optim_factory=AdamFactory(weight_decay=1e-4))

See more at https://d3rlpy.readthedocs.io/en/v0.40/references/optimizers.html .
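One reason a factory is useful here is that the optimizer cannot be constructed until the network parameters exist, which happens only after the algorithm has been built. The sketch below illustrates that deferred-construction pattern with toy classes so that it runs without PyTorch. `SketchOptimizerFactory` and `DummySGD` are hypothetical names and do not reflect d3rlpy's actual classes.

```python
class SketchOptimizerFactory:
    """Hypothetical stand-in: stores an optimizer class and its keyword arguments."""

    def __init__(self, optim_cls, **kwargs):
        self.optim_cls = optim_cls
        self.kwargs = kwargs

    def create(self, params, lr):
        # construction is deferred until the parameters exist
        return self.optim_cls(params, lr=lr, **self.kwargs)


class DummySGD:
    """Toy optimizer used only to keep the sketch self-contained."""

    def __init__(self, params, lr, weight_decay=0.0):
        self.params = params
        self.lr = lr
        self.weight_decay = weight_decay


# the user configures the factory up front ...
factory = SketchOptimizerFactory(DummySGD, weight_decay=1e-4)
# ... and the algorithm calls create() later, once parameters are available
optimizer = factory.create(params=[0.0, 0.0], lr=3e-4)
```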

EncoderFactory

d3rlpy provides more flexible controls for the neural network architecture via EncoderFactory.

from d3rlpy.algos import DQN
from d3rlpy.encoders import VectorEncoderFactory

# encoder factory
encoder_factory = VectorEncoderFactory(hidden_units=[300, 400], activation='tanh')

# set EncoderFactory
dqn = DQN(encoder_factory=encoder_factory)

You can also build your own encoders.

import torch
import torch.nn as nn

from d3rlpy.algos import DQN
from d3rlpy.encoders import EncoderFactory

# your own neural network
class CustomEncoder(nn.Module):
    def __init__(self, observation_shape, feature_size):
        super().__init__()  # required before registering submodules
        self.feature_size = feature_size
        self.fc1 = nn.Linear(observation_shape[0], 64)
        self.fc2 = nn.Linear(64, feature_size)

    def forward(self, x):
        h = torch.relu(self.fc1(x))
        h = torch.relu(self.fc2(h))
        return h

    # THIS IS IMPORTANT!
    def get_feature_size(self):
        return self.feature_size

# your own encoder factory
class CustomEncoderFactory(EncoderFactory):
    TYPE = 'custom'  # this is necessary

    def __init__(self, feature_size):
        self.feature_size = feature_size

    def create(self, observation_shape, action_size=None, discrete_action=False):
        return CustomEncoder(observation_shape, self.feature_size)

    def get_params(self, deep=False):
        return {
            'feature_size': self.feature_size
        }

dqn = DQN(encoder_factory=CustomEncoderFactory(feature_size=64))

See more at https://d3rlpy.readthedocs.io/en/v0.40/references/network_architectures.html .

Stable Baselines 3 wrapper

Now d3rlpy is partially compatible with Stable Baselines 3.

https://github.com/takuseno/d3rlpy/blob/master/d3rlpy/wrappers/sb3.py

More documentation will be available soon.

bugfix

fix the memory leak problem at fit_online.

Now, you can train online algorithms with the big replay buffer size for the image observation.

fix preprocessing at CQL.

fix ColorJitter augmentation.

installation

PyPI

From this version, d3rlpy officially supports Windows.

The binary packages for each platform are built in GitHub Actions and uploaded to PyPI, which means that you don't need Cython to install this package from PyPI.

Anaconda

Since the previous version, d3rlpy has been available on conda-forge.

v0.32(Oct 31, 2020)
This version is a hotfix release.

⚠️ Fix the significant bug in the case of online training with image observation.

v0.31(Oct 28, 2020)
This version introduces minor changes.

Move n_epochs arguments to fit method.

Fix scikit-learn compatibility issues.

Fix zero-division error during online training.

v0.30(Oct 27, 2020)
Algorithm

Support Advantage-Weighted Actor-Critic (AWAC)

https://arxiv.org/abs/2006.09359

fit_online method is available as a convenient alias to d3rlpy.online.iterators.train function.

the action unnormalization problem in AWR is fixed.

Metrics

The following metrics are available.

initial_state_value_estimation_scorer

https://arxiv.org/abs/1906.01624

soft_opc_scorer

https://arxiv.org/abs/2007.09055

⚠️ MDPDataset

d3rlpy.dataset module is now implemented with Cython in order to speed up memory copies.

Following operations are significantly faster than the previous version.

creating TransitionMiniBatch object

frame stacking via n_frames argument

lambda return calculation at AWR algorithms

This change makes Atari training approximately 6% faster.

v0.23(Sep 8, 2020)
Algorithm

Support Advantage-Weighted Regression (AWR)

https://arxiv.org/abs/1910.00177

n_frames option is added to all algorithms

n_frames option controls frame stacking for image observation

eval_results_ property is added to all algorithms

evaluation results can be retrieved from eval_results_ after training.

MDPDataset

prev_transition and next_transition properties are added to d3rlpy.dataset.Transition.

these properties are used for frame stacking and Monte-Carlo returns calculation at AWR.
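To illustrate how a link to the previous transition can support frame stacking, the sketch below walks backward through such links to collect the most recent frames. `SketchTransition` and `stack_frames` are hypothetical names, and repeating the earliest frame as padding is an assumption; a real implementation may pad differently, for example with zeros.

```python
class SketchTransition:
    """Hypothetical transition holding one frame and a link to its predecessor."""

    def __init__(self, observation, prev_transition=None):
        self.observation = observation
        self.prev_transition = prev_transition


def stack_frames(transition, n_frames):
    """Collect the last n_frames observations, oldest first.

    Assumption: when the episode has fewer than n_frames steps so far,
    the earliest available frame is repeated as padding.
    """
    frames = []
    current = transition
    for _ in range(n_frames):
        frames.append(current.observation)
        if current.prev_transition is not None:
            current = current.prev_transition
    return frames[::-1]


# a three-step episode linked from newest to oldest
t0 = SketchTransition("frame0")
t1 = SketchTransition("frame1", prev_transition=t0)
t2 = SketchTransition("frame2", prev_transition=t1)
stacked = stack_frames(t2, n_frames=2)
```

Storing links instead of pre-stacked frames avoids keeping several copies of each image in memory.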

Document

new tutorial page is added

v0.22(Aug 28, 2020)
Support ONNX export

Now, the trained policy can be exported as ONNX as well as TorchScript

cql.save_policy('policy.onnx', as_onnx=True)

Support more data augmentations

data augmentations for vector observation

ColorJitter augmentation for image observation

v0.2(Aug 10, 2020)
support model-based algorithm

Model-based Offline Policy Optimization

support data augmentation (for image observation)

Data-regularized Q-learning

a lot of improvements

more dataset statistics

more options to customize neural network architecture

optimize default learning rates

etc

v0.1(Jul 31, 2020)
online algorithms

Deep Q-Network (DQN)

Double DQN

Deep Deterministic Policy Gradients (DDPG)

Twin Delayed Deep Deterministic Policy Gradients (TD3)

Soft Actor-Critic (SAC)

data-driven algorithms

Batch-Constrained Q-learning (BCQ)

Bootstrapping Error Accumulation Reduction (BEAR)

Conservative Q-Learning (CQL)

Q functions

mean

Quantile Regression

Implicit Quantile Network

Fully Parameterized Quantile Function (experimental)
