Reinforcement learning library (framework) designed for PyTorch; implements DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA ...

Iffi

Last update: Dec 24, 2022

Related tags

Deep Learning python reinforcement-learning deep-learning pytorch distributed dqn ddpg sac ppo prioritized-experience-replay td3 pytorch-lightning pytorch-reinforcement-learning a3c-pytorch

Overview

Automatic, Readable, Reusable, Extendable

Machin is a reinforcement learning library designed for PyTorch.

Supported Models

Anything, including recurrent networks.

Supported algorithms

Machin currently implements the following algorithms, and the list is still growing:

Single agent algorithms:

Multi-agent algorithms:

Multi-agent DDPG (MADDPG)

Imitation learning algorithms (Behavioral Cloning, Inverse RL, GAIL)

Generative Adversarial Imitation Learning (GAIL)

Massively parallel algorithms:

Enhancements:

Algorithms to be supported:

Evolution Strategies
QMIX (multi agent)
Model-based methods

Features

1. Automatic

Starting from version 0.4.0, Machin supports automatic config generation. You can generate a configuration with:

python -m machin.auto generate --algo DQN --env openai_gym --output config.json

And automatically launch the experiment with PyTorch Lightning:

python -m machin.auto launch --config config.json
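
The generated file is plain JSON, so it can be inspected or adjusted before launching. The following is a minimal sketch using only the standard library. The helper name `override_config` and the keys shown (`frame`, `max_episodes`) are assumptions for illustration and may not match what the generate command actually writes, so check your own config.json first.

```python
import json
import os
import tempfile


def override_config(path, overrides):
    """Load a JSON config, apply top-level key overrides, and save it back."""
    with open(path) as f:
        config = json.load(f)
    config.update(overrides)
    with open(path, "w") as f:
        json.dump(config, f, indent=4)
    return config


# Stand-in for a generated config.json; these keys are hypothetical.
demo_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(demo_path, "w") as f:
    json.dump({"frame": "DQN", "max_episodes": 1000}, f)

updated = override_config(demo_path, {"max_episodes": 200})
print(updated)
```

The edited file can then be passed to the launch command shown above.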

2. Readable

Unlike other well-known reinforcement learning libraries such as rlpyt, ray, and baselines, Machin aims to provide a simple, clear implementation of RL algorithms.

All algorithms in Machin are designed with minimal abstractions and come with detailed documentation and a range of tutorials.

3. Reusable

Machin takes a similar approach to that of PyTorch, encapsulating algorithms and data structures in their own classes. Users do not need to set up a series of data collectors, trainers, runners, samplers, and so on to use them; just import.

The only restriction placed on your models is their input / output format. These restrictions are minimal, which makes it easy to adapt algorithms to your custom environments.
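
To make the input format restriction concrete, here is a minimal sketch that assumes states are passed as dictionaries whose keys match the argument names of the model's forward(). `TinyQNet` is a hypothetical stand-in for a torch.nn.Module, and the transition keys other than "state" are assumptions for illustration, not a statement of the library's exact API.

```python
class TinyQNet:
    """Stand-in for a torch.nn.Module; only the calling convention matters."""

    def forward(self, some_state):
        # One score per action (two actions in this toy example).
        total = sum(some_state)
        return [total, -total]


# A transition as a nested dict; "some_state" must match the forward() argument name.
# Keys other than "state" are assumptions for illustration.
transition = {
    "state": {"some_state": [1.0, 2.0]},
    "action": {"action": 0},
    "next_state": {"some_state": [2.0, 3.0]},
    "reward": 1.0,
    "terminal": False,
}

# The state dict is unpacked into keyword arguments of forward().
q_values = TinyQNet().forward(**transition["state"])
print(q_values)
```

Under this convention, adapting a model to a new environment mostly means choosing state keys and matching forward() argument names.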

4. Extendable

Machin is built upon PyTorch, and thanks to its powerful RPC API, we can construct complex distributed programs. Machin provides implementations of enhanced parallel execution pools, automatic model assignment, role-based RPC scaling, RPC service discovery and registration, etc.

Building on these core functions, Machin provides tested, high-performance implementations of distributed training algorithms such as A3C, APEX, and IMPALA to ease your design.

5. Reproducible

Machin is weakly reproducible: for each release, our test framework directly trains every RL framework, and if any framework cannot reach the target score, the test fails.

However, the tests are currently not guaranteed to match the experiments in the original papers exactly, because those papers use a large variety of environments.
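
The "reach the target score or fail" idea can be sketched as follows. This is not the project's actual test code: the function name, the exponential smoothing, and the consecutive-hit rule are all assumptions chosen for illustration.

```python
def reaches_target(episode_rewards, target, smoothing=0.9, required_hits=3):
    """Return True once the exponentially smoothed reward stays above
    `target` for `required_hits` consecutive episodes (hypothetical rule)."""
    smoothed = 0.0
    hits = 0
    for reward in episode_rewards:
        smoothed = smoothing * smoothed + (1 - smoothing) * reward
        hits = hits + 1 if smoothed > target else 0
        if hits >= required_hits:
            return True
    return False


# A run that plateaus at 200 passes a target of 150; one that plateaus at 100 fails.
print(reaches_target([200.0] * 100, target=150))
print(reaches_target([100.0] * 100, target=150))
```

A test built this way would assert the return value and so fail whenever training does not reach the target within the episode budget.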

Documentation

See here. Examples are located in examples.

Installation

Machin is hosted on PyPI. Python >= 3.6 and PyTorch >= 1.6.0 are required. You can install the Machin library by typing:

pip install machin

If you use conda to manage your environments, we suggest creating a virtual environment first, so that pip does not change your packages without conda knowing:

conda create -n some_env pip
conda activate some_env
pip install machin
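
The version requirements above can be checked programmatically. The sketch below uses only the standard library; `parse_version` and `meets_requirements` are hypothetical helper names, and in real use you would pass `torch.__version__` as the second argument.

```python
import itertools
import sys


def parse_version(text):
    """Turn '1.6.0' or '1.13.1+cu117' into a tuple of ints, e.g. (1, 13, 1)."""
    core = text.split("+")[0]
    parts = []
    for piece in core.split(".")[:3]:
        # Keep only the leading digits of each component (drops 'rc1', 'a0', ...).
        digits = "".join(itertools.takewhile(str.isdigit, piece))
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def meets_requirements(python_version, torch_version):
    """Check Python >= 3.6 and PyTorch >= 1.6.0 using tuple comparison."""
    return tuple(python_version) >= (3, 6) and parse_version(torch_version) >= (1, 6, 0)


# The PyTorch version string is hard-coded here so the sketch runs without torch.
print(meets_requirements(sys.version_info[:2], "1.6.0"))
```

Comparing tuples of integers avoids the pitfall of string comparison, where "1.10.0" would sort before "1.6.0".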

Note: Currently only a fraction of all functions is supported on platforms other than Linux (the missing functions are mainly the distributed algorithms). To test whether the code is running correctly, you can run the corresponding test script for your platform in the root directory:

run_win_test.bat
run_linux_test.sh
run_macos_test.sh

Some errors may be caused by an incorrect library setup; make sure you have installed graphviz and the other dependencies.

Contributing

Any contribution would be great, so don't hesitate to submit a pull request! Please follow the instructions in this file.

Issues

If you have any issues, please use the template markdown files in the .github/ISSUE_TEMPLATE folder to format your issue before opening a new one. We will try our best to respond to your feature requests and problems.

Citing

We would be very grateful if you can cite our work in your publications:

@misc{machin,
  author = {Muhan Li},
  title = {Machin},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/iffiX/machin}},
}

Roadmap

Please see Roadmap for the exciting new features we are currently working on!

Comments

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync' when running tutorials

Hello, when I am trying to run a tutorial script, e.g. the your_first_program example, I always encounter this AttributeError during the imports:

AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

However, I fulfill the listed requirements. Is there anything I am missing, or how can I solve this?

opened by MarWaltz 16
Multi Discrete Action Spaces

Hello,

Does machin support Multi Discrete Action Spaces (two different actions in the same time step)? I've looked through the documentation but cannot find anything related to that.

João
enhancement

opened by joaomatoscf 7
[Question] Hybrid action space
Hey, I'm trying to implement a hybrid action space with an A2C agent; maybe you have some advice. My expected output is two actions: one discrete, one continuous. The network predicts 3 things:

logits for discrete action (sampling from Categorical dist)

mean for distribution of continuous action

std for same (sampling from Normal)

The net outputs the sum of the log probabilities of the actions from both distributions (same for entropy). The network successfully learns the mean and std, but the weights of the logits layer are not updated at all. What could be the reason?
opened by Misterion777 7
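
The quantity described in this issue, the sum of log-probabilities from a categorical and a normal distribution, can be written out explicitly. The sketch below is plain Python with hypothetical function names and assumes the discrete and continuous parts are independent. It illustrates the formula only and does not diagnose the missing gradient.

```python
import math


def categorical_log_prob(logits, index):
    """Log-probability of `index` under softmax(logits), computed stably."""
    max_logit = max(logits)
    log_norm = max_logit + math.log(sum(math.exp(l - max_logit) for l in logits))
    return logits[index] - log_norm


def normal_log_prob(x, mean, std):
    """Log-density of x under Normal(mean, std)."""
    var = std * std
    return -((x - mean) ** 2) / (2 * var) - math.log(std) - 0.5 * math.log(2 * math.pi)


def hybrid_log_prob(logits, discrete_action, mean, std, continuous_action):
    """Joint log-probability, assuming the two action parts are independent."""
    return categorical_log_prob(logits, discrete_action) + normal_log_prob(
        continuous_action, mean, std
    )


print(hybrid_log_prob([0.0, 0.0], 0, mean=0.0, std=1.0, continuous_action=0.0))
```

In an autograd framework, the discrete term only receives gradients if it is computed from the logits themselves (for example, the distribution's log-probability evaluated at the sampled index), not from the sampled index alone.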
ImportError: cannot import name 'FileStore'

File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
    from . import FileStore, TCPStore
ImportError: cannot import name 'FileStore'

After installing machin, I ran PPO.py and it reported an error; the other examples report the same error, as follows:

D:\Anaconda3\envs\universe\python.exe F:/machin/machin-master/examples/framework_examples/dqn.py
Traceback (most recent call last):
  File "F:/machin/machin-master/examples/framework_examples/dqn.py", line 1, in <module>
    from machin.frame.algorithms import DQN
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\__init__.py", line 1, in <module>
    from . import env, frame, model, parallel, utils
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\__init__.py", line 1, in <module>
    from . import utils, wrappers
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\__init__.py", line 1, in <module>
    from . import base, openai_gym
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\openai_gym.py", line 8, in <module>
    from machin.parallel.exception import ExceptionWithTraceback
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\__init__.py", line 2, in <module>
    from . import distributed, server, assigner, exception, pickle, thread, pool, queue
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
    from .world import (
  File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\world.py", line 14, in <module>
    import torch.distributed.distributed_c10d as dist_c10d
  File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\distributed_c10d.py", line 10, in <module>
    from .rendezvous import rendezvous, register_rendezvous_handler # noqa: F401
  File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
    from . import FileStore, TCPStore
ImportError: cannot import name 'FileStore'

Process finished with exit code 1

opened by pengxiang-pang 5
Error importing PPO

Hi,

I was using machin until today, but I think there was a very recent update, and now my code has stopped working (only for PPO; DQN still works).

With version 0.4.0 I get the error (when importing "from machin.frame.algorithms import PPO"): ModuleNotFoundError: No module named 'machin.frame.helpers'

And if I install version 0.3.4 I get the error (when updating PPO): AttributeError: 'PPO' object has no attribute 'grad_max'

Joao

opened by joaomatoscf 4
[ALTER] Readability - black standard formatting
Where to alter

It's best to alter throughout the entire library, but my specific pain started with ppo.py.

Why to alter

Custom formatting is less readable than standard black formatting.

Standard formatting means minimal diffs on any future edits - easier to review

How to alter

Run black https://github.com/psf/black on entire repository

[OPTIONAL] add this commit to list of ignored commits for git blame as described here https://akrabat.com/ignoring-revisions-with-git-blame/ - I think this can be skipped because so far you are the main author of the library and git history is not that rich.

[OPTIONAL] add pre-commit hook to run black on all future commits automatically https://black.readthedocs.io/en/stable/version_control_integration.html

Example: originally, this is how code looks that I needed to debug:

batch_size, (state, action, advantage) = \
    self.replay_buffer.sample_batch(self.batch_size,
                                    sample_method="random_unique",
                                    concatenate=concatenate_samples,
                                    sample_attrs=[
                                        "state", "action", "gae"],
                                    additional_concat_attrs=[
                                        "gae"
                                    ])

that's a festival of different indentation levels.

Here is how it looks after black reformatting:

batch_size, (state, target_value) = self.replay_buffer.sample_batch(
    self.batch_size,
    sample_method="random_unique",
    concatenate=concatenate_samples,
    sample_attrs=["state", "value"],
    additional_concat_attrs=["value"],
)
allter
opened by ikamensh 4
Variable lengths samples in batch in update()

Hey, I'm using DQN and my q-values are variable-length sequences, as I have a different number of actions for each state (my states also have different shapes). When sampling batches, the default Buffer concatenates them, which leads to a tensor error. Using update() with concatenate_samples=False still doesn't solve the problem, as the samples are now just lists and all torch operations fail. Of course, I can pad the sequences, but then argmax() can return one of the padded indexes, as it's not possible to pass the original lengths of each sample in the batch to the update() function. Is there any way to solve the problem right now, or is it yet to be implemented?
allter

opened by Misterion777 4
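
The padding problem raised in this issue is commonly handled by masking the padded positions before taking the argmax. The sketch below is a generic illustration in plain Python, not a Machin feature; `masked_argmax` is a hypothetical helper, and how the lengths would be carried through update() is left open.

```python
def masked_argmax(padded_q_values, lengths):
    """Argmax per row, ignoring positions at or beyond each row's true length."""
    results = []
    for row, length in zip(padded_q_values, lengths):
        valid = row[:length]
        results.append(max(range(length), key=valid.__getitem__))
    return results


# Rows are zero-padded to width 4; the true numbers of actions are 2 and 3.
padded = [[0.2, 0.9, 0.0, 0.0], [-1.0, -2.0, -0.5, 0.0]]
lengths = [2, 3]

# A plain argmax on the second row would pick the padded index 3.
print(masked_argmax(padded, lengths))
```

An equivalent trick that needs no separate lengths is to pad with float("-inf") instead of zero, so a padded position can never win the argmax unless the whole row is padding.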
[FEATURE] Is there a tutorial for maddpg.py

Hi, this project is really awesome and the code is well structured!

Is your feature request related to a problem? Please describe.
I have run some of the code in examples/tutorials, but cannot find anything about MARL algorithms such as maddpg.

Describe the solution you'd like
Since you have already implemented maddpg.py and test_maddpg.py, I am wondering whether you could write a tutorial for maddpg.py too. (It would be even better to implement more MARL algorithms, such as COMA, QMIX, VDN.)
enhancement

opened by ConnLiu 4
[FEATURE] Large Transition Batch Size
Is your feature request related to a problem? Please describe.

Transition batch size

I read the tutorial about RL in Spinning Up. I found that for on-policy RL, its pseudocode has a step that collects a set of trajectories. However, in your documentation, Data flow in Machin, you point out that

Currently, the constructor of the default transition implementation Transition requires batch size to be 1

and

Buffer.store_episode(): If you pass in a dict type transition object, it will be automatically converted to Transition

In your PPO example examples/ppo.py, it seems that you only save one trajectory per iteration and update on it. What should I do if I want to save a set of trajectories? Will such a change affect the update() part?

Multiple trajectories with one reward

In general scenarios, one trajectory (episode) has one total reward. However, I encountered a case where multiple trajectories share only one total reward. For example:

(trajectory1: [s,a,0,s,a,0,...,s,a], trajectory2: [s,a,0,s,a,0,...,s,a], trajectory3: [s,a,0,s,a,0,...,s,a]) ---> final reward

Imagine that many football players are playing the same game: they receive the same reward only when a goal is scored. Or imagine generating a batch of noise sequences to attack a neural network, which yields only one reward indicating the degraded performance of the network.

I give two solutions to this problem in the next section, but I am not sure which one is better. Could you give me some advice?

Describe the solution you'd like

For feature 1, it may be realized as below (I am not sure whether this affects the update() part):

while episode < max_episodes:
    for i in range(sub_episode_size):  # add a loop here
        episode += 1
        tmp_observations = []
        while not terminal and step <= max_steps:
            # ....
            tmp_observations.append(...)
        # store transition
        ppo.store_episode(tmp_observations)
    ppo.update()
    # clean buffer

For feature 2, it may have two solutions. One is to assign the final reward to every trajectory:

from collections import defaultdict

while episode < max_episodes:
    batch_trajectory = defaultdict(list)
    episode += batch_size
    while not terminal and step <= max_steps:
        # using batch state to generate batch action
        reward = env.step(batch_action)
        for i in range(batch_size):
            # assign the same reward to each trajectory
            batch_trajectory[i] += batch_state[i] + batch_action[i] + reward
    for i in range(batch_size):
        ppo.store_episode(batch_trajectory[i])
    ppo.update()
    # clean buffer

Describe alternatives you've considered

For feature 2, another solution may be to resort to multi-agent RL. Each agent manages one trajectory, and they all receive the same reward from the environment. I found that Machin has a multi-agent algorithm implementation called MADDPG. From Spinning Up I learned that this algorithm is only for continuous action spaces. Is there any plan to implement other multi-agent RL algorithms, such as multi-agent PPO for discrete action spaces?

Additional context
enhancement
opened by cjfcsjt 3

Hierarchical discrete action space

Hi! Thanks for your excellent work! I tried several RL frameworks based on PyTorch, and machin is one of the few libraries that discuss hybrid action spaces.

I'm diving into a complex environment which is a hierarchical action space problem. I hope you could give me some advice!

To explain the meaning of a hierarchical action space more clearly, here is an example from the paper Generalising Discrete Action Spaces with Conditional Action Trees. Figure 2 in the paper shows the actions decomposed into an action tree. One first selects a first-level action, then a second-level action. The first-level action space has size 3, and the second-level action space depends on the first-level choice.

Here is one possible solution:

change the transition part:

transition = { "state": {"some_state": old_state} , ...  } # old
transition = {"state": {"some_state": old_state, "valid_actions": valid_action_set } , ... }

here, the "valid_actions" contains all of the possible second-level actions based on the first-level action.

change the agent sampling(explore) flow:

state = env.reset() # state have key words "some_state" and "valid_actions", here, for initial, the valid_actions are pre-defined manually. For example, we choose the first-level action 'use', and then the valid_actions will be 'food'
while not done:
    action2 = agent.act2(state) # choose action2 from valid action set. Here, action2 is 'food'
    action1 = agent.act1(state) # choose action1 which denotes the first level action space. Here, suppose action1 is 'move'.
    env.first_level(action1) # tell env, the valid action set in next step is 'up, down, right, left' under the 'move' branch
    next_state, reward, done = env.step(action2) # next_state have key words "some_state" and "valid_actions", the valid_actions have 'up, down, right, left'.

change the actor network agent.act2 to:

import torch.nn as nn

class act2(nn.Module):
    def forward(self, state):
        # state is a dict: {"some_state": <...>, "valid_actions": <...>}
        some_state, second_level_actions = state["some_state"], state["valid_actions"]
        ...  # calculate the similarity between state and valid actions, and output logits
        return (...), None

Do you think this solution is reasonable? Is there any better way to support such a conditional hierarchical action space?

opened by cjfcsjt 3
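
The conditional selection described in this issue can be sketched with a lookup table that restricts the second-level choice to the children of the chosen first-level action. This is an illustration only, not a Machin API: `ACTION_TREE` and `select_second_level` are hypothetical names, and only the two branches named in the issue ('move' and 'use') are included.

```python
# Partial action tree using the branches mentioned in the issue.
ACTION_TREE = {
    "move": ["up", "down", "right", "left"],
    "use": ["food"],
}


def select_second_level(first_level_action, scores):
    """Pick the highest-scoring second-level action among the valid children.

    Actions missing from `scores` are treated as having score -inf.
    """
    valid = ACTION_TREE[first_level_action]
    return max(valid, key=lambda name: scores.get(name, float("-inf")))


# "food" has the highest raw score but is not valid under the "move" branch.
scores = {"up": 0.1, "left": 0.7, "food": 5.0}
print(select_second_level("move", scores))
print(select_second_level("use", scores))
```

In a network-based agent the same restriction is usually applied by masking the logits of invalid actions before sampling.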

A2C entropy minimized instead of maximized

Hi,

I think the sign of the entropy term in A2C is wrong:

if new_action_entropy is not None:
    act_policy_loss += self.entropy_weight * new_action_entropy.mean()

Instead, it should be:

if new_action_entropy is not None:
    act_policy_loss -= self.entropy_weight * new_action_entropy.mean()

Best,

Lorenzo

opened by lorenzosteccanella 2
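
The effect of the sign discussed in this issue can be checked on a toy problem. The sketch below is plain Python, not the library's code: it runs gradient descent on a loss made only of the entropy term for a two-action policy with a single logit, once with the term added and once with it subtracted. Whether a positive weight is meant to minimize or maximize entropy in Machin is a question for the library's documentation; this only shows what each sign does.

```python
import math


def bernoulli_entropy(logit):
    """Entropy of a two-action policy with P(action 0) = sigmoid(logit)."""
    p = 1.0 / (1.0 + math.exp(-logit))
    return -(p * math.log(p) + (1 - p) * math.log(1 - p))


def descend(logit, sign, entropy_weight=0.5, lr=0.5, steps=50, eps=1e-5):
    """Gradient descent on loss = sign * entropy_weight * H(logit)."""
    for _ in range(steps):
        loss_hi = sign * entropy_weight * bernoulli_entropy(logit + eps)
        loss_lo = sign * entropy_weight * bernoulli_entropy(logit - eps)
        grad = (loss_hi - loss_lo) / (2 * eps)  # central finite difference
        logit -= lr * grad
    return bernoulli_entropy(logit)


start = 1.0
# Adding the entropy term to a minimized loss lowers entropy.
print(descend(start, sign=+1.0) < bernoulli_entropy(start))
# Subtracting it raises entropy.
print(descend(start, sign=-1.0) > bernoulli_entropy(start))
```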

len(tmp_observations) < 2 on PPO raise ValueError: The parameter probs has invalid values

It seems that your code produces an error if the length of the trajectory is less than 2 (len(tmp_observations) < 2). I tested this on PPO; I don't know whether this happens with all algorithms.

The error:

ValueError: The parameter probs has invalid values

opened by lorenzosteccanella 3

Owner

Iffi

CS student, interested in AI. Currently studying at Northwestern University.

GitHub

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

pytorch-a2c-ppo-acktr Update (April 12th, 2021) PPO is great, but Soft Actor Critic can be better for many continuous control tasks. Please check out

3k Jan 9, 2023

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

PyTorch implementation of Advantage Actor Critic (A2C), Proximal Policy Optimization (PPO), Scalable trust-region method for deep reinforcement learning using Kronecker-factored approximation (ACKTR) and Generative Adversarial Imitation Learning (GAIL).

3k Dec 31, 2022

This is a clean and robust Pytorch implementation of DQN and Double DQN.

DQN/DDQN-Pytorch This is a clean and robust Pytorch implementation of DQN and Double DQN. Here is the training curve: All the experiments are trained

15 Dec 27, 2022

Tackling Obstacle Tower Challenge using PPO & A2C combined with ICM.

Obstacle Tower Challenge using Deep Reinforcement Learning Unity Obstacle Tower is a challenging realistic 3D, third person perspective and procedural

5 Feb 10, 2022

Pytorch implementations of popular off-policy multi-agent reinforcement learning algorithms, including QMix, VDN, MADDPG, and MATD3.

Off-Policy Multi-Agent Reinforcement Learning (MARL) Algorithms This repository contains implementations of various off-policy multi-agent reinforceme

183 Dec 28, 2022

A Pytorch implementation of the multi agent deep deterministic policy gradients (MADDPG) algorithm

Multi-Agent-Deep-Deterministic-Policy-Gradients A Pytorch implementation of the multi agent deep deterministic policy gradients(MADDPG) algorithm This

159 Dec 28, 2022

Quantile Regression DQN a Minimal Working Example, Distributional Reinforcement Learning with Quantile Regression

Quantile Regression DQN Quantile Regression DQN a Minimal Working Example, Distributional Reinforcement Learning with Quantile Regression (https://arx

80 Sep 17, 2022

Implementation of algorithms for continuous control (DDPG and NAF).

DEPRECATION This repository is deprecated and is no longer maintained. Please see a more recent implementation of RL for continuous control at jax-sac.

288 Dec 31, 2022

PyTorch implementation of Advantage async actor-critic Algorithms (A3C) in PyTorch

Advantage async actor-critic Algorithms (A3C) in PyTorch @inproceedings{mnih2016asynchronous, title={Asynchronous methods for deep reinforcement lea

111 Dec 8, 2022

PPO is a very popular Reinforcement Learning algorithm at present.

PPO is a very popular Reinforcement Learning algorithm at present. OpenAI takes PPO as the current baseline algorithm. We use the PPO algorithm to train a policy to give the best action in any situation.

11 Aug 23, 2021

A3C LSTM Atari with Pytorch plus A3G design

NEWLY ADDED A3G A NEW GPU/CPU ARCHITECTURE OF A3C FOR SUBSTANTIALLY ACCELERATED TRAINING!! RL A3C Pytorch NEWLY ADDED A3G!! New implementation of A3C

532 Jan 2, 2023

Implement A3C for Mujoco gym envs

pytorch-a3c-mujoco Disclaimer: my implementation right now is unstable (you can refer to the learning curve below), I'm not sure if it's my problems. A

70 Dec 12, 2022

Advantage Actor Critic (A2C): jax + flax implementation

Advantage Actor Critic (A2C): jax + flax implementation Current version supports only environments with continuous action spaces and was tested on muj

3 Jan 23, 2022

A clean and robust Pytorch implementation of PPO on continuous action space.

PPO-Continuous-Pytorch I found the current implementation of PPO on continuous action space is either somewhat complicated or not stable. And this is

56 Dec 16, 2022

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services.

banditml is a lightweight contextual bandit & reinforcement learning library designed to be used in production Python services. This library is developed by Bandit ML and ex-authors of Facebook's applied reinforcement learning platform, Reagent.

51 Dec 22, 2022

This is the code of using DQN to play Sekiro .

A very short and easy implementation of Quantile Regression DQN

A working implementation of the Categorical DQN (Distributional RL).

RL algorithm PPO and IRL algorithm AIRL written with Tensorflow.