A reinforcement learning library (framework) designed for PyTorch, implementing DQN, DDPG, A2C, PPO, SAC, MADDPG, A3C, APEX, IMPALA, and more.

Overview


Automatic, Readable, Reusable, Extendable

Machin is a reinforcement learning library designed for PyTorch.


Build status

Platform | Status
Linux    | Jenkins build
Windows  | Windows build

Supported Models


Anything, including recurrent networks.

Supported algorithms


Currently Machin has implemented the following algorithms; the list is still growing:

Single agent algorithms (e.g. DQN, DDPG, A2C, PPO, SAC):

Multi-agent algorithms (e.g. MADDPG):

Imitation learning algorithms (Behavioral Cloning, Inverse RL, GAIL)

Massively parallel algorithms (e.g. A3C, APEX, IMPALA):

Enhancements:

Algorithms to be supported:

Features


1. Automatic

Starting from version 0.4.0, Machin supports automatic config generation; you can get a configuration through:

python -m machin.auto generate --algo DQN --env openai_gym --output config.json

And automatically launch the experiment with PyTorch Lightning:

python -m machin.auto launch --config config.json

2. Readable

Compared to other reinforcement learning libraries, such as the famous rlpyt, ray, and baselines, Machin tries to provide a simple, clear implementation of RL algorithms.

All algorithms in Machin are designed with minimal abstractions and have very detailed documentation, as well as various helpful tutorials.

3. Reusable

Machin takes an approach similar to that of PyTorch, encapsulating algorithms and data structures in their own classes. Users do not need to set up a series of data collectors, trainers, runners, samplers, and so on to use them; just import.

The only restrictions placed on your models are their input/output formats; these restrictions are minimal, making it easy to adapt algorithms to your custom environments.
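
For example, a DQN trainer can be assembled from a plain nn.Module in a few lines. The following is a minimal sketch based on the tutorials; the exact constructor arguments and method names are assumed from those tutorials and may differ between versions (state/action sizes are CartPole's):

import torch as t
import torch.nn as nn
from machin.frame.algorithms import DQN

# An ordinary PyTorch model; the only restriction Machin places on it is the
# input/output format (here: a "some_state" tensor in, q-values out).
class QNet(nn.Module):
    def __init__(self, state_dim, action_num):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 16)
        self.fc2 = nn.Linear(16, 16)
        self.fc3 = nn.Linear(16, action_num)

    def forward(self, some_state):
        a = t.relu(self.fc1(some_state))
        a = t.relu(self.fc2(a))
        return self.fc3(a)

# No data collectors, trainers, runners or samplers to wire up: construct the
# framework object directly (argument order assumed from the tutorials).
dqn = DQN(QNet(4, 2), QNet(4, 2), t.optim.Adam, nn.MSELoss(reduction="sum"))
# Typical loop (method names as used in the tutorials):
#   dqn.act_discrete_with_noise(...), dqn.store_transition(...), dqn.update()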

4. Extendable

Machin is built upon PyTorch; thanks to its powerful RPC API, we can construct complex distributed programs. Machin provides implementations of enhanced parallel execution pools, automatic model assignment, role-based RPC scaling, RPC service discovery and registration, etc.

On top of these core functions, Machin provides tested, high-performance implementations of distributed training algorithms such as A3C, APEX, and IMPALA to ease your design.
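
For a rough picture of the underlying primitive, here is a minimal, self-contained example of plain torch.distributed.rpc (not Machin's own wrappers; the worker names and port are arbitrary):

import os
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp

def square(x):
    # A trivial function executed remotely on another worker.
    return x * x

def run(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)
    if rank == 0:
        # Synchronously invoke square() on worker1 and fetch the result.
        print(rpc.rpc_sync("worker1", square, args=(7,)))
    rpc.shutdown()

if __name__ == "__main__":
    mp.spawn(run, args=(2,), nprocs=2)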

5. Reproducible

Machin is weakly reproducible: for each release, our test framework directly trains every RL algorithm implementation, and if any of them cannot reach the target score, the test fails.

However, the tests are currently not guaranteed to match those in the original papers exactly, due to the large variety of environments used in the original research.

Documentation


See the documentation here. Examples are located in the examples directory.

Installation


Machin is hosted on PyPI. Python >= 3.6 and PyTorch >= 1.6.0 are required. You may install the Machin library by simply typing:

pip install machin

If you are using conda to manage your environments, you are advised to create a virtual environment first, to prevent pip from changing your packages without conda knowing.

conda create -n some_env pip
conda activate some_env
pip install machin

Note: currently, some functions (mainly the distributed algorithms) are not supported on platforms other than Linux. To test whether the code runs correctly, run the corresponding test script for your platform in the root directory:

run_win_test.bat
run_linux_test.sh
run_macos_test.sh

Some errors may occur due to an incorrect library setup; make sure you have installed graphviz, etc.

Contributing


Any contribution would be great; don't hesitate to submit a PR! Please follow the instructions in this file.

Issues


If you have any issues, please use the markdown templates in the .github/ISSUE_TEMPLATE folder to format your issue before opening a new one. We will try our best to respond to your feature requests and problems.

Citing


We would be very grateful if you cite our work in your publications:

@misc{machin,
  author = {Muhan Li},
  title = {Machin},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/iffiX/machin}},
}

Roadmap


Please see Roadmap for the exciting new features we are currently working on!

Comments
  • AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync' when running tutorials

    Hello, when I am trying to run a tutorial script, e.g. the your_first_program example, I always encounter this AttributeError during the imports:

    AttributeError: module 'torch.distributed.rpc' has no attribute 'rpc_sync'

    However, I fulfill the listed requirements. Is there anything I am missing, or how can I solve this?
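
    (Editorial note, not part of the original issue: a quick diagnostic sketch for this kind of error is to check the installed PyTorch version and whether the RPC module actually exposes the attribute.)

    import torch

    print("torch version:", torch.__version__)
    print("distributed available:", torch.distributed.is_available())

    # rpc_sync only exists in PyTorch builds with distributed RPC support
    # (roughly >= 1.4; not available in all Windows/macOS builds).
    import torch.distributed.rpc as rpc
    print("rpc_sync present:", hasattr(rpc, "rpc_sync"))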

    opened by MarWaltz 16
  • Multi Discrete Action Spaces

    Hello,

    Does Machin support multi-discrete action spaces (two different actions in the same time step)? I've looked through the documentation but cannot find anything related to that.

    João

    enhancement 
    opened by joaomatoscf 7
  • [Question] Hybrid action space

    Hey, I'm trying to implement a hybrid action space with an A2C agent; maybe you have some advice. My expected output is two actions: one discrete, one continuous. The network predicts 3 things:

    • logits for discrete action (sampling from Categorical dist)
    • mean for distribution of continuous action
    • std for same (sampling from Normal)

    The net outputs the sum of the log probabilities of the actions from both distributions (same for entropy). The network successfully learns the mean and std, but the weights of the logits layer are not updated at all. What can be the reason?
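
    (Editorial sketch, not from the original thread: a two-headed actor of the kind described above, with a Categorical head for the discrete action and a Normal head for the continuous one; names are illustrative, not Machin API.)

    import torch as t
    import torch.nn as nn
    from torch.distributions import Categorical, Normal

    class HybridActor(nn.Module):
        def __init__(self, state_dim, discrete_num, continuous_dim):
            super().__init__()
            self.base = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU())
            self.logits = nn.Linear(64, discrete_num)
            self.mean = nn.Linear(64, continuous_dim)
            self.log_std = nn.Parameter(t.zeros(continuous_dim))

        def forward(self, state):
            h = self.base(state)
            dist_d = Categorical(logits=self.logits(h))
            dist_c = Normal(self.mean(h), self.log_std.exp())
            a_d, a_c = dist_d.sample(), dist_c.sample()
            # Both heads must contribute to the joint log probability, otherwise
            # the unused head's layers receive zero gradient.
            log_prob = dist_d.log_prob(a_d) + dist_c.log_prob(a_c).sum(-1)
            entropy = dist_d.entropy() + dist_c.entropy().sum(-1)
            return (a_d, a_c), log_prob, entropy

    If the logits layer receives no gradient, a common cause is that its log-probability term gets detached from the graph (for example, by converting the sampled discrete action or its log-prob to a plain Python number) before it enters the loss.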

    opened by Misterion777 7
  • ImportError: cannot import name 'FileStore'

    File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
        from . import FileStore, TCPStore
    ImportError: cannot import name 'FileStore'

    After installing Machin, I ran PPO.py and it reported an error; other examples report the same error. As follows:

    D:\Anaconda3\envs\universe\python.exe F:/machin/machin-master/examples/framework_examples/dqn.py
    Traceback (most recent call last):
      File "F:/machin/machin-master/examples/framework_examples/dqn.py", line 1, in <module>
        from machin.frame.algorithms import DQN
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\__init__.py", line 1, in <module>
        from . import env, frame, model, parallel, utils
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\__init__.py", line 1, in <module>
        from . import utils, wrappers
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\__init__.py", line 1, in <module>
        from . import base, openai_gym
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\env\wrappers\openai_gym.py", line 8, in <module>
        from machin.parallel.exception import ExceptionWithTraceback
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\__init__.py", line 2, in <module>
        from . import distributed, server, assigner, exception, pickle, thread, pool, queue
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\__init__.py", line 1, in <module>
        from .world import (
      File "D:\Anaconda3\envs\universe\lib\site-packages\machin\parallel\distributed\world.py", line 14, in <module>
        import torch.distributed.distributed_c10d as dist_c10d
      File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\distributed_c10d.py", line 10, in <module>
        from .rendezvous import rendezvous, register_rendezvous_handler  # noqa: F401
      File "D:\Anaconda3\envs\universe\lib\site-packages\torch\distributed\rendezvous.py", line 9, in <module>
        from . import FileStore, TCPStore
    ImportError: cannot import name 'FileStore'

    Process finished with exit code 1

    opened by pengxiang-pang 5
  • Error importing PPO

    Hi,

    I was using Machin until today, but I think there was a very recent update and now my code stopped working (only for PPO; DQN still works).

    With version 0.4.0 I get the error (when importing "from machin.frame.algorithms import PPO"): ModuleNotFoundError: No module named 'machin.frame.helpers'

    And if I install version 0.3.4 I get the error (when updating PPO): AttributeError: 'PPO' object has no attribute 'grad_max'

    Joao

    opened by joaomatoscf 4
  • [ALTER] Readability - black standard formatting

    Where to alter: It's best to alter throughout the entire library, but my specific pain started with ppo.py.

    Why to alter

    • Custom formatting is less readable than standard black formatting.
    • Standard formatting means minimal diffs on any future edits - easier to review

    How to alter

    1. Run black https://github.com/psf/black on entire repository
    2. [OPTIONAL] add this commit to list of ignored commits for git blame as described here https://akrabat.com/ignoring-revisions-with-git-blame/ - I think this can be skipped because so far you are the main author of the library and git history is not that rich.
    3. [OPTIONAL] add pre-commit hook to run black on all future commits automatically https://black.readthedocs.io/en/stable/version_control_integration.html

    Example: originally, this is how the code I needed to debug looked:

                batch_size, (state, action, advantage) = \
                    self.replay_buffer.sample_batch(self.batch_size,
                                                    sample_method="random_unique",
                                                    concatenate=concatenate_samples,
                                                    sample_attrs=[
                                                        "state", "action", "gae"],
                                                    additional_concat_attrs=[
                                                        "gae"
                                                    ])
    
    • that's a festival of different indentation levels.

    Here is how it looks after black reformatting:

                batch_size, (state, target_value) = self.replay_buffer.sample_batch(
                    self.batch_size,
                    sample_method="random_unique",
                    concatenate=concatenate_samples,
                    sample_attrs=["state", "value"],
                    additional_concat_attrs=["value"],
                )
    
    allter 
    opened by ikamensh 4
  • Variable lengths samples in batch in update()

    Hey, I'm using DQN and my q-values are variable-length sequences, as I have a different number of actions for each state (my states have different shapes too). When sampling batches, the default Buffer concatenates them, which leads to a tensor error. But using update() with concatenate_samples=False still doesn't solve the problem, as samples are then just lists and all torch operations fail. Of course, I can pad the sequences, but then argmax() can return one of the padded indexes, as it's not possible to pass the original lengths of each sample in the batch to the update() function. Is there any way to solve the problem right now, or is it yet to be implemented?
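
    (Editorial sketch, not from the original thread: one common workaround is to pad the per-state action dimension and mask the padding with -inf before taking the argmax.)

    import torch as t

    def masked_argmax(q_values, lengths):
        # q_values: (batch, max_actions) padded q-value tensor
        # lengths:  (batch,) number of valid actions per state
        idx = t.arange(q_values.shape[1], device=q_values.device)
        invalid = idx.unsqueeze(0) >= lengths.unsqueeze(1)
        # Padded entries can never win the argmax once set to -inf.
        return q_values.masked_fill(invalid, float("-inf")).argmax(dim=1)

    q = t.tensor([[0.2, 0.9, 0.1], [0.5, 0.3, 0.8]])
    print(masked_argmax(q, t.tensor([2, 3])))  # tensor([1, 2])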

    allter 
    opened by Misterion777 4
  • [FEATURE] Is there a tutorial for maddpg.py

    Hi, this project is really awesome and the code is well structured!

    Is your feature request related to a problem? Please describe. I have run some code in examples/tutorials, but cannot find any about MARL algorithms such as MADDPG.

    Describe the solution you'd like: Since you have already implemented maddpg.py and test_maddpg.py, I am wondering whether you could implement a tutorial for maddpg.py too? (It would be better to implement more MARL algorithms, such as COMA, QMIX, VDN.)

    enhancement 
    opened by ConnLiu 4
  • [FEATURE] Large Transition Batch Size

    Is your feature request related to a problem? Please describe.

    1. Transition batch size

    I read the tutorial about RL in Spinning Up. I found that for on-policy RL, there is a step to collect a set of trajectories in the pseudocode. However, in your documentation Data flow in Machin, you point out that

    Currently, the constructor of the default transition implementation Transition requires batch size to be 1

    and

    Buffer.store_episode(): If you pass in a dict type transition object, it will be automatically converted to Transition

    In your PPO example examples/ppo.py, it seems that you only save one trajectory per iteration and update on it. What should I do if I want to save a set of trajectories? Will such a change affect the update() part?

    2. Multiple trajectories with one reward

    In general scenarios, one trajectory (episode) has one total reward. However, I encountered a case with multiple trajectories and only one total reward. For example:

    (trajectory1: [s,a,0,s,a,0,...,s,a], trajectory2: [s,a,0,s,a,0,...,s,a], trajectory3: [s,a,0,s,a,0,...,s,a]) ---> final reward. Imagine that many football players are playing the same game: they receive the same reward only when a goal is scored. Or imagine generating a batch of noise sequences to attack a neural network: the whole batch gets only one reward, which indicates the degraded performance of the network.

    I give two solutions to this problem in the next section, but I am not sure which one is better. Could you give me some advice?

    Describe the solution you'd like: For feature 1, it may be realized as below (I am not sure whether this affects the update() part):

    while episode < max_episodes:
        for i in range(sub_episode_size):  # add a loop here
            episode += 1
            tmp_observations = []
            while not terminal and step <= max_steps:
                # ...
                tmp_observations.append(...)  # store transition
            ppo.store_episode(tmp_observations)
        ppo.update()
        # clean buffer
    

    For feature 2, there may be two solutions. One is to assign the final reward to every trajectory:

    from collections import defaultdict

    while episode < max_episodes:
        batch_trajectory = defaultdict(list)
        episode += batch_size
        while not terminal and step <= max_steps:
            # use the batch state to generate a batch action
            reward = env.step(batch_action)
            for i in range(batch_size):
                # assign the same reward to each trajectory
                batch_trajectory[i].append((batch_state[i], batch_action[i], reward))

        for i in range(batch_size):
            ppo.store_episode(batch_trajectory[i])
        ppo.update()
        # clean buffer
    

    Describe alternatives you've considered

    For feature 2, another solution may be to resort to multi-agent RL: each agent manages one trajectory, and they all receive the same reward from the environment. I found that Machin has a multi-agent algorithm implementation called MADDPG. From Spinning Up I learned that this algorithm is only for continuous action spaces. Is there any plan to implement other multi-agent RL algorithms, such as multi-agent PPO for discrete action spaces?

    Additional context

    enhancement 
    opened by cjfcsjt 3
  • Hierarchical discrete action space

    Hi! Thanks for your excellent work! I tried several RL frameworks based on PyTorch; Machin is one of the few libraries that discuss hybrid action spaces.

    I'm diving into a complex environment with a hierarchical action space. I hope you can give me some advice!

    To explain the meaning of hierarchical action space more clearly, here is an example from the paper Generalising Discrete Action Spaces with Conditional Action Trees. Figure 2 in the paper shows the actions decomposed into an action tree: one should first select the first-level action, then select the second-level action. The action space of the first level is 3, and the action space of the second level depends on the first-level choice.

    I try to give one possible solution:

    1. change the transition part:
    transition = { "state": {"some_state": old_state} , ...  } # old
    transition = {"state": {"some_state": old_state, "valid_actions": valid_action_set } , ... }
    

    here, the "valid_actions" contains all of the possible second-level actions based on the first-level action.

    2. change the agent sampling (explore) flow:
    state = env.reset()  # state has the keys "some_state" and "valid_actions"; initially the valid_actions are pre-defined manually. For example, if we choose the first-level action 'use', then the valid_actions will be 'food'.
    while not done:
        action2 = agent.act2(state)  # choose action2 from the valid action set. Here, action2 is 'food'.
        action1 = agent.act1(state)  # choose action1, which denotes the first-level action. Here, suppose action1 is 'move'.
        env.first_level(action1)  # tell the env that the valid action set in the next step is 'up, down, right, left' under the 'move' branch
        next_state, reward, done = env.step(action2)  # next_state has the keys "some_state" and "valid_actions"; the valid_actions contain 'up, down, right, left'.
    
    3. change the actor network agent.act2 to:
    class act2(nn.Module):
      def forward(self, state = {'some_state': <...> , 'valid_actions': <...>}  ):
        state, second_level_actions = state.state, state.valid_actions
        ....  # calculate the similarity between state and valid actions, and output logits
        return (...), None
    

    Do you think this solution is reasonable? Is there any better way to support such a conditional hierarchical action space?
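
    (Editorial sketch, not from the original thread: the "valid_actions" idea above is essentially invalid-action masking; logits of second-level actions outside the current valid set can be pushed to -inf before sampling.)

    import torch as t
    from torch.distributions import Categorical

    def masked_categorical(logits, valid_mask):
        # logits:     (batch, num_actions)
        # valid_mask: (batch, num_actions) boolean, True where the action is allowed
        return Categorical(logits=logits.masked_fill(~valid_mask, float("-inf")))

    logits = t.zeros(1, 5)
    mask = t.tensor([[True, True, False, False, False]])  # only actions 0 and 1 valid
    print(masked_categorical(logits, mask).sample())  # always 0 or 1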

    opened by cjfcsjt 3
  • A2C entropy minimized instead of maximized

    Hi,

    I guess the entropy in A2C is wrong:

    if new_action_entropy is not None:
        act_policy_loss += self.entropy_weight * new_action_entropy.mean()
    

    instead it should be:

    if new_action_entropy is not None:
        act_policy_loss -= self.entropy_weight * new_action_entropy.mean()
    

    Best,

    Lorenzo

    opened by lorenzosteccanella 2
  • len(tmp_observations) < 2 on PPO raise ValueError: The parameter probs has invalid values

    It seems that your code produces an error if the length of the trajectory is less than 2 (len(tmp_observations) < 2). I tested this on PPO; I don't know if this happens with all algorithms.

    The error:

    ValueError: The parameter probs has invalid values

    opened by lorenzosteccanella 3