This is the official implementation of Multi-Agent PPO.

Overview

MAPPO

Chao Yu*, Akash Velu*, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu.

Website: https://sites.google.com/view/mappo

This repository implements MAPPO, a multi-agent variant of PPO. The implementation in this repository is used in the paper "The Surprising Effectiveness of MAPPO in Cooperative Multi-Agent Games" (https://arxiv.org/abs/2103.01955). This repository is heavily based on https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

Environments supported:

  • StarCraft Multi-Agent Challenge (SMAC)
  • Hanabi
  • Multi-agent Particle-World Environments (MPEs)

1. Usage

All core code is located within the onpolicy folder. The algorithms/ subfolder contains algorithm-specific code for MAPPO.

  • The envs/ subfolder contains environment wrapper implementations for the MPEs, SMAC, and Hanabi.

  • Code to perform training rollouts and policy updates is contained within the runner/ folder - there is a runner for each environment.

  • Executable scripts for training with default hyperparameters can be found in the scripts/ folder. The files are named in the following manner: train_algo_environment.sh. Within each file, the map name (in the case of SMAC and the MPEs) can be altered.

  • Python training scripts for each environment can be found in the scripts/train/ folder.

  • The config.py file contains relevant hyperparameter and env settings. Most hyperparameters default to the values used in the paper; however, please refer to the paper's appendix for the full list of hyperparameters used.
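
For illustration, a training script can also be invoked directly with hyperparameters overridden on the command line. This is a hedged sketch rather than a verbatim command from the repository: the env/algorithm/map values are taken from examples elsewhere in this document, while flag names such as --map_name are assumptions that should be checked against config.py.

# illustrative SMAC run with a few config.py defaults overridden on the command line
cd onpolicy/scripts
python train/train_smac.py --env_name StarCraft2 --algorithm_name rmappo \
    --experiment_name check --map_name corridor --seed 1 \
    --n_rollout_threads 8 --num_mini_batch 1 --ppo_epoch 15 --use_recurrent_policy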

2. Installation

Here we give an example installation with CUDA 10.1. For non-GPU setups or other CUDA versions, please refer to the PyTorch website.

# create conda environment
conda create -n marl python==3.6.1
conda activate marl
pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html
# install on-policy package
cd on-policy
pip install -e .

Even though we provide requirement.txt, it may contain redundant packages. We recommend installing any remaining dependencies as needed: run the code and install whichever required packages are reported as missing.
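
After installation, a quick way to confirm that the CUDA build of PyTorch is active (this check is illustrative and not part of the repository's scripts):

# should print the torch version and True on a working GPU setup
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"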

2.1 Install StarCraftII 4.10

# download SC2.4.10.zip (the Linux package from Blizzard) first, then:
unzip SC2.4.10.zip
# password is iagreetotheeula
echo "export SC2PATH=~/StarCraftII/" >> ~/.bashrc

2.2 Hanabi

The Hanabi environment code is adapted from the open-source environment code, with slight modifications to fit the algorithms used here.
To install, execute the following:

pip install cffi
cd envs/hanabi
mkdir build && cd build
cmake ..
make -j
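
Once the build completes, Hanabi training can be launched like the other environments via its script under onpolicy/scripts (train_hanabi_forward.sh, whose default flags appear later in this document); for example:

cd onpolicy/scripts
chmod +x ./train_hanabi_forward.sh
./train_hanabi_forward.sh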

2.3 Install MPE

# install this package first
pip install seaborn

There are three cooperative scenarios in MPE (the scenario is selected in the training script; see the sketch below):

  • simple_spread
  • simple_speaker_listener, which is the 'Comm' scenario in the paper
  • simple_reference
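
For illustration, the scenario is typically chosen by editing a few variables at the top of train_mpe.sh. The snippet below is a sketch, not the script's exact contents: the variable names follow the ${env}/${algo}/${exp}/${num_agents} pattern visible in the Hanabi command later in this document, and "scenario" in particular is an assumption to be checked against the actual script.

# illustrative variables near the top of train_mpe.sh
env="MPE"
scenario="simple_spread"   # or simple_speaker_listener / simple_reference
num_agents=3
algo="mappo"
exp="check"
seed=1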

3. Train

Here we use train_mpe.sh as an example:

cd onpolicy/scripts
chmod +x ./train_mpe.sh
./train_mpe.sh

Local results are stored in the scripts/results subfolder. Note that we use Weights & Biases as the default visualization platform; to use Weights & Biases, please register and log in to the platform first. More instructions for using Weights & Biases can be found in the official documentation. Adding the --use_wandb flag to the command line or to the .sh file will use TensorBoard instead of Weights & Biases.
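
When TensorBoard is used instead, the local logs can be inspected with the standard TensorBoard CLI. The path below is illustrative and assumes the event files are written under the results folder; the exact subdirectory layout depends on the env/algorithm/experiment names:

# point TensorBoard at the local results directory (path is illustrative)
tensorboard --logdir onpolicy/scripts/results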

We additionally provide ./eval_hanabi_forward.sh for evaluating the Hanabi score over 100k trials.

4. Publication

If you find this repository useful, please cite our paper:

@misc{yu2021surprising,
      title={The Surprising Effectiveness of MAPPO in Cooperative Multi-Agent Games}, 
      author={Chao Yu and Akash Velu and Eugene Vinitsky and Yu Wang and Alexandre Bayen and Yi Wu},
      year={2021},
      eprint={2103.01955},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Comments
  •  continuous action space

    Hi, can MAPPO be used for a continuous action space? How can I do this? When I change discrete_action under environment.py to False, the following error appears (error screenshot omitted).

    opened by ollehhello 10
  • Why use self.buffer[agent_id].after_update()?

    def train(self):
        train_infos = []
        for agent_id in range(self.num_agents):
            self.trainer[agent_id].prep_training()
            train_info = self.trainer[agent_id].train(self.buffer[agent_id])
            train_infos.append(train_info)
            self.buffer[agent_id].after_update()

        return train_infos

    why use after_update() after train()?

    opened by LiZhYun 6
  • Hyperparameters of IPPO

    Are the choices of IPPO hyperparameters the same as MAPPO shown in Table 12? The only difference is the value of "use_centralized_V" (False for IPPO, True for MAPPO), right? Thanks!

    opened by gbyuHub 6
  • About QMix(MG)

    Hello! I found that in your new version of the MAPPO paper, you use a concatenation of the default environment global state and all agents' local observations as the mixer network input. But why don't you instead concatenate the Feature-Pruned Agent-Specific Global States, which are used in MAPPO, to build the input of the mixer network? Is this unfair for the comparison?

    opened by Henry668 5
  • Good work, but ...

    If you let QMIX use 8 processes, increase the batch size and/or the number of training epochs per update, and finally add TD(lambda) <= 0.5, QMIX can beat all of these algorithms. See our brief hyperparameter tuning: https://arxiv.org/abs/2102.03479. I have genuinely found that in the MARL field, hyperparameter-tuning problems have led to a pile of wrong conclusions and experiments, with even the motivation wrong from the start, involving 10+ CCF-A top-conference papers; AAAI in particular even accepts papers whose proofs are wrong.

    opened by hijkzzz 5
  • The definition of the FixedNormal class has some wrong code.

    # Normal
    class FixedNormal(torch.distributions.Normal):
        def log_probs(self, actions):
            return super().log_prob(actions).sum(-1, keepdim=True)

        def entrop(self):
            return super.entropy().sum(-1)

        def mode(self):
            return self.mean

    opened by chillybird 4
  • Centralized-V between IPPO and MAPPO

    Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu

    Thanks for your work!

    Just a quick question: does turning use_centralized_V on or off only affect whether the input comes from the local observations or from the centralized state? Does it affect the structure of the actual value network used? From the code, what I can see is that no matter whether use_centralized_V is true or false, the input size is always (num_agents, input_dim) and the output values have size (num_agents, 1), so the networks are not affected by use_centralized_V. And the centralized value outputs will be the same across the num_agents, right?

    Look forward to your reply! Thank you!

    Best, Xubo

    opened by xubo92 4
  • AssertionError: check recurrent policy!

    I have run your code:

    ./train_smac.sh
    

    however, the following error occurred:

    env is StarCraft2, map is corridor, algo is mappo, exp is mlp, max seed is 1
    seed is 1:
    Traceback (most recent call last):
      File "train/train_smac.py", line 175, in <module>
        main(sys.argv[1:])
      File "train/train_smac.py", line 82, in main
        "check recurrent policy!")
    AssertionError: check recurrent policy!
    

    I changed the argument "algo" in the file "train_smac.sh":

    algo="rmappo"
    

    I don't know whether this modification is suitable or not. It did work, but the results were not satisfactory (screenshots omitted).

    Do you have any good parameters for training?

    opened by YaoweiFan 4
  • Evaluation ONLY mode in MAPE environment

    From what I understand, in MAPE evaluation is "entangled" with the training mode (there is alternation between the training and evaluation phases). Is there any way (or a script) to evaluate only a trained agent and save gifs/videos, etc.?

    opened by ConstantinosM 3
  • envs reset in data-collecting and evaluation period

    In both the data-collection and evaluation periods, when an episode terminates, the model just takes the last obs of the corresponding env as input. But I think the envs should be reset when they reach termination. Or did I miss something in the code?

    opened by gbyuHub 3
  • An error occurs when I run rmappo on football

    The output of python is listed here:

    Traceback (most recent call last):
      File "train/train_football.py", line 203, in <module>
        main(sys.argv[1:])
      File "train/train_football.py", line 188, in main
        runner.run()
      File "/onpolicy/runner/shared/football_runner.py", line 43, in run
        self.insert(data)
      File "/onpolicy/runner/shared/football_runner.py", line 141, in insert
        masks=masks
    TypeError: insert() got an unexpected keyword argument 'rnn_states'

    opened by cugbbaiyun 2
  • Time consumption of Hanabi

    Time consumption of Hanabi

    When I tried to run Hanabi experiments, I was shocked by the time consumption. The CPU I use is 12th Gen Intel(R) Core(TM) i9-12900K and GPU is NVIDIA RTX A5000, neither of which is running at full capacity (approx. 25% - 70%). I use the default parameters in train_hanabi_forward.sh

    python train/train_hanabi_forward.py --env_name ${env} --algorithm_name ${algo} --experiment_name ${exp} --hanabi_name ${hanabi} --num_agents ${num_agents} --seed 4 --n_training_threads 128 --n_rollout_threads 1000 --n_eval_rollout_threads 32 --num_mini_batch 1 --episode_length 100 --num_env_steps 10000000000000 --ppo_epoch 15 --gain 0.01 --lr 7e-4 --critic_lr 1e-3 --hidden_size 512 --layer_N 2 --use_eval --use_recurrent_policy --entropy_coef 0.015 
    

    With this setting, the running FPS is about 3000, meaning it takes 900+ hours to reach the 10B timesteps in the text, i.e. over a month. Is this time consumption consistent with your experience? If not, is there any advice that may fix my problem and improve efficiency?

    opened by Nickydusk 0
  • cannot reproduce the performance of MPE

    Hi, I have an issue when reproducing the performance of simple_spread in MPE.

    The only modifications to your code are:

    1. use --use_wandb to disable wandb in train_mpe.sh
    2. add self.envs.reset() before line 26 in mpe_runner.py
    opened by hccz95 3