This is the official implementation of Multi-Agent PPO.

Last update: Jan 6, 2023

Related tags

Reinforcement Learning algorithms multi-agent hanabi smac ppo mpes starcraftii mappo

Overview

MAPPO

Chao Yu*, Akash Velu*, Eugene Vinitsky, Yu Wang, Alexandre Bayen, and Yi Wu.

Website: https://sites.google.com/view/mappo

This repository implements MAPPO, a multi-agent variant of PPO. The implementation in this repository is used in the paper "The Surprising Effectiveness of MAPPO in Cooperative Multi-Agent Games" (https://arxiv.org/abs/2103.01955). This repository is heavily based on https://github.com/ikostrikov/pytorch-a2c-ppo-acktr-gail.

Environments supported: StarCraft II (SMAC), Hanabi, and the Multi-Agent Particle Environments (MPEs).

1. Usage

All core code is located within the onpolicy folder. The algorithms/ subfolder contains algorithm-specific code for MAPPO.

The envs/ subfolder contains environment wrapper implementations for the MPEs, SMAC, and Hanabi.
Code to perform training rollouts and policy updates is contained within the runner/ folder; there is a runner for each environment.
Executable scripts for training with default hyperparameters can be found in the scripts/ folder. The files are named in the following manner: train_algo_environment.sh. Within each file, the map name (in the case of SMAC and the MPEs) can be altered.
Python training scripts for each environment can be found in the scripts/train/ folder.
The config.py file contains the relevant hyperparameter and environment settings. Most hyperparameters default to the values used in the paper; please refer to the paper's appendix for the full list of hyperparameters used.

2. Installation

Here we give an example installation on CUDA == 10.1. For non-GPU & other CUDA version installation, please refer to the PyTorch website.

# create conda environment
conda create -n marl python==3.6.1
conda activate marl
pip install torch==1.5.1+cu101 torchvision==0.6.1+cu101 -f https://download.pytorch.org/whl/torch_stable.html

# install on-policy package
cd on-policy
pip install -e .

Although we provide requirement.txt, it may list redundant packages. We recommend running the code and installing whichever required packages are reported as missing.
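One way to follow this advice without waiting for a training run to crash is to probe for importable modules up front. The sketch below uses only the standard library. The module names in CANDIDATES are illustrative assumptions, not the repository's actual dependency list, and a module name does not always match the name of the pip package that provides it.

```python
import importlib.util

# Hypothetical top-level module names to probe; adjust to the imports your run needs.
CANDIDATES = ["torch", "numpy", "wandb", "seaborn"]


def find_missing(modules):
    """Return the top-level module names that cannot be located."""
    missing = []
    for name in modules:
        # find_spec returns None when no importable top-level module has this name.
        if importlib.util.find_spec(name) is None:
            missing.append(name)
    return missing


if __name__ == "__main__":
    for name in find_missing(CANDIDATES):
        print(f"missing module: {name}")
```

Running the script prints one line per module that still needs to be installed and prints nothing when every candidate is importable.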

2.1 Install StarCraftII 4.10

unzip SC2.4.10.zip
# password is iagreetotheeula
# append (>>) rather than overwrite (>) so the existing ~/.bashrc is preserved
echo "export SC2PATH=~/StarCraftII/" >> ~/.bashrc

Download the SMAC Maps and move them to ~/StarCraftII/Maps/.
To use a stableid, copy stableid.json from https://github.com/Blizzard/s2client-proto.git to ~/StarCraftII/.
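Before launching a SMAC run, it can help to confirm that the layout described above is in place. The following is a minimal sketch, assuming only what this section states: that SC2PATH points at the StarCraft II directory and that the maps live in a Maps subfolder. The helper name check_sc2_install is hypothetical and not part of this repository.

```python
import os
from pathlib import Path


def check_sc2_install(sc2_path=None):
    """Return a list of problems found with the StarCraft II setup (empty if none)."""
    # Fall back to the SC2PATH environment variable when no path is given.
    root = sc2_path or os.environ.get("SC2PATH")
    if not root:
        return ["SC2PATH is not set"]
    root = Path(root).expanduser()
    problems = []
    if not root.is_dir():
        problems.append(f"{root} does not exist")
    elif not (root / "Maps").is_dir():
        problems.append(f"{root / 'Maps'} does not exist")
    return problems


if __name__ == "__main__":
    for problem in check_sc2_install():
        print(problem)
```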

2.2 Hanabi

Environment code for Hanabi is developed from the open-source environment code, but has been slightly modified to fit the algorithms used here.
To install, execute the following:

pip install cffi
cd envs/hanabi
mkdir build && cd build
cmake ..
make -j

2.3 Install MPE

# install this package first
pip install seaborn

There are 3 Cooperative scenarios in MPE:

simple_spread
simple_speaker_listener, which is the 'Comm' scenario in the paper
simple_reference

3. Train

Here we use train_mpe.sh as an example:

cd onpolicy/scripts
chmod +x ./train_mpe.sh
./train_mpe.sh

Local results are stored in the scripts/results subfolder. Weights & Biases is the default visualization platform; to use it, register and log in to the platform first. More instructions can be found in the official Weights & Biases documentation. Note that adding --use_wandb on the command line or in the .sh file disables Weights & Biases and uses Tensorboard instead.
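The flag name is easy to misread, because passing --use_wandb turns Weights & Biases off. A minimal sketch of how such a flag behaves, assuming it is declared as a store_false option that defaults to True (check config.py for the actual declaration):

```python
import argparse


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed declaration: defaults to True and flips to False when the flag is present.
    parser.add_argument("--use_wandb", action="store_false", default=True)
    return parser


args_default = build_parser().parse_args([])
args_flag = build_parser().parse_args(["--use_wandb"])
print(args_default.use_wandb)  # True
print(args_flag.use_wandb)     # False
```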

We additionally provide ./eval_hanabi_forward.sh for evaluating the Hanabi score over 100k trials.

4. Publication

If you find this repository useful, please cite our paper:

@misc{yu2021surprising,
      title={The Surprising Effectiveness of MAPPO in Cooperative Multi-Agent Games}, 
      author={Chao Yu and Akash Velu and Eugene Vinitsky and Yu Wang and Alexandre Bayen and Yi Wu},
      year={2021},
      eprint={2103.01955},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}

Comments

continuous action space

Hi, can MAPPO be used for a continuous action space? How can I do this? When I change discrete_action in environment.py to False, the following error appears.

opened by ollehhello 10
Why use self.buffer[agent_id].after_update()?
    def train(self):
        train_infos = []
        for agent_id in range(self.num_agents):
            self.trainer[agent_id].prep_training()
            train_info = self.trainer[agent_id].train(self.buffer[agent_id])
            train_infos.append(train_info)
            self.buffer[agent_id].after_update()

        return train_infos

Why is after_update() called after train()?
opened by LiZhYun 6
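One plausible reading, based on how rollout storage works in the upstream pytorch-a2c-ppo-acktr-gail code, is that after_update() carries the last stored step over to slot 0 so that the next rollout continues from where the previous one stopped. The sketch below illustrates that idea only. ToyRolloutBuffer is a hypothetical stand-in, and whether this repository's buffer behaves the same way should be checked in its buffer code.

```python
class ToyRolloutBuffer:
    """Toy stand-in for a rollout buffer holding episode_length + 1 observations."""

    def __init__(self, episode_length, initial_obs):
        self.episode_length = episode_length
        self.obs = [None] * (episode_length + 1)
        self.obs[0] = initial_obs
        self.step = 0

    def insert(self, next_obs):
        # Slot step + 1 stores the observation produced by acting at slot step.
        self.obs[self.step + 1] = next_obs
        self.step = (self.step + 1) % self.episode_length

    def after_update(self):
        # Carry the final observation over so the next rollout continues from it.
        self.obs[0] = self.obs[-1]


buffer = ToyRolloutBuffer(episode_length=3, initial_obs="o0")
for obs in ["o1", "o2", "o3"]:
    buffer.insert(obs)
buffer.after_update()
print(buffer.obs[0])  # o3
```

Without the copy, slot 0 would still hold the stale observation "o0" when the next rollout starts.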
Hyperparameters of IPPO

Are the choices of IPPO hyperparameters the same as MAPPO shown in Table 12? The only difference is the value of "use_centralized_V" (False for IPPO, True for MAPPO), right? Thanks!

opened by gbyuHub 6
About QMix(MG)

Hello! I found that in your new version of the MAPPO paper, you use a concatenation of the default environment global state, as well as all agents’ local observations, as the mixer network input. But why don't you instead concatenate the Feature-Pruned Agent-Specific Global States which is used in MAPPO to build the input of the mixer network? Is this unfair for the comparison?

opened by Henry668 5
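Good work, but ...

If you let QMIX use 8 processes, increase the batch size and/or the number of training epochs per update, and finally add TD(lambda)<=0.5, QMIX can beat all of these algorithms. See our brief hyperparameter tuning: https://arxiv.org/abs/2102.03479. I have really found that, in the MARL field, hyperparameter tuning issues have led to a large number of wrong conclusions and experiments, with some papers wrong from the motivation onward. This involves more than ten CCF-A top-conference papers, especially at AAAI, where even papers with incorrect proofs get accepted.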

opened by hijkzzz 5
The definition of the FixedNormal class contains some wrong code.

    # Normal
    class FixedNormal(torch.distributions.Normal):
        def log_probs(self, actions):
            return super().log_prob(actions).sum(-1, keepdim=True)

        def entrop(self):
            return super.entropy().sum(-1)

        def mode(self):
            return self.mean

opened by chillybird 4
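The two problems in the snippet above appear to be the method name entrop and the bare super.entropy(), which raises an AttributeError because super is never called. The sketch below shows the corrected pattern, assuming the intended method name is entropy. To stay runnable without PyTorch it uses BaseNormal, a hypothetical stand-in for torch.distributions.Normal that works on plain lists, so the summations use sum() instead of .sum(-1).

```python
import math


class BaseNormal:
    """Hypothetical stand-in for torch.distributions.Normal over plain lists."""

    def __init__(self, mean, std):
        self.mean = mean
        self.std = std

    def log_prob(self, actions):
        # Per-dimension log-density of a normal distribution.
        return [
            -((a - m) ** 2) / (2 * s ** 2) - math.log(s) - 0.5 * math.log(2 * math.pi)
            for a, m, s in zip(actions, self.mean, self.std)
        ]

    def entropy(self):
        # Per-dimension entropy: 0.5 * log(2 * pi * e * s^2).
        return [0.5 + 0.5 * math.log(2 * math.pi) + math.log(s) for s in self.std]


class FixedNormal(BaseNormal):
    def log_probs(self, actions):
        # Sum per-dimension log-probabilities into one joint log-probability.
        return sum(super().log_prob(actions))

    def entropy(self):
        # super() must be called; the bare name `super` has no entropy attribute.
        return sum(super().entropy())

    def mode(self):
        return self.mean


dist = FixedNormal(mean=[0.0, 0.0], std=[1.0, 1.0])
print(dist.entropy(), dist.log_probs([0.0, 0.0]), dist.mode())
```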
Centralized-V between IPPO and MAPPO

Hi @jxwuyi @eugenevinitsky @zoeyuchao @akashvelu

Thanks for your work!

Just a quick question: does turning use_centralized_V on or off only affect whether the input comes from local observations or from the centralized state? Does it affect the structure of the value network actually used? From what I can see in the code, whether use_centralized_V is true or false, the input size is always (num_agents, input_dim) and the output values have size (num_agents, 1). So the networks are not affected by use_centralized_V, and the centralized value outputs will be the same for all num_agents, right?

Look forward to your reply! Thank you!

Best, Xubo

opened by xubo92 4
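The shape observation in this question can be illustrated with a small sketch. It assumes one common construction, in which the centralized input is the concatenation of all agents' local observations, repeated once per agent. The function critic_inputs is hypothetical, and environments such as SMAC may supply their own global state instead.

```python
def critic_inputs(obs, use_centralized_V):
    """obs: one observation (list of floats) per agent; returns one critic input per agent."""
    if not use_centralized_V:
        # Decentralized critic: each agent's value is computed from its own observation.
        return [list(o) for o in obs]
    # Centralized critic: concatenate every agent's observation and
    # hand the same joint vector to each agent's row.
    joint = [x for o in obs for x in o]
    return [list(joint) for _ in obs]


obs = [[1.0, 2.0], [3.0, 4.0]]
print(critic_inputs(obs, use_centralized_V=False))  # [[1.0, 2.0], [3.0, 4.0]]
print(critic_inputs(obs, use_centralized_V=True))   # [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
```

In both cases the leading dimension stays num_agents; only the per-row feature size changes, and in the centralized case every row is identical.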

AssertionError: check recurrent policy!

I have run your code:

./train_smac.sh

However, the following error occurred:

env is StarCraft2, map is corridor, algo is mappo, exp is mlp, max seed is 1
seed is 1:
Traceback (most recent call last):
  File "train/train_smac.py", line 175, in <module>
    main(sys.argv[1:])
  File "train/train_smac.py", line 82, in main
    "check recurrent policy!")
AssertionError: check recurrent policy!

I changed the argument "algo" in the file "train_smac.sh":

algo="rmappo"

I don't know whether this modification is suitable or not. It did work, but the results were not satisfactory:

Do you have some good parameters for training?

opened by YaoweiFan 4

Evaluation ONLY mode in MAPE environment

From what I understand, in MAPE, evaluation is "entangled" with training (the code alternates between training and evaluation phases). Is there any way (or a script) to only evaluate a trained agent and save gifs/videos etc.?

opened by ConstantinosM 3
envs reset in data-collecting and evaluation period

In both the data-collecting and evaluation periods, when an episode terminates, the model just takes the last obs of the corresponding env as input. But I think the envs should be reset once they reach termination. Or did I miss something in the code?

opened by gbyuHub 3
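Vectorized environment wrappers in the style of OpenAI Baselines commonly reset an environment inside the worker as soon as it reports termination, so the observation handed back to the runner already belongs to a new episode. The sketch below illustrates that pattern with a hypothetical toy environment; whether this repository does the same should be checked in its environment wrapper code.

```python
class CountdownEnv:
    """Hypothetical toy environment whose episode ends after `horizon` steps."""

    def __init__(self, horizon):
        self.horizon = horizon
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        done = self.t >= self.horizon
        return self.t, 0.0, done


def step_with_auto_reset(env, action):
    """Step the env; on termination, return the first observation of a fresh episode."""
    obs, reward, done = env.step(action)
    if done:
        # The terminal observation is replaced, so the policy never acts on it.
        obs = env.reset()
    return obs, reward, done


env = CountdownEnv(horizon=2)
env.reset()
print(step_with_auto_reset(env, action=0))  # (1, 0.0, False)
print(step_with_auto_reset(env, action=0))  # (0, 0.0, True)
```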
An error occurs when I run rmappo on football

The output of python is listed here:

Traceback (most recent call last):
  File "train/train_football.py", line 203, in <module>
    main(sys.argv[1:])
  File "train/train_football.py", line 188, in main
    runner.run()
  File "/onpolicy/runner/shared/football_runner.py", line 43, in run
    self.insert(data)
  File "/onpolicy/runner/shared/football_runner.py", line 141, in insert
    masks=masks
TypeError: insert() got an unexpected keyword argument 'rnn_states'

opened by cugbbaiyun 2
Time consumption of Hanabi
When I tried to run Hanabi experiments, I was shocked by the time consumption. The CPU I use is 12th Gen Intel(R) Core(TM) i9-12900K and GPU is NVIDIA RTX A5000, neither of which is running at full capacity (approx. 25% - 70%). I use the default parameters in train_hanabi_forward.sh:

python train/train_hanabi_forward.py --env_name ${env} --algorithm_name ${algo} \
    --experiment_name ${exp} --hanabi_name ${hanabi} --num_agents ${num_agents} --seed 4 \
    --n_training_threads 128 --n_rollout_threads 1000 --n_eval_rollout_threads 32 \
    --num_mini_batch 1 --episode_length 100 --num_env_steps 10000000000000 --ppo_epoch 15 \
    --gain 0.01 --lr 7e-4 --critic_lr 1e-3 --hidden_size 512 --layer_N 2 \
    --use_eval --use_recurrent_policy --entropy_coef 0.015

With this setting, the running FPS is about 3000, meaning it takes 900+ hours to reach the 10B timesteps in the text, i.e. over a month. Is this time consumption consistent with your experience? If not, is there any advice that may fix my problem and improve efficiency?
opened by Nickydusk 0
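The estimate in this question can be checked with a short calculation. The sketch below assumes a constant throughput of 3000 environment steps per second and a target of 10 billion steps, as stated in the post; the function name hours_to_reach is illustrative.

```python
def hours_to_reach(total_steps, steps_per_second):
    """Wall-clock hours needed to collect total_steps at a constant throughput."""
    return total_steps / steps_per_second / 3600


hours = hours_to_reach(total_steps=10_000_000_000, steps_per_second=3000)
print(f"{hours:.0f} hours, {hours / 24:.1f} days")  # 926 hours, 38.6 days
```

This agrees with the "900+ hours" figure quoted above.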
cannot reproduce the performance of MPE
Hi, I have an issue when reproducing the performance of simple_spread in MPE.

The only modifications to your code:

use --use_wandb to disable wandb in train_mpe.sh

add self.envs.reset() before line 26 in mpe_runner.py
opened by hccz95 3

