A general-purpose multi-agent training framework.



Step 1: build the environment

conda create -n malib python==3.7 -y
conda activate malib
pip install -e .

# for development
pip install -e ".[dev]"

Step 2: install OpenSpiel

Installation guide: OpenSpiel

Quick Start

"""PSRO with PPO for Leduc Holdem"""

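from malib.envs.poker import poker_aec_env as leduc_holdem
from malib.runner import run
from malib.rollout import rollout_func

# Instantiate the environment once to read its agent IDs and spaces.
env = leduc_holdem.env(fixed_player=True)

# NOTE: the keyword names env_description, training, algorithms and rollout
# are assumed here; check them against your installed MALib version.
run(
    # Identity mapping: each environment agent gets its own training agent.
    agent_mapping_func=lambda agent_id: agent_id,
    env_description={
        "creator": leduc_holdem.env,
        "config": {"fixed_player": True},
        "id": "leduc_holdem",
        "possible_agents": env.possible_agents,
    },
    training={
        "interface": {
            "type": "independent",
            "observation_spaces": env.observation_spaces,
            "action_spaces": env.action_spaces,
        },
    },
    algorithms={
        "PSRO_PPO": {
            "name": "PPO",
            "custom_config": {
                "gamma": 1.0,
                "eps_min": 0,
                "eps_max": 1.0,
                "eps_decay": 100,
            },
        }
    },
    rollout={
        "type": "async",
        "stopper": "simple_rollout",
        "callback": rollout_func.sequential,
    },
)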
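The `agent_mapping_func` argument in the Quick Start maps each environment agent ID to the ID of the training agent that handles it; the example uses the identity mapping. As a minimal sketch in plain Python (not MALib code), the snippet below contrasts the identity mapping with a team-level mapping. The `team_<i>_player_<j>` ID format and the helper names `team_mapping` and `group_agents` are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, List


def identity_mapping(agent_id: str) -> str:
    """Every environment agent is its own training agent."""
    return agent_id


def team_mapping(agent_id: str) -> str:
    """Map a hypothetical ID such as "team_0_player_0" to "team_0"."""
    team, _, _player = agent_id.partition("_player_")
    return team


def group_agents(
    agent_ids: List[str], mapping_func: Callable[[str], str]
) -> Dict[str, List[str]]:
    """Group environment agents by the training agent they map to."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for agent_id in agent_ids:
        groups[mapping_func(agent_id)].append(agent_id)
    return dict(groups)


agents = ["team_0_player_0", "team_0_player_1", "team_1_player_0"]
per_agent = group_agents(agents, identity_mapping)  # three training agents
per_team = group_agents(agents, team_mapping)       # two training agents
```

With the identity mapping each player is trained separately, while the team mapping makes players of the same team share one training agent.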
  • 'List' from 'malib.utils.typing'

    Hi, I am trying to run a basic MARL setup using MAPPO.

    Here's my YAML config file:

    name: "mappo_payload_carry"
        type: "centralized"
        population_size: -1
        # control the frequency of remote parameter update
        update_interval: 1
        saving_interval: 100
        batch_size: 32
        optimizer: "Adam"
        actor_lr: 5.e-4
        critic_lr: 5.e-4
        opti_eps: 1.e-5
        weight_decay: 0.0
      type: "async"
      stopper: "simple_rollout"
        max_step: 10000
      metric_type: "simple"
      fragment_length: 100
      num_episodes: 4
      episode_seg: 1
      terminate: "any"
      num_env_per_worker: 1
        - copy_next_frame
      #  scenario_name: "simple_spread"
      creator: "Gym"
        env_id: "urdf-env-v0"
        name: "MAPPO"
            use_orthogonal: True
            gain: 1.
            network: mlp
              - units: 256
                activation: ReLU
              - units: 128
                activation: ReLU
              - units: 64
                activation: ReLU
              activation: False
            network: mlp
              - units: 256
                activation: ReLU
              - units: 128
                activation: ReLU
              - units: 64
                activation: ReLU
              activation: False
        # set hyper parameter
          gamma: 0.99
          use_cuda: False  # enable cuda or not
          use_q_head: False
          ppo_epoch: 4
          num_mini_batch: 1  # the number of mini-batches
          return_mode: gae
            gae_lambda: 0.95
            clip_rho_threshold: 1.0
            clip_pg_rho_threshold: 1.0
          use_rnn: False
          # this is not used, instead it is fixed to last hidden in actor/critic
          rnn_layer_num: 1
          rnn_data_chunk_length: 16
          use_feature_normalization: True
          use_popart: True
          popart_beta: 0.99999
          entropy_coef: 1.e-2
      name: "generic"
      episode_capacity: 100
      fragment_length: 3001
    I have a custom environment, which I create as follows:
    env = gym.make("urdf-env-v0", dt=0.01, robots=robots, render=render)
    possible_agents = env.possible_agents
    action_spaces = env.possible_actions
    observation_spaces = env.observation_spaces
    env_desc = {"creator": env, "possible_agents": possible_agents, "action_spaces": action_spaces, "observation_spaces": observation_spaces}
        agent_mapping_func=lambda agent: agent[
                                         ],  # e.g. "team_0_player_0" -> "team_0"
        evaluation=config.get("evaluation", {}),
        dataset_config=config.get("dataset_config", {}),
        parameter_server=config.get("parameter_server", {}),
        # worker_config=config["worker_config"],

    I checked whether malib.utils.typing defines the List and Dict types, but they do not seem to exist there. How do I fix this?
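One possible workaround, assuming the failure is an `ImportError` raised by a line such as `from malib.utils.typing import List, Dict` in your own script: fall back to the standard-library `typing` module, which defines both names. This is a sketch rather than a confirmed fix, and `build_env_desc` is a hypothetical helper, not part of MALib.

```python
from typing import Any

try:
    # Works only if this MALib version re-exports the names.
    from malib.utils.typing import Dict, List
except ImportError:
    # The standard library always provides them.
    from typing import Dict, List


def build_env_desc(
    possible_agents: List[str],
    observation_spaces: Dict[str, Any],
    action_spaces: Dict[str, Any],
) -> Dict[str, Any]:
    """Collect environment metadata into one dictionary (hypothetical helper)."""
    return {
        "possible_agents": list(possible_agents),
        "observation_spaces": dict(observation_spaces),
        "action_spaces": dict(action_spaces),
    }


desc = build_env_desc(["agent_0"], {"agent_0": "obs"}, {"agent_0": "act"})
```

If the failing import sits inside MALib itself, this does not apply; in that case the installed MALib version and the example being run are probably out of sync.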

  • 3s5z in PyMARL quickly reaches a win_rate of 1; where is the SMAC config for malib?

    Thanks for this nice repo. I'm interested in MARL for SMAC, and I have some questions about this repo.

    1. On page 18 of your arXiv paper, you mention that "For the scenario 3s5z, however, both of MALib and PyMARL cannot reach 80% win rate." However, when I ran the original PyMARL code with its default config (python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=3s5z), the win rate quickly reached 1 for both SMAC versions (4.10 and 4.6.2).

    2. I would like to use this repo to run SMAC, but I can't find the corresponding config in the examples. Will this part be open-sourced? Thank you.

  • [Question] How to debug `malib` (infinite loop when running `ray` in local mode)

    Hi, I'm trying to run the PSRO algorithm with the quick start example on https://malib.io (PSRO PPO with leduc_holdem.env). I get the following errors:

    2022-03-27 23:04:33,013    ERROR worker.py:79 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::RolloutWorker.simulation() (pid=185570, ip=, repr=<malib.rollout.rollout_worker.RolloutWorker object at 0x7f1c52c37490>)
      File "/home/panxuehai/Projects/malib/malib/rollout/base_worker.py", line 441, in simulation
        raw_statistics, num_frames = self.sample(
      File "/home/panxuehai/Projects/malib/malib/rollout/rollout_worker.py", line 161, in sample
        for ret in rets:
      File "/home/panxuehai/Miniconda3/envs/malib/lib/python3.8/site-packages/ray/util/actor_pool.py", line 65, in map
        yield self.get_next()
      File "/home/panxuehai/Miniconda3/envs/malib/lib/python3.8/site-packages/ray/util/actor_pool.py", line 178, in get_next
        return ray.get(future)
    ray.exceptions.RayTaskError(TypeError): ray::Stepping.run() (pid=185523, ip=, repr=<malib.rollout.rollout_func.Stepping object at 0x7f8814a20dc0>)
      File "/home/panxuehai/Projects/malib/malib/rollout/rollout_func.py", line 431, in run
        rollout_results = env_runner(
      File "/home/panxuehai/Projects/malib/malib/rollout/rollout_func.py", line 243, in env_runner
        rets = env.reset(
      File "/home/panxuehai/Projects/malib/malib/envs/vector_env.py", line 192, in reset
        _ret = env.reset(max_step=max_step, custom_reset_config=custom_reset_config)
    TypeError: reset() got an unexpected keyword argument 'max_step'

    Then I'd debug this myself with ray.init(local_mode=True) in my debugger. All Ray actors will run sequentially when the "local mode" is on.


    Local Mode

    By default, Ray will parallelize its workload and run tasks on multiple processes and multiple nodes. However, if you need to debug your Ray program, it may be easier to do everything on a single process. You can force all Ray functions to occur on a single process by enabling local mode.

    The program gets stuck in an infinite loop here:


    I wonder what's the best practice for debugging with malib? Thanks very much!
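The `TypeError` in the traceback above says that `env.reset()` was called with a `max_step` keyword that the environment's `reset()` does not accept. The pure-Python sketch below reproduces that mismatch and shows one way to absorb the extra keywords with a thin adapter. `LegacyEnv` and `ResetAdapter` are hypothetical names, not MALib classes, and whether an adapter is the right fix for a given MALib version is an assumption.

```python
from typing import Any, Dict, Optional


class LegacyEnv:
    """Stand-in for an environment whose reset() takes no keyword arguments."""

    def reset(self) -> Dict[str, int]:
        return {"player_0": 0, "player_1": 0}


class ResetAdapter:
    """Accepts the keywords seen in the traceback and forwards a plain reset()."""

    def __init__(self, env: LegacyEnv) -> None:
        self._env = env
        self.max_step: Optional[int] = None
        self.custom_reset_config: Dict[str, Any] = {}

    def reset(
        self,
        max_step: Optional[int] = None,
        custom_reset_config: Optional[Dict[str, Any]] = None,
    ) -> Dict[str, int]:
        # Remember the settings so a step limit could be enforced elsewhere.
        self.max_step = max_step
        self.custom_reset_config = custom_reset_config or {}
        return self._env.reset()


# Calling the bare environment with the extra keyword fails as in the issue.
try:
    LegacyEnv().reset(max_step=10)
    error_message = ""
except TypeError as exc:
    error_message = str(exc)

# The adapter accepts both keywords and returns the initial observations.
wrapped = ResetAdapter(LegacyEnv())
observations = wrapped.reset(max_step=10, custom_reset_config=None)
```

Reproducing the call outside Ray in this way keeps the failure in a single process, which makes it easier to step through in a debugger.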

  • How to run on a cluster or across multiple machines?

    Dear MALib support,

    My questions are as follows:

    1. Can MALib run on a cluster or across multiple machines? If so, how should the config be set?
    2. When running on a single machine, how do I configure the number of agents?

    Thank you.

  • Config for SMAC QMIX/MADDPG

    This is really nice work, and according to your paper, running QMIX/MADDPG is really fast. However, we could not find a config for the QMIX/MADDPG algorithms, so we don't know how to run them. Could you give us a config for QMIX/MADDPG?

  • A question for quick start

    Thanks for the nice work. I'd like to ask for some help with the quick start. First, could you please give some instructions for running single-agent RL in a Gym-based env, such as CartPole? I tried to run the file run_gym.py and load the YAML config, but it doesn't work. Second, do we have to install OpenSpiel, or can we skip it?

    Look forward to hearing from you. Cheers~

    Best, Yutong

  • How to deploy malib on a GPU cluster server

    I have a login node and four compute nodes. Does malib need to be deployed on each node? Should jobs be submitted from the login node? Is there a detailed example of cluster usage? Looking forward to your help.

  • A Minor Error in Quick Start Demo Code

    In the Quick Start section of the README:

            "creator": leduc_holdem.env,
            "config": {"scenario_configs": {"fixed_player": True}, "env_id": "leduc_holdem"}
            "possible_agents": env.possible_agents,

    A comma is missing at the end of the line "config": {"scenario_configs": {"fixed_player": True}, "env_id": "leduc_holdem"}
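Why the comma matters can be checked without MALib: Python rejects a dict literal whose entries are not separated by commas. The sketch below parses the reported fragment with `ast.parse`, before and after adding the comma. The string `"leduc_holdem.env"` and the agent names `"player_0"` and `"player_1"` are stand-ins for the real objects, chosen only so that the snippet is self-contained.

```python
import ast

# The fragment as reported, with stand-in values; note the missing comma
# at the end of the "config" line.
BROKEN = '''{
    "creator": "leduc_holdem.env",
    "config": {"scenario_configs": {"fixed_player": True}, "env_id": "leduc_holdem"}
    "possible_agents": ["player_0", "player_1"],
}'''

# Same fragment with the comma added after the "config" entry.
FIXED = BROKEN.replace('"leduc_holdem"}\n', '"leduc_holdem"},\n')


def parses(source: str) -> bool:
    """Return True if the source is a syntactically valid Python expression."""
    try:
        ast.parse(source, mode="eval")
        return True
    except SyntaxError:
        return False


broken_ok = parses(BROKEN)
fixed_ok = parses(FIXED)
env_description = ast.literal_eval(FIXED)
```

Only the fixed fragment parses, so the README snippet cannot run until the comma is added.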

  • Turn based env

    • refactor the open_spiel env
    • remove the async vector env implementation
    • turn-based rollout is supported
    • PSRO and MARL scenarios have been tested
    • remove third-party replay
    • refactor episode collecting and sending
  • How to use malib to roll out a trained model?

    Excuse me, when training is done, where is the model saved, and how can I use it for rollouts? Also, how can I replay the training process? Can we render environments such as MPE and MAgent?

  • Cannot run the examples on GPU

    When I run examples such as psro_poker and maddpg_mpe with the config option "use_cuda" set to True, I get the following error:

    2022-04-20 19:30:20,818 ERROR worker.py:80 -- Unhandled error (suppress with RAY_IGNORE_UNHANDLED_ERRORS=1): ray::RolloutWorker.simulation() (pid=32314, ip=, repr=<matf.rollout.rollout_worker.RolloutWorker object at 0x7f083bb10fd0>)
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/rollout/base_worker.py", line 447, in simulation
    (pid=32320)     role="simulation",
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/rollout/rollout_worker.py", line 161, in sample
    (pid=32320)     for ret in rets:
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/ray/util/actor_pool.py", line 65, in map
    (pid=32320)     yield self.get_next()
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/ray/util/actor_pool.py", line 178, in get_next
    (pid=32320)     return ray.get(future)
    (pid=32320) ray.exceptions.RayTaskError(RuntimeError): ray::Stepping.run() (pid=32309, ip=, repr=<matf.rollout.rollout_func.Stepping object at 0x7f243ce38190>)
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/rollout/rollout_func.py", line 447, in run
    (pid=32320)     dataset_server=self._dataset_server if task_type == "rollout" else None,
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/rollout/rollout_func.py", line 276, in env_runner
    (pid=32320)     active_policy_inputs, agent_interfaces, episodes
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/rollout/rollout_func.py", line 155, in _do_policy_eval
    (pid=32320)     ) = interface.compute_action(**inputs) # compute actions from each env_agent_id's situation info, rnn_state, and done
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/envs/agent_interface.py", line 268, in compute_action
    (pid=32320)     rets = self.policies[policy_id].compute_action(*args, **kwargs)
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/algorithm/dqn/policy.py", line 88, in compute_action
    (pid=32320)     logits = torch.softmax(self.critic(observation), dim=-1)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    (pid=32320)     result = self.forward(*input, **kwargs)
    (pid=32320)   File "/home/qianmd/work/test/matf/matf/algorithm/common/model.py", line 106, in forward
    (pid=32320)     pi = self.net(obs)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    (pid=32320)     result = self.forward(*input, **kwargs)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/modules/container.py", line 100, in forward
    (pid=32320)     input = module(input)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/modules/module.py", line 532, in __call__
    (pid=32320)     result = self.forward(*input, **kwargs)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/modules/linear.py", line 87, in forward
    (pid=32320)     return F.linear(input, self.weight, self.bias)
    (pid=32320)   File "/home/qianmd/anaconda3/envs/malib/lib/python3.7/site-packages/torch/nn/functional.py", line 1370, in linear
    (pid=32320)     ret = torch.addmm(bias, input, weight.t())
    (pid=32320) RuntimeError: Expected object of device type cuda but got device type cpu for argument #2 'mat1' in call to _th_addmm
    (pid=32314) Exception ignored in: 'ray._raylet.task_execution_handler'

  • Performance Results

    Throughput Comparison

    All the experiment results listed were obtained with one of the following hardware settings: (1) System #1: a 32-core computing node with dual graphics cards. (2) System #2: a two-node cluster in which each node has 128 cores and a single graphics card. All the GPUs mentioned are of the same model (NVIDIA RTX 3090).

    Throughput comparison among the existing RL frameworks and MALib. Due to resource limitations (32 cores, 256 GB RAM), RLlib fails under heavy loads (CPU case: #workers > 32; GPU case: #workers > 8). MALib outperforms the other frameworks with CPU only, and with GPU it achieves performance comparable to the highly tailored framework Sample-Factory despite the higher level of abstraction it introduces. To better illustrate the scalability of MALib, we show the MA-Atari and SC2 throughput on System #2 under different worker settings; the 512-worker group on SC2 fails due to resource limitations.


    Additional comparisons between MALib and other distributed RL training frameworks. (Left): System #3 cluster throughput of MALib in 2-player MA-Atari and 3-player SC2. (Middle): 4-player MA-Atari throughput comparison on System #1 without GPU. (Right): 4-player MA-Atari throughput comparison on System #1 with GPU.


    Wall-time & Performance of PB-MARL Algorithm

    Comparisons of PSRO between MALib and OpenSpiel. (a) indicates that MALib achieves the same exploitability as OpenSpiel; (b) shows that MALib converges 3x faster than OpenSpiel; (c) shows that MALib achieves higher execution efficiency than OpenSpiel, since it takes less time to run the same number of learning steps, which means MALib has the potential to scale to more complex tasks that require many more steps.


    Typical MARL Algorithms

    Results on Multi-agent Particle Environments

    Comparisons of MADDPG in simple adversary under different rollout worker settings. Figures in the top row depict each agent's episode reward w.r.t. the number of sampled episodes, which indicates that MALib converges faster than RLlib for an equal number of sampled episodes. Figures in the bottom row show the average time and average episode reward at the same number of sampled episodes, which indicates that MALib achieves a 5x speedup over RLlib.


    Scenario Crypto


    Simple Push


    Simple Reference


    Simple Speaker Listener


    Simple Tag


