```diff
- If you want high sample efficiency, please use qmix_high_sample_efficiency.yaml,
- which uses 4 processes for training: slower, but with higher sample efficiency.
- Performance is *not* comparable between models trained with different numbers of processes.
```

# PyMARL2

Open-source code for [Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2102.03479).

This repository is fine-tuned for the StarCraft Multi-Agent Challenge (SMAC). For other multi-agent tasks, we also recommend the optimized QMIX implementation at https://github.com/marlbenchmark/off-policy.

**StarCraft 2 version: SC2.4.10. Difficulty: 7.**

```
2022.10.10 update: added qmix_high_sample_efficiency.yaml, which uses 4 processes for training: slower, but with higher sample efficiency.

2021.10.28 update: added the Google Football environment (vdn_gfootball.yaml), which uses `simple115` features.

2021.10.4 update: added QMIX with attention (qmix_att.yaml) as a baseline for communication tasks.
```

## Finetuned-QMIX

There are many code-level tricks in multi-agent reinforcement learning (MARL), such as the following (a short sketch illustrating two of them appears right after the list):

- Value function clipping (clip the max Q values for QMIX)
- Value normalization
- Reward scaling
- Orthogonal initialization and layer scaling
- **Adam**
- **Neural network hidden size**
- Learning rate annealing
- Reward clipping
- Observation normalization
- Gradient clipping
- **Large batch size**
- **N-step returns (including GAE($\lambda$), Q($\lambda$), ...)**
- **Number of rollout processes**
- **$\epsilon$-greedy annealing steps**
- Death agent masking
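
As a rough, hypothetical PyTorch sketch of two of the bolded tricks (value function clipping with Adam, and gradient clipping), consider the following; all names are illustrative and are not taken from this repository's code.

```python
import torch

# Illustrative sketch only: the function and variable names below are
# hypothetical, not identifiers from this repository.

def clipped_td_target(rewards, q_total_next, gamma=0.99, q_clip_max=30.0):
    """One-step TD target with the bootstrapped Q value clipped (value function clipping)."""
    q_total_next = q_total_next.clamp(max=q_clip_max)  # clip max Q values
    return rewards + gamma * q_total_next

def optimizer_step(optimizer, loss, params, grad_norm_clip=10.0):
    """Adam update with gradient-norm clipping."""
    optimizer.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(params, grad_norm_clip)  # gradient clipping
    optimizer.step()

# Usage sketch (hypothetical names):
# params = list(agent.parameters()) + list(mixer.parameters())
# optimizer = torch.optim.Adam(params, lr=1e-3)  # Adam rather than RMSProp
```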

**Related Works**

- Implementation Matters in Deep RL: A Case Study on PPO and TRPO
- What Matters In On-Policy Reinforcement Learning? A Large-Scale Empirical Study
- The Surprising Effectiveness of MAPPO in Cooperative, Multi-Agent Games

Using a few of the tricks above (the bold items), we enabled QMIX (qmix.yaml) to solve almost all of the hard SMAC scenarios, with hyperparameters fine-tuned **for each scenario**. An example command for reproducing the 6h_vs_8z setting is given after the table.

| Scenarios | Difficulty | QMIX (batch_size=128) | Finetuned-QMIX |
| ------------ | :--------: | :-------------------: | :------------------------------------------------: |
| 8m | Easy | - | **100\%** |
| 2c_vs_1sc | Easy | - | **100\%** |
| 2s3z | Easy | - | **100\%** |
| 1c3s5z | Easy | - | **100\%** |
| 3s5z | Easy | - | **100\%** |
| 8m_vs_9m | Hard | 84% | **100\%** |
| 5m_vs_6m | Hard | 84% | **90\%** |
| 3s_vs_5z | Hard | 96% | **100\%** |
| bane_vs_bane | Hard | **100\%** | **100\%** |
| 2c_vs_64zg | Hard | **100\%** | **100\%** |
| corridor | Super Hard | 0% | **100\%** |
| MMM2 | Super Hard | 98% | **100\%** |
| 3s5z_vs_3s6z | Super Hard | 3% | **93\%** (hidden_size=256, qmix_large.yaml) |
| 27m_vs_30m | Super Hard | 56% | **100\%** |
| 6h_vs_8z | Super Hard | 0% | **93\%** ($\lambda=0.3$, epsilon_anneal_time=500000) |
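
For example, the fine-tuned 6h_vs_8z setting in the last row can be reproduced by overriding the corresponding hyperparameters on the command line; the override names below match the run.sh example in the Usage section, with $\lambda$ corresponding to `td_lambda`.

```shell
# Reproduce the fine-tuned 6h_vs_8z setting (see the Usage section for the general command format)
python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=6h_vs_8z epsilon_anneal_time=500000 td_lambda=0.3
```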

## Re-Evaluation

Afterwards, we re-evaluated numerous QMIX variants with the tricks normalized (a single **general** set of hyperparameters), and found that QMIX achieves state-of-the-art performance among them.

QMIX, VDNs, Qatten, QPLEX, and WQMIX are value-based methods; LICA, VMIX, DOP, and RIIT are policy-based.

| Scenarios | Difficulty | QMIX | VDNs | Qatten | QPLEX | WQMIX | LICA | VMIX | DOP | RIIT |
| ------------ | :--------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: | :------: |
| 2c_vs_64zg | Hard | **100%** | **100%** | **100%** | **100%** | **100%** | **100%** | 98% | 84% | **100%** |
| 8m_vs_9m | Hard | **100%** | **100%** | **100%** | 95% | 95% | 48% | 75% | 96% | 95% |
| 3s_vs_5z | Hard | **100%** | **100%** | **100%** | **100%** | **100%** | 96% | 96% | **100%** | 96% |
| 5m_vs_6m | Hard | **90%** | **90%** | **90%** | **90%** | **90%** | 53% | 9% | 63% | 67% |
| 3s5z_vs_3s6z | S-Hard | **75%** | 43% | 62% | 68% | 56% | 0% | 56% | 0% | **75%** |
| corridor | S-Hard | **100%** | 98% | **100%** | 96% | 96% | 0% | 0% | 0% | **100%** |
| 6h_vs_8z | S-Hard | 84% | **87%** | 82% | 78% | 75% | 4% | 80% | 0% | 19% |
| MMM2 | S-Hard | **100%** | 96% | **100%** | **100%** | 96% | 0% | 70% | 3% | **100%** |
| 27m_vs_30m | S-Hard | **100%** | **100%** | **100%** | **100%** | **100%** | 9% | 93% | 0% | 93% |
| Discrete PP | - | **40** | 39 | - | 39 | 39 | 30 | 39 | 38 | 38 |
| Avg. Score | Hard+ | **94.9%** | 91.2% | 92.7% | 92.5% | 90.5% | 29.2% | 67.4% | 44.1% | 84.0% |

## Communication

We also tested our QMIX-with-attention (qmix_att.yaml, $\lambda=0.3$, attention\_heads=4) on some maps (from [NDQ](https://github.com/TonghanWang/NDQ)) that require communication.

| Scenarios (2M steps) | Difficulty | Finetuned-QMIX (No Communication) | QMIX-with-attention (Communication) |
| --------------------- | :--------: | :-------------------------------: | :----------------------------------: |
| 1o_10b_vs_1r | - | 56% | **87\%** |
| 1o_2r_vs_4r | - | 50% | **95\%** |
| bane_vs_hM | - | 0% | **0\%** |

## Google Football

We also tested VDN (vdn_gfootball.yaml) on some scenarios from [Google Football](https://github.com/google-research/football). Specifically, we use the `simple115` feature representation to train the model (the original Google Football paper uses more complex CNN features). We did not test QMIX because this environment does not provide global state information. A minimal environment-creation sketch follows the results table below.

| Scenarios | Difficulty | VDN ($\lambda=1.0$) |
| -------------------------- | :--------: | :-------------------: |
| academy_counterattack_hard | - | 0.71 (Test Score) |
| academy_counterattack_easy | - | 0.87 (Test Score) |
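
For reference, the snippet below is a minimal sketch of creating a Google Football scenario with the flat `simple115`-style observations via the `gfootball` package (the `simple115v2` variant is assumed here); it only illustrates the representation choice and is not the environment wrapper used in this repository.

```python
import gfootball.env as football_env

# Minimal sketch: flat 115-dimensional feature vectors per controlled player,
# instead of the CNN-style pixel/SMM representations from the original paper.
env = football_env.create_environment(
    env_name="academy_counterattack_hard",
    representation="simple115v2",               # flat feature vector
    number_of_left_players_agent_controls=4,    # matches env_args.num_agents=4 below
)

obs = env.reset()                                 # shape (4, 115): one vector per agent
obs, reward, done, info = env.step([0, 0, 0, 0])  # one discrete action per agent
```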

# Usage

PyMARL is [WhiRL](http://whirl.cs.ox.ac.uk)'s framework for deep multi-agent reinforcement learning and includes implementations of the following algorithms:

Value-based Methods:

- [**QMIX**: Monotonic Value Function Factorisation for Deep Multi-Agent Reinforcement Learning](https://arxiv.org/abs/1803.11485)
- [**VDN**: Value-Decomposition Networks For Cooperative Multi-Agent Learning](https://arxiv.org/abs/1706.05296)
- [**IQL**: Independent Q-Learning](https://arxiv.org/abs/1511.08779)
- [**QTRAN**: Learning to Factorize with Transformation for Cooperative Multi-Agent Reinforcement Learning](https://arxiv.org/abs/1905.05408)
- [**Qatten**: A General Framework for Cooperative Multiagent Reinforcement Learning](https://arxiv.org/abs/2002.03939)
- [**QPLEX**: Duplex Dueling Multi-Agent Q-Learning](https://arxiv.org/abs/2008.01062)
- [**WQMIX**: Weighted QMIX: Expanding Monotonic Value Function Factorisation](https://arxiv.org/abs/2006.10800)

Actor-Critic Methods:

- [**COMA**: Counterfactual Multi-Agent Policy Gradients](https://arxiv.org/abs/1705.08926)
- [**VMIX**: Value-Decomposition Multi-Agent Actor-Critics](https://arxiv.org/abs/2007.12306)
- [**LICA**: Learning Implicit Credit Assignment for Cooperative Multi-Agent Reinforcement Learning](https://arxiv.org/abs/2007.02529)
- [**DOP**: Off-Policy Multi-Agent Decomposed Policy Gradients](https://arxiv.org/abs/2007.12322)
- [**RIIT**: Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning.](https://arxiv.org/abs/2102.03479)

## Installation instructions

Install Python packages

```shell
# requires Anaconda 3 or Miniconda 3
conda create -n pymarl python=3.8 -y
conda activate pymarl

bash install_dependecies.sh
```

Set up StarCraft II (2.4.10) and SMAC:

```shell
bash install_sc2.sh
```

This will download SC2.4.10 into the `3rdparty` folder and copy the maps required to run the experiments.

Set up Google Football:

```shell
bash install_gfootball.sh
```

## Command Line Tool

**Run an experiment**

```shell
# For SMAC
python3 src/main.py --config=qmix --env-config=sc2 with env_args.map_name=corridor
```

```shell
# For Difficulty-Enhanced Predator-Prey
python3 src/main.py --config=qmix_predator_prey --env-config=stag_hunt with env_args.map_name=stag_hunt
```

```shell
# For Communication tasks
python3 src/main.py --config=qmix_att --env-config=sc2 with env_args.map_name=1o_10b_vs_1r
```

```shell
# For Google Football (insufficiently tested)
# map_name: academy_counterattack_easy, academy_counterattack_hard, five_vs_five...
python3 src/main.py --config=vdn_gfootball --env-config=gfootball with env_args.map_name=academy_counterattack_hard env_args.num_agents=4
```

The config files act as defaults for an algorithm or environment.

They are all located in `src/config`:

- `--config` refers to the config files in `src/config/algs`.
- `--env-config` refers to the config files in `src/config/envs`.

**Run n parallel experiments**

```shell
# bash run.sh config_name env_config_name map_name_list (arg_list threads_num gpu_list experiments_num)
bash run.sh qmix sc2 6h_vs_8z epsilon_anneal_time=500000,td_lambda=0.3 2 0 5
```

Each `_list` argument (e.g. `map_name_list`, `arg_list`) is comma-separated.

All results will be stored in the `Results` folder and named with `map_name`.

**Kill all training processes**

```shell
# All Python and game processes of the current user will be killed.
bash clean.sh
```

# Citation

```bibtex
@article{hu2021rethinking,
  title={Rethinking the Implementation Tricks and Monotonicity Constraint in Cooperative Multi-Agent Reinforcement Learning},
  author={Jian Hu and Siyang Jiang and Seth Austin Harding and Haibin Wu and Shih-wei Liao},
  year={2021},
  eprint={2102.03479},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}
```