{"id":13528724,"url":"https://github.com/kwai/DouZero","last_synced_at":"2025-04-01T14:32:54.089Z","repository":{"id":37684315,"uuid":"373010014","full_name":"kwai/DouZero","owner":"kwai","description":"[ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning | 斗地主AI","archived":false,"fork":false,"pushed_at":"2024-06-26T23:02:16.000Z","size":189,"stargazers_count":4116,"open_issues_count":30,"forks_count":595,"subscribers_count":53,"default_branch":"main","last_synced_at":"2024-10-29T15:37:52.289Z","etag":null,"topics":["doudizhu","game-ai","poker","reinforcement-learning"],"latest_commit_sha":null,"homepage":"https://douzero.org/","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/kwai.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2021-06-02T01:55:56.000Z","updated_at":"2024-10-28T12:06:20.000Z","dependencies_parsed_at":"2024-06-27T02:52:19.984Z","dependency_job_id":null,"html_url":"https://github.com/kwai/DouZero","commit_stats":{"total_commits":82,"total_committers":8,"mean_commits":10.25,"dds":"0.24390243902439024","last_synced_commit":"d731ca2ca507f2a53d6ca19a2acdb0b284046d0c"},"previous_names":[],"tags_count":3,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwai%2FDouZero","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwai%2FDouZero/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/kwai%2FDouZero/releases","manifests_url":"https://repos.ecosyste.m
s/api/v1/hosts/GitHub/repositories/kwai%2FDouZero/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/kwai","download_url":"https://codeload.github.com/kwai/DouZero/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":246655230,"owners_count":20812603,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["doudizhu","game-ai","poker","reinforcement-learning"],"created_at":"2024-08-01T07:00:23.332Z","updated_at":"2025-04-01T14:32:53.164Z","avatar_url":"https://github.com/kwai.png","language":"Python","funding_links":[],"categories":["Environments","Python","时间序列","Open-Source Projects"],"sub_categories":["网络服务_其他","Dou Dizhu Projects"],"readme":"# [ICML 2021] DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning\n\u003cimg width=\"500\" src=\"https://raw.githubusercontent.com/kwai/DouZero/main/imgs/douzero_logo.jpg\" alt=\"Logo\" /\u003e\n\n[![Building](https://github.com/kwai/DouZero/actions/workflows/python-package.yml/badge.svg)](https://github.com/kwai/DouZero/actions/workflows/python-package.yml)\n[![PyPI version](https://badge.fury.io/py/douzero.svg)](https://badge.fury.io/py/douzero)\n[![Downloads](https://pepy.tech/badge/douzero)](https://pepy.tech/project/douzero)\n[![Downloads](https://pepy.tech/badge/douzero/month)](https://pepy.tech/project/douzero)\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Open In 
Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/daochenzha/douzero-colab/blob/main/douzero-colab.ipynb)\n\n[中文文档](README.zh-CN.md)\n\nDouZero is a reinforcement learning framework for [DouDizhu](https://en.wikipedia.org/wiki/Dou_dizhu) ([斗地主](https://baike.baidu.com/item/%E6%96%97%E5%9C%B0%E4%B8%BB/177997)), the most popular card game in China. It is a shedding-type game where the player’s objective is to empty one’s hand of all cards before other players. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary significantly from turn to turn. DouZero is developed by AI Platform, Kwai Inc. (快手).\n\n*   Online Demo: [https://www.douzero.org/](https://www.douzero.org/)\n       * :loudspeaker: New Version with Bid（叫牌版本）: [https://www.douzero.org/bid](https://www.douzero.org/bid)\n*   Run the Demo Locally: [https://github.com/datamllab/rlcard-showdown](https://github.com/datamllab/rlcard-showdown)\n*   Video: [YouTube](https://youtu.be/inHIi8sej7Y)\n*   Paper: [https://arxiv.org/abs/2106.06135](https://arxiv.org/abs/2106.06135) \n*   Related Project: [RLCard Project](https://github.com/datamllab/rlcard)\n*   Related Resources: [Awesome-Game-AI](https://github.com/datamllab/awesome-game-ai)\n*   Google Colab: [jupyter notebook](https://github.com/daochenzha/douzero-colab/blob/main/douzero-colab.ipynb)\n*   Unofficial improved versions of DouZero by the community: [[DouZero ResNet]](https://github.com/Vincentzyx/Douzero_Resnet) [[DouZero FullAuto]](https://github.com/Vincentzyx/DouZero_For_HLDDZ_FullAuto)\n*   Zhihu: [https://zhuanlan.zhihu.com/p/526723604](https://zhuanlan.zhihu.com/p/526723604)\n*   Miscellaneous Resources:\n\t*   Check out our open-sourced [Large Time Series Model (LTSM)](https://github.com/daochenzha/ltsm)!\n\t*   Have you heard of data-centric AI? 
Please check out our [data-centric AI survey](https://arxiv.org/abs/2303.10158) and [awesome data-centric AI resources](https://github.com/daochenzha/data-centric-AI)!\n\n**Community:**\n*  **Slack**: Discuss in [DouZero](https://join.slack.com/t/douzero/shared_invite/zt-rg3rygcw-ouxxDk5o4O0bPZ23vpdwxA) channel.\n*  **QQ Group**: Join our QQ group to discuss. Password: douzeroqqgroup\n\n\t*  Group 1: 819204202\n\t*  Group 2: 954183174\n\t*  Group 3: 834954839\n\t*  Group 4: 211434658\n\t*  Group 5: 189203636\n\n**News:**\n*   Thanks for the contribution of [@Vincentzyx](https://github.com/Vincentzyx) for enabling CPU training. Now Windows users can train with CPUs.\n\n\u003cimg width=\"500\" src=\"https://douzero.org/public/demo.gif\" alt=\"Demo\" /\u003e\n\n## Cite this Work\nIf you find this project helpful in your research, please cite our paper:\n\nZha, Daochen et al. “DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning.” ICML (2021).\n\n```bibtex\n@InProceedings{pmlr-v139-zha21a,\n  title = \t {DouZero: Mastering DouDizhu with Self-Play Deep Reinforcement Learning},\n  author =       {Zha, Daochen and Xie, Jingru and Ma, Wenye and Zhang, Sheng and Lian, Xiangru and Hu, Xia and Liu, Ji},\n  booktitle = \t {Proceedings of the 38th International Conference on Machine Learning},\n  pages = \t {12333--12344},\n  year = \t {2021},\n  editor = \t {Meila, Marina and Zhang, Tong},\n  volume = \t {139},\n  series = \t {Proceedings of Machine Learning Research},\n  month = \t {18--24 Jul},\n  publisher =    {PMLR},\n  pdf = \t {http://proceedings.mlr.press/v139/zha21a/zha21a.pdf},\n  url = \t {http://proceedings.mlr.press/v139/zha21a.html},\n  abstract = \t {Games are abstractions of the real world, where artificial agents learn to compete and cooperate with other agents. While significant achievements have been made in various perfect- and imperfect-information games, DouDizhu (a.k.a. 
Fighting the Landlord), a three-player card game, is still unsolved. DouDizhu is a very challenging domain with competition, collaboration, imperfect information, large state space, and particularly a massive set of possible actions where the legal actions vary significantly from turn to turn. Unfortunately, modern reinforcement learning algorithms mainly focus on simple and small action spaces, and not surprisingly, are shown not to make satisfactory progress in DouDizhu. In this work, we propose a conceptually simple yet effective DouDizhu AI system, namely DouZero, which enhances traditional Monte-Carlo methods with deep neural networks, action encoding, and parallel actors. Starting from scratch in a single server with four GPUs, DouZero outperformed all the existing DouDizhu AI programs in days of training and was ranked the first in the Botzone leaderboard among 344 AI agents. Through building DouZero, we show that classic Monte-Carlo methods can be made to deliver strong results in a hard domain with a complex action space. The code and an online demo are released at https://github.com/kwai/DouZero with the hope that this insight could motivate future work.}\n}\n```\n\n## What Makes DouDizhu Challenging?\nIn addition to the challenge of imperfect information, DouDizhu has huge state and action spaces. In particular, the action space of DouDizhu is 10^4 (see [this table](https://github.com/datamllab/rlcard#available-environments)). Unfortunately, most reinforcement learning algorithms can only handle very small action spaces. Moreover, the players in DouDizhu need to both compete and cooperate with others in a partially-observable environment with limited communication, i.e., two Peasants players will play as a team to fight against the Landlord player. Modeling both competing and cooperation is an open research challenge.\n\nIn this work, we propose Deep Monte Carlo (DMC) algorithm with action encoding and parallel actors. 
This leads to a very simple yet surprisingly effective solution for DouDizhu. Please read [our paper](https://arxiv.org/abs/2106.06135) for more details.\n\n## Installation\nThe training code is designed for GPUs. Thus, you need to first install CUDA if you want to train models. You may refer to [this guide](https://docs.nvidia.com/cuda/index.html#installation-guides). For evaluation, CUDA is optional; you can evaluate on CPU.\n\nFirst, clone the repo with (if you are in China and GitHub is slow, you can use the mirror in [Gitee](https://gitee.com/daochenzha/DouZero)):\n```\ngit clone https://github.com/kwai/DouZero.git\n```\nMake sure you have Python 3.6+ installed. Install dependencies:\n```\ncd DouZero\npip3 install -r requirements.txt\n```\nWe recommend installing the stable version of DouZero with\n```\npip3 install douzero\n```\nIf you are in China and the above command is too slow, you can use the mirror provided by Tsinghua University:\n```\npip3 install douzero -i https://pypi.tuna.tsinghua.edu.cn/simple\n```\nor install the up-to-date version (which may be unstable) with\n```\npip3 install -e .\n```\nNote that Windows users can only use CPU as actors. See [Issues in Windows](README.md#issues-in-windows) about why GPUs are not supported. Nonetheless, Windows users can still [run the demo locally](https://github.com/datamllab/rlcard-showdown).  \n\n## Training\nTo use GPU for training, run\n```\npython3 train.py\n```\nThis will train DouZero on one GPU. To train DouZero on multiple GPUs, 
use the following arguments:\n*   `--gpu_devices`: which GPU devices are visible\n*   `--num_actor_devices`: how many of the GPU devices will be used for simulation, i.e., self-play\n*   `--num_actors`: how many actor processes will be used for each device\n*   `--training_device`: which device will be used for training DouZero\n\nFor example, with 4 GPUs, to run 15 actors on each of the first 3 GPUs for simulation and use the 4th GPU for training, run the following command:\n```\npython3 train.py --gpu_devices 0,1,2,3 --num_actor_devices 3 --num_actors 15 --training_device 3\n```\nTo use CPU for training or simulation (Windows can only use CPU for actors), use the following arguments:\n*   `--training_device cpu`: Use CPU to train the model\n*   `--actor_device_cpu`: Use CPU as actors\n\nFor example, use the following command to run everything on CPU:\n```\npython3 train.py --actor_device_cpu --training_device cpu\n```\nThe following command only runs actors on CPU:\n```\npython3 train.py --actor_device_cpu\n```\nFor more customized configuration of training, see the following optional arguments:\n```\n--xpid XPID           Experiment id (default: douzero)\n--save_interval SAVE_INTERVAL\n                      Time interval (in minutes) at which to save the model\n--objective {adp,wp}  Use ADP or WP as reward (default: ADP)\n--actor_device_cpu    Use CPU as actor device\n--gpu_devices GPU_DEVICES\n                      Which GPUs to be used for training\n--num_actor_devices NUM_ACTOR_DEVICES\n                      The number of devices used for simulation\n--num_actors NUM_ACTORS\n                      The number of actors for each simulation device\n--training_device TRAINING_DEVICE\n                      The index of the GPU used for training models. 
`cpu`\n                \t  means using cpu\n--load_model          Load an existing model\n--disable_checkpoint  Disable saving checkpoint\n--savedir SAVEDIR     Root dir where experiment data will be saved\n--total_frames TOTAL_FRAMES\n                      Total environment frames to train for\n--exp_epsilon EXP_EPSILON\n                      The probability for exploration\n--batch_size BATCH_SIZE\n                      Learner batch size\n--unroll_length UNROLL_LENGTH\n                      The unroll length (time dimension)\n--num_buffers NUM_BUFFERS\n                      Number of shared-memory buffers\n--num_threads NUM_THREADS\n                      Number learner threads\n--max_grad_norm MAX_GRAD_NORM\n                      Max norm of gradients\n--learning_rate LEARNING_RATE\n                      Learning rate\n--alpha ALPHA         RMSProp smoothing constant\n--momentum MOMENTUM   RMSProp momentum\n--epsilon EPSILON     RMSProp epsilon\n```\n\n## Evaluation\nThe evaluation can be performed with GPU or CPU (GPU will be much faster). Pre-trained models are available at [Google Drive](https://drive.google.com/drive/folders/1NmM2cXnI5CIWHaLJeoDZMiwt6lOTV_UB?usp=sharing) or [Baidu Netdisk](https://pan.baidu.com/s/18g-JUKad6D8rmBONXUDuOQ) (extraction code: 4624). Put pre-trained weights in `baselines/`. The performance is evaluated through self-play. 
We have provided pre-trained models and some heuristics as baselines:\n*   [random](douzero/evaluation/random_agent.py): agents that play randomly (uniformly)\n*   [rlcard](douzero/evaluation/rlcard_agent.py): the rule-based agent in [RLCard](https://github.com/datamllab/rlcard)\n*   SL (`baselines/sl/`): deep agents pre-trained on human data\n*   DouZero-ADP (`baselines/douzero_ADP/`): the pre-trained DouZero agents with Average Difference Points (ADP) as objective\n*   DouZero-WP (`baselines/douzero_WP/`): the pre-trained DouZero agents with Winning Percentage (WP) as objective\n\n### Step 1: Generate evaluation data\n```\npython3 generate_eval_data.py\n```\nSome important arguments are as follows.\n*   `--output`: where the pickled data will be saved\n*   `--num_games`: how many random games will be generated, default 10000\n\n### Step 2: Self-Play\n```\npython3 evaluate.py\n```\nSome important arguments are as follows.\n*   `--landlord`: which agent will play as Landlord, which can be random, rlcard, or the path of the pre-trained model\n*   `--landlord_up`: which agent will play as LandlordUp (the one who plays before the Landlord), which can be random, rlcard, or the path of the pre-trained model\n*   `--landlord_down`: which agent will play as LandlordDown (the one who plays after the Landlord), which can be random, rlcard, or the path of the pre-trained model\n*   `--eval_data`: the pickle file that contains evaluation data\n*   `--num_workers`: how many subprocesses will be used\n*   `--gpu_device`: which GPU to use. 
It will use CPU by default\n\nFor example, the following command evaluates DouZero-ADP in the Landlord position against random agents:\n```\npython3 evaluate.py --landlord baselines/douzero_ADP/landlord.ckpt --landlord_up random --landlord_down random\n```\nThe following command evaluates DouZero-ADP in the Peasants position against RLCard agents:\n```\npython3 evaluate.py --landlord rlcard --landlord_up baselines/douzero_ADP/landlord_up.ckpt --landlord_down baselines/douzero_ADP/landlord_down.ckpt\n```\nBy default, our model will be saved in `douzero_checkpoints/douzero` every half hour. We provide a script to help you identify the most recent checkpoint. Run\n```\nsh get_most_recent.sh douzero_checkpoints/douzero/\n```\nThe most recent model will be in `most_recent_model`.\n\n## Issues in Windows\nYou may encounter an `operation not supported` error if you use a Windows system to train with GPUs as actors. This is because doing multiprocessing on CUDA tensors is not supported in Windows. However, our code extensively operates on the CUDA tensors since the code is optimized for GPUs. Please contact us if you find any solutions!\n\n## Core Team\n*   Algorithm: [Daochen Zha](https://github.com/daochenzha), [Jingru Xie](https://github.com/karoka), Wenye Ma, Sheng Zhang, [Xiangru Lian](https://xrlian.com/), Xia Hu, [Ji Liu](http://jiliu-ml.org/)\n*   GUI Demo: [Songyi Huang](https://github.com/hsywhu)\n*   Community contributors: [@Vincentzyx](https://github.com/Vincentzyx)\n\n## Acknowledgements\n*   The demo is largely based on [RLCard-Showdown](https://github.com/datamllab/rlcard-showdown)\n*   Code implementation is inspired by [TorchBeast](https://github.com/facebookresearch/torchbeast)\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwai%2FDouZero","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fkwai%2FDouZero","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fkwai%2FDouZero/lists"}