{"id":13658368,"url":"https://github.com/StepNeverStop/RLs","last_synced_at":"2025-04-24T08:32:11.690Z","repository":{"id":37318518,"uuid":"183474075","full_name":"StepNeverStop/RLs","owner":"StepNeverStop","description":"Reinforcement Learning Algorithms Based on PyTorch","archived":false,"fork":false,"pushed_at":"2021-10-21T15:23:48.000Z","size":12287,"stargazers_count":446,"open_issues_count":24,"forks_count":96,"subscribers_count":18,"default_branch":"master","last_synced_at":"2024-08-02T05:08:35.616Z","etag":null,"topics":["deep-reinforcement-learning","gym","ml-agents","pytorch","reinforcement-learning","reinforcement-learning-algorithm","sac","training-agents","unity3d"],"latest_commit_sha":null,"homepage":"https://stepneverstop.github.io","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/StepNeverStop.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2019-04-25T16:44:54.000Z","updated_at":"2024-07-01T15:20:50.000Z","dependencies_parsed_at":"2022-09-03T18:12:01.781Z","dependency_job_id":null,"html_url":"https://github.com/StepNeverStop/RLs","commit_stats":null,"previous_names":[],"tags_count":1,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StepNeverStop%2FRLs","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StepNeverStop%2FRLs/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StepNeverStop%2FRLs/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/StepNeverStop%2FRLs/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub
/owners/StepNeverStop","download_url":"https://codeload.github.com/StepNeverStop/RLs/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":223947363,"owners_count":17230017,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-reinforcement-learning","gym","ml-agents","pytorch","reinforcement-learning","reinforcement-learning-algorithm","sac","training-agents","unity3d"],"created_at":"2024-08-02T05:00:59.055Z","updated_at":"2024-11-10T11:31:48.193Z","avatar_url":"https://github.com/StepNeverStop.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"\u003cdiv align=\"center\"\u003e\r\n\t\u003ca href=\"https://github.com/StepNeverStop/RLs\"\u003e\r\n\t\t\u003cimg width=\"auto\" height=\"200px\" src=\"./pics/logo.png\"\u003e\r\n\t\u003c/a\u003e\r\n\t\u003cbr/\u003e\r\n\t\u003cbr/\u003e\r\n\t\u003ca href=\"https://github.com/StepNeverStop/RLs\"\u003e\r\n\t\t\u003cimg width=\"auto\" height=\"20px\" src=\"./pics/font.png\"\u003e\r\n\t\u003c/a\u003e\r\n\u003c/div\u003e\r\n\r\n\u003cdiv align=\"center\"\u003e\r\n\u003cp\u003e\u003cstrong\u003eRLs:\u003c/strong\u003e Reinforcement Learning Algorithm Based On PyTorch.\u003c/p\u003e \r\n\u003c/div\u003e\r\n\r\n# RLs\r\n\r\nThis project includes SOTA or classic reinforcement learning (single and multi-agent) algorithms used for training\r\nagents by interacting with Unity through [ml-agents](https://github.com/Unity-Technologies/ml-agents/tree/release_18)\r\nRelease 18 or with [gym](https://github.com/openai/gym).\r\n\r\n## 
About\r\n\r\nThe goal of this framework is to provide stable implementations of standard RL algorithms and simultaneously enable fast\r\nprototyping of new methods. It aims to fill the need for a small, easily grokked codebase in which users can freely\r\nexperiment with wild ideas (speculative research).\r\n\r\n## Characteristics\r\n\r\nThis project supports:\r\n\r\n- Windows, Linux, and OSX.\r\n- Single- and Multi-Agent training.\r\n- Multiple types of observation sensors as input.\r\n- Only 3 steps are needed to implement a new algorithm:\r\n    1. **policy** write a `.py` file in the `rls/algorithms/{single/multi}` directory and make the policy inherit from the super-class\r\n       defined in `rls/algorithms/base`\r\n    2. **config** write a `.yaml` file in the `rls/configs/algorithms/` directory and specify the super config type defined\r\n       in `rls/configs/algorithms/general.yaml`\r\n    3. **register** register the new algorithm in `rls/algorithms/__init__.py`\r\n- Only 3 steps are needed to adapt to a new training environment:\r\n    1. **wrapper** write environment wrappers in the `rls/envs/{new platform}` directory and make them inherit from the\r\n       super-class defined in `rls/envs/env_base.py`\r\n    2. **config** write the default configuration in `rls/configs/{new platform}`\r\n    3. **register** register the new environment platform in `rls/envs/__init__.py`\r\n- Compatible with several environment platforms:\r\n    - [Unity3D ml-agents](https://github.com/Unity-Technologies/ml-agents).\r\n    - [PettingZoo](https://www.pettingzoo.ml/#)\r\n    - [gym](https://github.com/openai/gym), for now only two data types are compatible: `[Box, Discrete]`. 
Supports\r\n      parallel training using gym envs; just set `--copies` to the number of agents you want to train in\r\n      parallel.\r\n        - environments:\r\n            - [MuJoCo](https://github.com/openai/mujoco-py)(v2.0.2.13)\r\n            - [PyBullet](https://github.com/bulletphysics/bullet3)\r\n            - [gym_minigrid](https://github.com/maximecb/gym-minigrid)\r\n        - observation -\u003e action:\r\n            - Discrete -\u003e Discrete (observation type -\u003e action type)\r\n            - Discrete -\u003e Box\r\n            - Box -\u003e Discrete\r\n            - Box -\u003e Box\r\n            - Box/Discrete -\u003e Tuple(Discrete, Discrete, Discrete)\r\n- Replay Buffer types (the default is ER):\r\n    - ER\r\n    - [Prioritized ER](https://arxiv.org/abs/1511.05952)\r\n- [Noisy Net](https://arxiv.org/abs/1706.10295) for better exploration.\r\n- [Intrinsic Curiosity Module](https://arxiv.org/abs/1705.05363) for almost all off-policy algorithms implemented.\r\n- Parallel training of multiple scenes for Gym\r\n- Unified data format\r\n\r\n## Installation\r\n\r\nmethod 1:\r\n\r\n```bash\r\n$ git clone https://github.com/StepNeverStop/RLs.git\r\n$ cd RLs\r\n$ conda create -n rls python=3.8\r\n$ conda activate rls\r\n# Windows\r\n$ pip install -e .[windows]\r\n# Linux or Mac OS\r\n$ pip install -e .\r\n```\r\n\r\nmethod 2:\r\n\r\n```bash\r\nconda env create -f environment.yaml\r\n```\r\n\r\nIf using ml-agents:\r\n\r\n```bash\r\n$ pip install -e .[unity]\r\n```\r\n\r\nYou can download the prebuilt docker image from [here](https://hub.docker.com/r/keavnn/rls):\r\n\r\n```bash\r\n$ docker pull keavnn/rls:latest\r\n```\r\n\r\nIf you want to send a PR, please format all code files first:\r\n\r\n```bash\r\n$ pip install -e .[pr]\r\n$ python auto_format.py -d ./\r\n```\r\n\r\n## Implemented Algorithms\r\n\r\nFor now, these algorithms are available:\r\n\r\n- Multi-Agent training algorithms:\r\n    - Independent-SARL, i.e. 
IQL, [I-DQN](http://arxiv.org/abs/1511.08779), etc.\r\n    - [Value-Decomposition Networks, VDN](http://arxiv.org/abs/1706.05296)\r\n    - [Monotonic Value Function Factorisation Networks, QMIX](http://arxiv.org/abs/1803.11485)\r\n    - [Multi-head Attention based Q-value Mixing Network, Qatten](http://arxiv.org/abs/2002.03939)\r\n    - [Factorize with Transformation, Qtran](https://arxiv.org/abs/1905.05408)\r\n    - [Duplex Dueling Multi-Agent Q-Learning, QPLEX](http://arxiv.org/abs/2008.01062)\r\n    - [Multi-Agent Deep Deterministic Policy Gradient, MADDPG](https://arxiv.org/abs/1706.02275)\r\n- Single-Agent training algorithms (some algorithms that only support continuous action spaces use the Gumbel-softmax trick\r\n  to implement discrete versions, e.g. DDPG):\r\n    - Policy Gradient, PG\r\n    - Actor Critic, AC\r\n    - [Synchronous Advantage Actor Critic, A2C](http://arxiv.org/abs/1602.01783)\r\n    \u003c!-- - [Trust Region Policy Optimization, TRPO](https://arxiv.org/abs/1502.05477) --\u003e\r\n    - :boom:Proximal Policy Optimization, [PPO](https://arxiv.org/abs/1707.06347)\r\n      , [DPPO](http://arxiv.org/abs/1707.02286)\r\n    - [Trust Region Policy Optimization, TRPO](https://arxiv.org/abs/1502.05477)\r\n    - [Natural Policy Gradient, NPG](https://proceedings.neurips.cc/paper/2001/file/4b86abe48d358ecf194c56c69108433e-Paper.pdf)\r\n    - [Deterministic Policy Gradient, DPG](https://hal.inria.fr/file/index/docid/938992/filename/dpg-icml2014.pdf)\r\n    - [Deep Deterministic Policy Gradient, DDPG](https://arxiv.org/abs/1509.02971)\r\n    - :fire:Soft Actor Critic, [SAC](https://arxiv.org/abs/1812.05905), [Discrete SAC](https://arxiv.org/abs/1910.07207)\r\n    - [Tsallis Actor Critic, TAC](https://arxiv.org/abs/1902.00137)\r\n    - :fire:[Twin Delayed Deep Deterministic Policy Gradient, TD3](https://arxiv.org/abs/1802.09477)\r\n    - Deep Q-learning Network, DQN, [2013](https://arxiv.org/pdf/1312.5602.pdf)\r\n      , 
[2015](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)\r\n    - [Double Deep Q-learning Network, DDQN](https://arxiv.org/abs/1509.06461)\r\n    - [Dueling Double Deep Q-learning Network, DDDQN](https://arxiv.org/abs/1511.06581)\r\n    - [Deep Recurrent Q-learning Network, DRQN](https://arxiv.org/abs/1507.06527)\r\n    - [Deep Recurrent Double Q-learning, DRDQN](https://arxiv.org/abs/1908.06040)\r\n    - [Category 51, C51](https://arxiv.org/abs/1707.06887)\r\n    - [Quantile Regression DQN, QR-DQN](https://arxiv.org/abs/1710.10044)\r\n    - [Implicit Quantile Networks, IQN](https://arxiv.org/abs/1806.06923)\r\n    - [Rainbow DQN](https://arxiv.org/abs/1710.02298)\r\n    - [MaxSQN](https://github.com/createamind/DRL/blob/master/spinup/algos/maxsqn/maxsqn.py)\r\n    - [Soft Q-Learning, SQL](https://arxiv.org/abs/1702.08165)\r\n    - [Bootstrapped DQN](http://arxiv.org/abs/1602.04621)\r\n    - [Averaged DQN](http://arxiv.org/abs/1611.01929)\r\n    - Hierarchical training algorithms:\r\n        - [Option-Critic, OC](http://arxiv.org/abs/1609.05140)\r\n        - [Asynchronous Advantage Option-Critic, A2OC](http://arxiv.org/abs/1709.04571)\r\n        - [PPO Option-Critic, PPOC](http://arxiv.org/abs/1712.00004)\r\n        - [Interest-Option-Critic, IOC](http://arxiv.org/abs/2001.00271)\r\n    - Model-based algorithms:\r\n        - [Learning Latent Dynamics for Planning from Pixels, PlaNet](http://arxiv.org/abs/1811.04551)\r\n        - [Dream to Control, Dreamer](http://arxiv.org/abs/1912.01603)\r\n        - [Mastering Atari with Discrete World Models, DreamerV2](http://arxiv.org/abs/2010.02193)\r\n        - [Model-Based Value Estimation, MVE](http://arxiv.org/abs/1803.00101)\r\n    - Offline algorithms (**under implementation**):\r\n        - [Conservative Q-Learning for Offline Reinforcement Learning, CQL](http://arxiv.org/abs/2006.04779)\r\n        - BCQ\r\n            - Benchmarking Batch Deep Reinforcement Learning Algorithms, 
[Discrete](http://arxiv.org/abs/1910.01708)\r\n            - Off-Policy Deep Reinforcement Learning without Exploration, [Continuous](http://arxiv.org/abs/1812.02900)\r\n\r\n\r\n|           Algorithms            | Discrete | Continuous | Image | RNN  | Command parameter |\r\n| :-----------------------------: | :------: | :--------: | :---: | :--: | :---------------: |\r\n|               PG                |    ✓     |     ✓      |   ✓   |  ✓   |        pg         |\r\n|               AC                |    ✓     |     ✓      |   ✓   |  ✓   |        ac         |\r\n|               A2C               |    ✓     |     ✓      |   ✓   |  ✓   |        a2c        |\r\n|               NPG               |    ✓     |     ✓      |   ✓   |  ✓   |        npg        |\r\n|              TRPO               |    ✓     |     ✓      |   ✓   |  ✓   |       trpo        |\r\n|               PPO               |    ✓     |     ✓      |   ✓   |  ✓   |        ppo        |\r\n|               DQN               |    ✓     |            |   ✓   |  ✓   |        dqn        |\r\n|           Double DQN            |    ✓     |            |   ✓   |  ✓   |       ddqn        |\r\n|       Dueling Double DQN        |    ✓     |            |   ✓   |  ✓   |       dddqn       |\r\n|          Averaged DQN           |    ✓     |            |   ✓   |  ✓   |    averaged_dqn   |\r\n|        Bootstrapped DQN         |    ✓     |            |   ✓   |  ✓   |  bootstrappeddqn  |\r\n|         Soft Q-Learning         |    ✓     |            |   ✓   |  ✓   |        sql        |\r\n|               C51               |    ✓     |            |   ✓   |  ✓   |        c51        |\r\n|             QR-DQN              |    ✓     |            |   ✓   |  ✓   |       qrdqn       |\r\n|               IQN               |    ✓     |            |   ✓   |  ✓   |        iqn        |\r\n|             Rainbow             |    ✓     |            |   ✓   |  ✓   |      rainbow      |\r\n|               DPG               |    ✓     |     ✓      
|   ✓   |  ✓   |        dpg        |\r\n|              DDPG               |    ✓     |     ✓      |   ✓   |  ✓   |       ddpg        |\r\n|               TD3               |    ✓     |     ✓      |   ✓   |  ✓   |        td3        |\r\n|       SAC(has V network)        |    ✓     |     ✓      |   ✓   |  ✓   |       sac_v       |\r\n|               SAC               |    ✓     |     ✓      |   ✓   |  ✓   |        sac        |\r\n|               TAC               |   sac    |     ✓      |   ✓   |  ✓   |        tac        |\r\n|             MaxSQN              |    ✓     |            |   ✓   |  ✓   |      maxsqn       |\r\n|               OC                |    ✓     |     ✓      |   ✓   |  ✓   |        oc         |\r\n|               AOC               |    ✓     |     ✓      |   ✓   |  ✓   |        aoc        |\r\n|              PPOC               |    ✓     |     ✓      |   ✓   |  ✓   |       ppoc        |\r\n|               IOC               |    ✓     |     ✓      |   ✓   |  ✓   |        ioc        |\r\n|             PlaNet              |    ✓     |            |   ✓   |  1   |      planet       |\r\n|             Dreamer             |    ✓     |     ✓      |   ✓   |  1   |      dreamer      |\r\n|            DreamerV2            |    ✓     |     ✓      |   ✓   |  1   |     dreamerv2     |\r\n|               VDN               |    ✓     |            |   ✓   |  ✓   |        vdn        |\r\n|              QMIX               |    ✓     |            |   ✓   |  ✓   |       qmix        |\r\n|             Qatten              |    ✓     |            |   ✓   |  ✓   |      qatten       |\r\n|              QPLEX              |    ✓     |            |   ✓   |  ✓   |       qplex       |\r\n|              QTRAN              |    ✓     |            |   ✓   |  ✓   |       qtran       |\r\n|             MADDPG              |    ✓     |     ✓      |   ✓   |  ✓   |      maddpg       |\r\n|              MASAC              |    ✓     |     ✓      |   ✓   |  ✓   |       masac       
|\r\n|               CQL               |    ✓     |            |   ✓   |  ✓   |      cql_dqn      |\r\n|               BCQ               |    ✓     |     ✓      |   ✓   |  ✓   |        bcq        |\r\n|               MVE               |    ✓     |     ✓      |       |      |        mve        |\r\n\r\n*1 means must use rnn or rnn is used by default.*\r\n\r\n## Getting started\r\n\r\n```python\r\n\"\"\"\r\nusage: run.py [-h] [-c COPIES] [--seed SEED] [-r]\r\n              [-p {gym,unity,pettingzoo}]\r\n              [-a {maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}]\r\n              [-i] [-l LOAD_PATH] [-m MODELS] [-n NAME]\r\n              [--config-file CONFIG_FILE] [--store-dir STORE_DIR]\r\n              [--episode-length EPISODE_LENGTH] [--hostname] [-e ENV_NAME]\r\n              [-f FILE_NAME] [-s] [-d DEVICE] [-t MAX_TRAIN_STEP]\r\n\r\noptional arguments:\r\n  -h, --help            show this help message and exit\r\n  -c COPIES, --copies COPIES\r\n                        nums of environment copies that collect data in\r\n                        parallel\r\n  --seed SEED           specify the random seed of module random, numpy and\r\n                        pytorch\r\n  -r, --render          whether render game interface\r\n  -p {gym,unity,pettingzoo}, --platform {gym,unity,pettingzoo}\r\n                        specify the platform of training environment\r\n  -a {maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}, --algorithm 
{maddpg,masac,vdn,qmix,qatten,qtran,qplex,aoc,ppoc,oc,ioc,planet,dreamer,dreamerv2,mve,cql_dqn,bcq,pg,npg,trpo,ppo,a2c,ac,dpg,ddpg,td3,sac_v,sac,tac,dqn,ddqn,dddqn,averaged_dqn,c51,qrdqn,rainbow,iqn,maxsqn,sql,bootstrappeddqn}\r\n                        specify the training algorithm\r\n  -i, --inference       inference the trained model, not train policies\r\n  -l LOAD_PATH, --load-path LOAD_PATH\r\n                        specify the name of pre-trained model that need to\r\n                        load\r\n  -m MODELS, --models MODELS\r\n                        specify the number of trails that using different\r\n                        random seeds\r\n  -n NAME, --name NAME  specify the name of this training task\r\n  --config-file CONFIG_FILE\r\n                        specify the path of training configuration file\r\n  --store-dir STORE_DIR\r\n                        specify the directory that store model, log and\r\n                        others\r\n  --episode-length EPISODE_LENGTH\r\n                        specify the maximum step per episode\r\n  --hostname            whether concatenate hostname with the training name\r\n  -e ENV_NAME, --env-name ENV_NAME\r\n                        specify the environment name\r\n  -f FILE_NAME, --file-name FILE_NAME\r\n                        specify the path of builded training environment of\r\n                        UNITY3D\r\n  -s, --save            specify whether save models/logs/summaries while\r\n                        training or not\r\n  -d DEVICE, --device DEVICE\r\n                        specify the device that operate Torch.Tensor\r\n  -t MAX_TRAIN_STEP, --max-train-step MAX_TRAIN_STEP\r\n                        specify the maximum training steps\r\n\"\"\"\r\n```\r\n\r\nExample:\r\n\r\n```bash\r\npython run.py -s    # save model and log while train\r\npython run.py -p gym -a dqn -e CartPole-v0 -c 12 -n dqn_cartpole\r\npython run.py -p unity -a ppo -n run_with_unity -c 1\r\n```\r\n\r\nThe main training 
loop of this repo, in **pseudo-code**, is as follows:\r\n\r\n```python\r\n# noinspection PyUnresolvedReferences\r\nagent.episode_reset()  # initialize rnn hidden state or something else\r\n# noinspection PyUnresolvedReferences\r\nobs = env.reset()\r\nwhile True:\r\n    # noinspection PyUnresolvedReferences\r\n    env_rets = env.step(agent(obs))\r\n    # noinspection PyUnresolvedReferences\r\n    agent.episode_step(obs, env_rets)  # store experience, save model, and train off-policy algorithms\r\n    obs = env_rets['obs']\r\n    if env_rets['done']:\r\n        break\r\n# noinspection PyUnresolvedReferences\r\nagent.episode_end()  # train on-policy algorithms\r\n```\r\n\r\n## Giving credit\r\n\r\nIf you use this repository for your research, please cite:\r\n\r\n```\r\n@misc{RLs,\r\n  author = {Keavnn},\r\n  title = {RLs: A Featureless Reinforcement Learning Repository},\r\n  year = {2019},\r\n  publisher = {GitHub},\r\n  journal = {GitHub repository},\r\n  howpublished = {\\url{https://github.com/StepNeverStop/RLs}},\r\n}\r\n```\r\n\r\n## Issues\r\n\r\nIf you have questions about this project or find errors, please let me know [here](https://github.com/StepNeverStop/RLs/issues/new).\r\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStepNeverStop%2FRLs","html_url":"https://awesome.ecosyste.ms/projects/github.com%2FStepNeverStop%2FRLs","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2FStepNeverStop%2FRLs/lists"}