{"id":22000147,"url":"https://github.com/agilerl/agilerl","last_synced_at":"2025-05-16T11:02:44.010Z","repository":{"id":148761302,"uuid":"608227238","full_name":"AgileRL/AgileRL","owner":"AgileRL","description":"Streamlining reinforcement learning with RLOps. State-of-the-art RL algorithms and tools.","archived":false,"fork":false,"pushed_at":"2024-05-16T21:51:08.000Z","size":56017,"stargazers_count":501,"open_issues_count":1,"forks_count":38,"subscribers_count":8,"default_branch":"main","last_synced_at":"2024-05-17T17:04:14.516Z","etag":null,"topics":["agilerl","automl","deep-learning","deep-reinforcement-learning","distributed","evolutionary-algorithms","gym","hpo","hyperparameter-optimization","hyperparameter-tuning","machine-learning","mlops","multi-agent","multi-agent-reinforcement-learning","pettingzoo","python","pytorch","reinforcement-learning","rlops","training"],"latest_commit_sha":null,"homepage":"https://agilerl.com","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"apache-2.0","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/AgileRL.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":"CONTRIBUTING.md","funding":null,"license":"LICENSE","code_of_conduct":"CODE_OF_CONDUCT.md","threat_model":null,"audit":null,"citation":"CITATION.cff","codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2023-03-01T15:27:51.000Z","updated_at":"2024-06-21T18:20:12.362Z","dependencies_parsed_at":null,"dependency_job_id":"aae1bdde-22e4-481c-a142-42ad9fc6187e","html_url":"https://github.com/AgileRL/AgileRL","commit_stats":null,"previous_names":[],"tags_count":12,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgileRL%2FAgileRL","tags_url":"https://repos.ecosyste.ms/api/v1/
hosts/GitHub/repositories/AgileRL%2FAgileRL/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgileRL%2FAgileRL/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/AgileRL%2FAgileRL/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/AgileRL","download_url":"https://codeload.github.com/AgileRL/AgileRL/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":247974732,"owners_count":21026742,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agilerl","automl","deep-learning","deep-reinforcement-learning","distributed","evolutionary-algorithms","gym","hpo","hyperparameter-optimization","hyperparameter-tuning","machine-learning","mlops","multi-agent","multi-agent-reinforcement-learning","pettingzoo","python","pytorch","reinforcement-learning","rlops","training"],"created_at":"2024-11-29T23:09:13.557Z","updated_at":"2025-05-16T11:02:44.001Z","avatar_url":"https://github.com/AgileRL.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# AgileRL\n\u003cp align=\"center\"\u003e\n  \u003cimg src=https://user-images.githubusercontent.com/47857277/222710068-e09a4e3c-368c-458a-9e01-b68674806887.png height=\"120\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003e\u003cb\u003eReinforcement learning streamlined.\u003c/b\u003e\u003cbr\u003eEasier and faster reinforcement learning with RLOps. Visit our \u003ca href=\"https://agilerl.com\"\u003ewebsite\u003c/a\u003e. 
View \u003ca href=\"https://docs.agilerl.com\"\u003edocumentation\u003c/a\u003e.\u003cbr\u003eJoin the \u003ca href=\"https://discord.gg/eB8HyTA2ux\"\u003eDiscord Server\u003c/a\u003e for questions, help and collaboration.\u003c/p\u003e\n\n\u003cdiv align=\"center\"\u003e\n\n[![License](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)\n[![Documentation Status](https://readthedocs.org/projects/agilerl/badge/?version=latest)](https://docs.agilerl.com/en/latest/?badge=latest)\n[![Downloads](https://static.pepy.tech/badge/agilerl)](https://pypi.python.org/pypi/agilerl/)\n[![Discord](https://dcbadge.vercel.app/api/server/eB8HyTA2ux?style=flat)](https://discord.gg/eB8HyTA2ux)\n[![Arena](./.github/badges/arena-github-badge.svg)](https://arena.agilerl.com)\n\u003cbr\u003e\n\u003ch3\u003e\u003ci\u003e✨ \u003cb\u003eAgileRL 2.0 is here! Check out the latest powerful \u003ca href=https://docs.agilerl.com/en/latest/get_started/agilerl2changes.html\u003eupdates\u003c/a\u003e✨ \u003c/b\u003e\u003c/i\u003e\u003c/h3\u003e\n\u003ch3\u003e\u003ci\u003e🚀 \u003cb\u003eTrain super-fast for free on \u003ca href=\"https://arena.agilerl.com\"\u003eArena\u003c/a\u003e, the RLOps platform from AgileRL 🚀\u003c/b\u003e\u003c/i\u003e\u003c/h3\u003e\n\u003c/div\u003e\n\u003cbr\u003e\n\nAgileRL is a Deep Reinforcement Learning library focused on improving development by introducing RLOps - MLOps for reinforcement learning.\n\nThis library is initially focused on reducing the time taken for training models and hyperparameter optimization (HPO) by pioneering [evolutionary HPO techniques](https://docs.agilerl.com/en/latest/evo_hyperparam_opt/index.html) for reinforcement learning.\u003cbr\u003e\nEvolutionary HPO has been shown to drastically reduce overall training times by automatically converging on optimal hyperparameters, without requiring numerous training runs.\u003cbr\u003e\nWe are constantly adding more algorithms and features. 
AgileRL already includes state-of-the-art evolvable [on-policy](https://docs.agilerl.com/en/latest/on_policy/index.html), [off-policy](https://docs.agilerl.com/en/latest/off_policy/index.html), [offline](https://docs.agilerl.com/en/latest/offline_training/index.html), [multi-agent](https://docs.agilerl.com/en/latest/multi_agent_training/index.html) and [contextual multi-armed bandit](https://docs.agilerl.com/en/latest/bandits/index.html) reinforcement learning algorithms with [distributed training](https://docs.agilerl.com/en/latest/distributed_training/index.html).\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=https://user-images.githubusercontent.com/47857277/236407686-21363eb3-ffcf-419f-b019-0be4ddf1ed4a.gif width=\"100%\" max-width=\"900\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eAgileRL offers 10x faster hyperparameter optimization than SOTA.\u003c/p\u003e\n\n## Table of Contents\n  * [Get Started](#get-started)\n  * [Benchmarks](#benchmarks)\n  * [Tutorials](#tutorials)\n  * [Algorithms implemented](#evolvable-algorithms-more-coming-soon)\n  * [Train an agent](#train-an-agent-to-beat-a-gym-environment)\n  * [Citing AgileRL](#citing-agilerl)\n\n## Get Started\n\nTo see the full AgileRL documentation, including tutorials, visit our [documentation site](https://docs.agilerl.com/). To ask questions and get help, collaborate, or discuss anything related to reinforcement learning, join the [AgileRL Discord Server](https://discord.gg/eB8HyTA2ux).\n\nInstall as a package with pip:\n```bash\npip install agilerl\n```\nOr install in development mode:\n```bash\ngit clone https://github.com/AgileRL/AgileRL.git \u0026\u0026 cd AgileRL\npip install -e .\n```\n\n## Benchmarks\n\nReinforcement learning algorithms and libraries are usually benchmarked once the optimal hyperparameters for training are known, but it often takes hundreds or thousands of experiments to discover these. 
This is unrealistic and does not reflect the true, total time taken for training. What if we could remove the need to conduct all these prior experiments?\n\nIn the charts below, a single AgileRL run, which automatically tunes hyperparameters, is benchmarked against the multiple Optuna training runs traditionally required for hyperparameter optimization, demonstrating the real time savings possible. Global steps is the sum of every step taken by any agent in the environment, counted across the entire population.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=https://user-images.githubusercontent.com/47857277/227481592-27a9688f-7c0a-4655-ab32-90d659a71c69.png min-width=\"100%\" width=\"600\"\u003e\n\u003c/p\u003e\n\u003cp align=\"center\"\u003eAgileRL offers an order-of-magnitude speed-up in hyperparameter optimization vs popular reinforcement learning training frameworks combined with Optuna. Remove the need for multiple training runs and save yourself hours.\u003c/p\u003e\n\nAgileRL also supports multi-agent reinforcement learning using the PettingZoo-style parallel API. The charts below highlight the performance of our MADDPG and MATD3 algorithms with evolutionary hyperparameter optimization (HPO), benchmarked against epymarl's MADDPG algorithm with grid-search HPO on the simple speaker listener and simple spread environments.\n\n\u003cp align=\"center\"\u003e\n  \u003cimg src=https://github-production-user-asset-6210df.s3.amazonaws.com/118982716/264712154-4965ea5f-b777-423c-989b-e4db86eda3bd.png  min-width=\"100%\" width=\"700\"\u003e\n\u003c/p\u003e\n\n## Tutorials\n\nWe are constantly updating our tutorials to showcase the latest features of AgileRL and how users can leverage our evolutionary HPO to achieve 10x faster hyperparameter optimization. 
Please see the available tutorials below.\n\n| Tutorial Type | Description | Tutorials |\n|---------------|-------------|-----------|\n| [Single-agent tasks](https://docs.agilerl.com/en/latest/tutorials/gymnasium/index.html) | Guides for training both on and off-policy agents to beat a variety of Gymnasium environments. | [PPO - Acrobot](https://docs.agilerl.com/en/latest/tutorials/gymnasium/agilerl_ppo_tutorial.html) \u003cbr\u003e [TD3 - Lunar Lander](https://docs.agilerl.com/en/latest/tutorials/gymnasium/agilerl_td3_tutorial.html) \u003cbr\u003e [Rainbow DQN - CartPole](https://docs.agilerl.com/en/latest/tutorials/gymnasium/agilerl_rainbow_dqn_tutorial.html) |\n| [Multi-agent tasks](https://docs.agilerl.com/en/latest/tutorials/pettingzoo/index.html) | Use of PettingZoo environments such as training DQN to play Connect Four with curriculum learning and self-play, and for multi-agent tasks in MPE environments. | [DQN - Connect Four](https://docs.agilerl.com/en/latest/tutorials/pettingzoo/dqn.html) \u003cbr\u003e [MADDPG - Space Invaders](https://docs.agilerl.com/en/latest/tutorials/pettingzoo/maddpg.html) \u003cbr\u003e [MATD3 - Speaker Listener](https://docs.agilerl.com/en/latest/tutorials/pettingzoo/matd3.html) |\n| [Hierarchical curriculum learning](https://docs.agilerl.com/en/latest/tutorials/skills/index.html) | Shows how to teach agents Skills and combine them to achieve an end goal. | [PPO - Lunar Lander](https://docs.agilerl.com/en/latest/tutorials/skills/index.html) |\n| [Contextual multi-arm bandits](https://docs.agilerl.com/en/latest/tutorials/bandits/index.html) | Learn to make the correct decision in environments that only have one timestep. 
| [NeuralUCB - Iris Dataset](https://docs.agilerl.com/en/latest/tutorials/bandits/agilerl_neural_ucb_tutorial.html) \u003cbr\u003e [NeuralTS - PenDigits](https://docs.agilerl.com/en/latest/tutorials/bandits/agilerl_neural_ts_tutorial.html) |\n| [Custom Modules \u0026 Networks](https://docs.agilerl.com/en/latest/tutorials/custom_networks/index.html) | Learn how to create custom evolvable modules and networks for RL algorithms. | [Dueling Distributional Q Network](https://docs.agilerl.com/en/latest/tutorials/custom_networks/agilerl_rainbow_tutorial.html) \u003cbr\u003e [EvolvableSimBa](https://docs.agilerl.com/en/latest/tutorials/custom_networks/agilerl_simba_tutorial.html) |\n| [LLM Finetuning](https://docs.agilerl.com/en/latest/tutorials/llm_finetuning/index.html) | Learn how to finetune an LLM using AgileRL. | [GRPO](https://docs.agilerl.com/en/latest/tutorials/llm_finetuning/index.html) |\n\n## Evolvable algorithms (more coming soon!)\n\n  ### Single-agent algorithms\n\n  | RL         | Algorithm |\n  | ---------- | --------- |\n  | [On-Policy](https://docs.agilerl.com/en/latest/on_policy/index.html)  | [Proximal Policy Optimization (PPO)](https://docs.agilerl.com/en/latest/api/algorithms/ppo.html) |\n  | [Off-Policy](https://docs.agilerl.com/en/latest/off_policy/index.html) | [Deep Q Learning (DQN)](https://docs.agilerl.com/en/latest/api/algorithms/dqn.html) \u003cbr\u003e  [Rainbow DQN](https://docs.agilerl.com/en/latest/api/algorithms/dqn_rainbow.html) \u003cbr\u003e [Deep Deterministic Policy Gradient (DDPG)](https://docs.agilerl.com/en/latest/api/algorithms/ddpg.html) \u003cbr\u003e [Twin Delayed Deep Deterministic Policy Gradient (TD3)](https://docs.agilerl.com/en/latest/api/algorithms/td3.html) |\n  | [Offline](https://docs.agilerl.com/en/latest/offline_training/index.html)    | [Conservative Q-Learning (CQL)](https://docs.agilerl.com/en/latest/api/algorithms/cql.html) \u003cbr\u003e  [Implicit Language Q-Learning 
(ILQL)](https://docs.agilerl.com/en/latest/api/algorithms/ilql.html) |\n\n  ### Multi-agent algorithms\n\n  | RL         | Algorithm |\n  | ---------- | --------- |\n  | [Multi-agent](https://docs.agilerl.com/en/latest/multi_agent_training/index.html) | [Multi-Agent Deep Deterministic Policy Gradient (MADDPG)](https://docs.agilerl.com/en/latest/api/algorithms/maddpg.html) \u003cbr\u003e [Multi-Agent Twin-Delayed Deep Deterministic Policy Gradient (MATD3)](https://docs.agilerl.com/en/latest/api/algorithms/matd3.html) |\n\n  ### Contextual multi-armed bandit algorithms\n\n  | RL         | Algorithm |\n  | ---------- | --------- |\n  | [Bandits](https://docs.agilerl.com/en/latest/bandits/index.html) | [Neural Contextual Bandits with UCB-based Exploration (NeuralUCB)](https://docs.agilerl.com/en/latest/api/algorithms/neural_ucb.html) \u003cbr\u003e [Neural Contextual Bandits with Thompson Sampling (NeuralTS)](https://docs.agilerl.com/en/latest/api/algorithms/neural_ts.html) |\n\n## Train an agent to beat a Gym environment\n\nBefore training, a few meta-hyperparameters and settings must be defined: \u003ccode\u003eINIT_HP\u003c/code\u003e holds the general parameters, \u003ccode\u003eMUTATION_PARAMS\u003c/code\u003e sets the relative probabilities of each evolutionary mutation, and \u003ccode\u003eNET_CONFIG\u003c/code\u003e defines the network architecture. For example:\n```python\nINIT_HP = {\n    'ENV_NAME': 'LunarLander-v3',   # Gym environment name\n    'ALGO': 'DQN',                  # Algorithm\n    'DOUBLE': True,                 # Use double Q-learning\n    'CHANNELS_LAST': False,         # Swap image channels dimension from last to first [H, W, C] -\u003e [C, H, W]\n    'BATCH_SIZE': 256,              # Batch size\n    'LR': 1e-3,                     # Learning rate\n    'MAX_STEPS': 1_000_000,         # Max no. 
steps\n    'TARGET_SCORE': 200.,           # Early training stop at avg score of last 100 episodes\n    'GAMMA': 0.99,                  # Discount factor\n    'MEMORY_SIZE': 10000,           # Max memory buffer size\n    'LEARN_STEP': 1,                # Learning frequency\n    'TAU': 1e-3,                    # For soft update of target parameters\n    'TOURN_SIZE': 2,                # Tournament size\n    'ELITISM': True,                # Elitism in tournament selection\n    'POP_SIZE': 6,                  # Population size\n    'EVO_STEPS': 10_000,            # Evolution frequency\n    'EVAL_STEPS': None,             # Evaluation steps\n    'EVAL_LOOP': 1,                 # Evaluation episodes\n    'LEARNING_DELAY': 1000,         # Steps before starting learning\n    'WANDB': True,                  # Log with Weights and Biases\n}\n```\n```python\nMUTATION_PARAMS = {\n    # Relative probabilities\n    'NO_MUT': 0.4,                              # No mutation\n    'ARCH_MUT': 0.2,                            # Architecture mutation\n    'NEW_LAYER': 0.2,                           # New layer mutation\n    'PARAMS_MUT': 0.2,                          # Network parameters mutation\n    'ACT_MUT': 0,                               # Activation layer mutation\n    'RL_HP_MUT': 0.2,                           # Learning HP mutation\n    'MUT_SD': 0.1,                              # Mutation strength\n    'RAND_SEED': 1,                             # Random seed\n}\n```\n```python\nNET_CONFIG = {\n    'latent_dim': 16,\n\n    'encoder_config': {\n        'hidden_size': [32]     # Observation encoder configuration\n    },\n\n    'head_config': {\n        'hidden_size': [32]     # Network head configuration\n    },\n}\n```\nFirst, use \u003ccode\u003eagilerl.utils.utils.create_population\u003c/code\u003e to create a list of agents: the population that will evolve and mutate towards the optimal hyperparameters.\n```python\nimport torch\nfrom agilerl.utils.utils import (\n    make_vect_envs,\n  
  create_population,\n    observation_space_channels_to_first\n)\n\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\n\nnum_envs = 16\nenv = make_vect_envs(env_name=INIT_HP['ENV_NAME'], num_envs=num_envs)\n\nobservation_space = env.single_observation_space\naction_space = env.single_action_space\nif INIT_HP['CHANNELS_LAST']:\n    observation_space = observation_space_channels_to_first(observation_space)\n\nagent_pop = create_population(\n    algo=INIT_HP['ALGO'],                 # Algorithm\n    observation_space=observation_space,  # Observation space\n    action_space=action_space,            # Action space\n    net_config=NET_CONFIG,                # Network configuration\n    INIT_HP=INIT_HP,                      # Initial hyperparameters\n    population_size=INIT_HP['POP_SIZE'],  # Population size\n    num_envs=num_envs,                    # Number of vectorized environments\n    device=device\n)\n```\nNext, create the tournament, mutations and experience replay buffer objects that allow agents to share memory and efficiently perform evolutionary HPO.\n```python\nfrom agilerl.components.replay_buffer import ReplayBuffer\nfrom agilerl.hpo.tournament import TournamentSelection\nfrom agilerl.hpo.mutation import Mutations\n\nmemory = ReplayBuffer(\n    max_size=INIT_HP['MEMORY_SIZE'],   # Max replay buffer size\n    device=device,\n)\n\ntournament = TournamentSelection(\n    tournament_size=INIT_HP['TOURN_SIZE'], # Tournament selection size\n    elitism=INIT_HP['ELITISM'],            # Elitism in tournament selection\n    population_size=INIT_HP['POP_SIZE'],   # Population size\n    eval_loop=INIT_HP['EVAL_LOOP'],        # Evaluate using last N fitness scores\n)\n\nmutations = Mutations(\n    no_mutation=MUTATION_PARAMS['NO_MUT'],                # No mutation\n    architecture=MUTATION_PARAMS['ARCH_MUT'],             # Architecture mutation\n    new_layer_prob=MUTATION_PARAMS['NEW_LAYER'],          # New layer mutation\n    
parameters=MUTATION_PARAMS['PARAMS_MUT'],             # Network parameters mutation\n    activation=MUTATION_PARAMS['ACT_MUT'],                # Activation layer mutation\n    rl_hp=MUTATION_PARAMS['RL_HP_MUT'],                   # Learning HP mutation\n    mutation_sd=MUTATION_PARAMS['MUT_SD'],                # Mutation strength\n    rand_seed=MUTATION_PARAMS['RAND_SEED'],               # Random seed\n    device=device,\n)\n```\nThe simplest way to run the training loop is with our \u003ccode\u003etrain_off_policy()\u003c/code\u003e function. It requires each \u003ccode\u003eagent\u003c/code\u003e to have \u003ccode\u003eget_action()\u003c/code\u003e and \u003ccode\u003elearn()\u003c/code\u003e methods.\n```python\nfrom agilerl.training.train_off_policy import train_off_policy\n\ntrained_pop, pop_fitnesses = train_off_policy(\n    env=env,                                   # Gym-style environment\n    env_name=INIT_HP['ENV_NAME'],              # Environment name\n    algo=INIT_HP['ALGO'],                      # Algorithm\n    pop=agent_pop,                             # Population of agents\n    memory=memory,                             # Replay buffer\n    swap_channels=INIT_HP['CHANNELS_LAST'],    # Swap image channel from last to first\n    max_steps=INIT_HP[\"MAX_STEPS\"],            # Max number of training steps\n    evo_steps=INIT_HP['EVO_STEPS'],            # Evolution frequency\n    eval_steps=INIT_HP[\"EVAL_STEPS\"],          # Number of steps in evaluation episode\n    eval_loop=INIT_HP[\"EVAL_LOOP\"],            # Number of evaluation episodes\n    learning_delay=INIT_HP['LEARNING_DELAY'],  # Steps before starting learning\n    target=INIT_HP['TARGET_SCORE'],            # Target score for early stopping\n    tournament=tournament,                     # Tournament selection object\n    mutation=mutations,                        # Mutations object\n    wb=INIT_HP['WANDB'],                       # Weights and Biases tracking\n)\n```\n\n## Citing 
AgileRL\n\nIf you use AgileRL in your work, please cite the repository:\n```bibtex\n@software{Ustaran-Anderegg_AgileRL,\nauthor = {Ustaran-Anderegg, Nicholas and Pratt, Michael and Sabal-Bermudez, Jaime},\nlicense = {Apache-2.0},\ntitle = {{AgileRL}},\nurl = {https://github.com/AgileRL/AgileRL}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagilerl%2Fagilerl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fagilerl%2Fagilerl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fagilerl%2Fagilerl/lists"}