{"id":17325978,"url":"https://github.com/eleurent/rl-agents","last_synced_at":"2025-04-13T00:46:14.485Z","repository":{"id":24165170,"uuid":"93446064","full_name":"eleurent/rl-agents","owner":"eleurent","description":"Implementations of Reinforcement Learning and Planning algorithms","archived":false,"fork":false,"pushed_at":"2024-01-13T14:04:29.000Z","size":1059,"stargazers_count":626,"open_issues_count":39,"forks_count":156,"subscribers_count":18,"default_branch":"master","last_synced_at":"2025-04-13T00:46:09.726Z","etag":null,"topics":["agents","planning","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/eleurent.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2017-06-05T20:53:59.000Z","updated_at":"2025-04-10T23:28:32.000Z","dependencies_parsed_at":"2024-01-13T15:41:09.311Z","dependency_job_id":"a87d9f53-0be7-4852-8bf9-a2796d1c83ea","html_url":"https://github.com/eleurent/rl-agents","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eleurent%2Frl-agents","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eleurent%2Frl-agents/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eleurent%2Frl-agents/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/eleurent%2Frl-agents/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners
/eleurent","download_url":"https://codeload.github.com/eleurent/rl-agents/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248650432,"owners_count":21139672,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agents","planning","reinforcement-learning"],"created_at":"2024-10-15T14:14:45.434Z","updated_at":"2025-04-13T00:46:14.459Z","avatar_url":"https://github.com/eleurent.png","language":"Python","funding_links":[],"categories":["漏洞库_漏洞靶场"],"sub_categories":["资源传输下载"],"readme":"# rl-agents\n\nA collection of Reinforcement Learning agents\n\n[![build](https://github.com/eleurent/rl-agents/actions/workflows/build.yml/badge.svg)](https://github.com/eleurent/rl-agents/actions/workflows/build.yml)\n\n* [Installation](#installation)\n* [Usage](#usage)\n* [Monitoring](#monitoring)\n* [Agents](#agents)\n  * Planning\n    * [Value Iteration](#vi-value-iteration)\n    * [Cross-Entropy Method](#cem-cross-entropy-method)\n    * Monte-Carlo Tree Search\n      * [Upper Confidence Trees](#uct-upper-confidence-bounds-applied-to-trees)\n      * [Deterministic Optimistic Planning](#opd-optimistic-planning-for-deterministic-systems)\n      * [Open Loop Optimistic Planning](#olop-open-loop-optimistic-planning)\n      * [Trailblazer](#trailblazer)\n      * [PlaTγPOOS](#plaTγpoos)\n  * Safe planning\n    * [Robust Value Iteration](#rvi-robust-value-iteration)\n    * [Discrete Robust Optimistic Planning](#drop-discrete-robust-optimistic-planning)\n    * [Interval-based Robust Planning](#irp-interval-based-robust-planning)\n  * 
Value-based\n    * [Deep Q-Network](#dqn-deep-q-network)\n    * [Fitted-Q](#ftq-fitted-q)\n  * Safe value-based\n    * [Budgeted Fitted-Q](#bftq-budgeted-fitted-q)\n* [Citing](#citing) \n\n# Installation\n\n`pip install --user git+https://github.com/eleurent/rl-agents`\n\n\n# Usage\n\nMost experiments can be started by moving to the `scripts` directory (`cd scripts`) and running `python experiments.py`:\n\n```\nUsage:\n  experiments evaluate \u003cenvironment\u003e \u003cagent\u003e (--train|--test)\n                                             [--episodes \u003ccount\u003e]\n                                             [--seed \u003cstr\u003e]\n                                             [--analyze]\n  experiments benchmark \u003cbenchmark\u003e (--train|--test)\n                                    [--processes \u003ccount\u003e]\n                                    [--episodes \u003ccount\u003e]\n                                    [--seed \u003cstr\u003e]\n  experiments -h | --help\n\nOptions:\n  -h --help            Show this screen.\n  --analyze            Automatically analyze the experiment results.\n  --episodes \u003ccount\u003e   Number of episodes [default: 5].\n  --processes \u003ccount\u003e  Number of running processes [default: 4].\n  --seed \u003cstr\u003e         Seed the environments and agents.\n  --train              Train the agent.\n  --test               Test the agent.\n```\n\nThe `evaluate` command evaluates a given agent on a given environment. 
For instance,\n\n```bash\n# Train a DQN agent on the CartPole-v0 environment\n$ python3 experiments.py evaluate configs/CartPoleEnv/env.json configs/CartPoleEnv/DQNAgent.json --train --episodes=200\n```\n\nEvery agent interacts with the environment following a standard interface:\n```python\naction = agent.act(state)\nnext_state, reward, done, info = env.step(action)\nagent.record(state, action, reward, next_state, done, info)\n```\n\nThe environments are described by their [gym](https://github.com/openai/gym) `id`, and module for registration.\n```JSON\n{\n    \"id\": \"CartPole-v0\",\n    \"import_module\": \"gym\"\n}\n```\n\nAnd the agents by their class, and configuration dictionary.\n\n```JSON\n{\n    \"__class__\": \"\u003cclass 'rl_agents.agents.deep_q_network.pytorch.DQNAgent'\u003e\",\n    \"model\": {\n        \"type\": \"MultiLayerPerceptron\",\n        \"layers\": [512, 512]\n    },\n    \"gamma\": 0.99,\n    \"n_steps\": 1,\n    \"batch_size\": 32,\n    \"memory_capacity\": 50000,\n    \"target_update\": 1,\n    \"exploration\": {\n        \"method\": \"EpsilonGreedy\",\n        \"tau\": 50000,\n        \"temperature\": 1.0,\n        \"final_temperature\": 0.1\n    }\n}\n```\n\nIf keys are missing from these configurations, values in `agent.default_config()` will be used instead.\n\nFinally, a batch of experiments can be scheduled in a _benchmark_.\nAll experiments are then executed in parallel on several processes.\n\n```bash\n# Run a benchmark of several agents interacting with environments\n$ python3 experiments.py benchmark cartpole_benchmark.json --test --processes=4\n```\n\nA benchmark configuration file contains a list of environment configurations and a list of agent configurations.\n\n```JSON\n{\n    \"environments\": [\"envs/cartpole.json\"],\n    \"agents\": [\"agents/dqn.json\", \"agents/mcts.json\"]\n}\n```\n\n# Monitoring\n\nThere are several tools available to monitor the agent performances:\n* *Run metadata*: for the sake of 
reproducibility, the environment and agent configurations used for the run are merged and saved to a `metadata.*.json` file.\n* [*Gym Monitor*](https://github.com/openai/gym/blob/master/gym/wrappers/monitor.py): the main statistics (episode rewards, lengths, seeds) of each run are logged to an `episode_batch.*.stats.json` file. They can be automatically visualized by running `scripts/analyze.py`.\n* [*Logging*](https://docs.python.org/3/howto/logging.html): agents can send messages through the standard Python logging library. By default, all messages with log level _INFO_ are saved to a `logging.*.log` file. Run `scripts/experiments.py` with the `--verbose` option to save with log level _DEBUG_.\n* [*Tensorboard*](https://github.com/lanpa/tensorboardX): by default, a TensorBoard writer records information about useful scalars, images and model graphs to the run directory. It can be visualized by running:\n```tensorboard --logdir \u003cpath-to-runs-dir\u003e```\n\n# Agents\n\nThe following agents are currently implemented:\n\n## Planning\n\n### [`VI` Value Iteration](rl_agents/agents/dynamic_programming/value_iteration.py)\n\nPerforms Value Iteration to compute the state-action value, and acts greedily with respect to it.\n\nOnly compatible with [finite-mdp](https://github.com/eleurent/finite-mdp) environments, or environments that provide an `env.to_finite_mdp()` conversion method.\n\nReference: [Dynamic Programming](https://press.princeton.edu/titles/9234.html), Bellman R., Princeton University Press (1957).\n\n### [`CEM` Cross-Entropy Method](rl_agents/agents/cross_entropy_method/cem.py)\n\nA sampling-based planning algorithm, in which sequences of actions are drawn from a prior Gaussian distribution. This distribution is iteratively bootstrapped by minimizing its cross-entropy to a target distribution approximated by the top-k candidates.\n\nOnly compatible with continuous action spaces. The environment is used as an oracle dynamics and reward model. 
\n\nReference: [A Tutorial on the Cross-Entropy Method](http://web.mit.edu/6.454/www/www_fall_2003/gew/CEtutorial.pdf), De Boer P-T., Kroese D.P., Mannor S. and Rubinstein R.Y. (2005).\n\n### `MCTS` Monte-Carlo Tree Search\n\nA world transition model is leveraged for trajectory search. A look-ahead tree is expanded so as to explore the trajectory space and quickly focus around the most promising moves.\n\nReferences:\n* [Efficient Selectivity and Backup Operators in Monte-Carlo Tree Search](https://hal.inria.fr/inria-00116992/document), Coulom R. (2006).\n\n#### [`UCT` Upper Confidence bounds applied to Trees](rl_agents/agents/tree_search/mcts.py)\nThe tree is traversed by iteratively applying an optimistic selection rule at each depth, and the value at leaves is estimated by sampling.\nEmpirical evidence shows that this popular algorithm performs well in many applications, but it has been proven theoretically to perform much worse (doubly-exponential) than uniform planning on some problems.\n\nReferences:\n* [Bandit based Monte-Carlo Planning](http://ggp.stanford.edu/readings/uct.pdf), Kocsis L., Szepesvári C. (2006).\n* [Bandit Algorithms for Tree Search](https://hal.inria.fr/inria-00136198v2), Coquelin P-A., Munos R. (2007).\n\n#### [`OPD` Optimistic Planning for Deterministic systems](rl_agents/agents/tree_search/deterministic.py)\nThis algorithm is tailored for systems with deterministic dynamics and rewards.\nIt exploits the reward structure to achieve a polynomial rate on regret, and behaves efficiently in numerical experiments with dense rewards.\n\nReference: [Optimistic Planning for Deterministic Systems](https://hal.inria.fr/hal-00830182), Hren J., Munos R. (2008).\n\n#### [`OLOP` Open Loop Optimistic Planning](rl_agents/agents/tree_search/olop.py)\n\nReferences: \n* [Open Loop Optimistic Planning](http://sbubeck.com/COLT10_BM.pdf), Bubeck S., Munos R. 
(2010).\n* [Practical Open-Loop Optimistic Planning](https://arxiv.org/abs/1904.04700), Leurent E., Maillard O.-A. (2019).\n\n#### [Trailblazer](rl_agents/agents/tree_search/trailblazer.py)\n\nReference: [Blazing the trails before beating the path: Sample-efficient Monte-Carlo planning](http://researchers.lille.inria.fr/~valko/hp/serve.php?what=publications/grill2016blazing.pdf), Grill J. B., Valko M., Munos R. (2017).\n\n#### [PlaTγPOOS](rl_agents/agents/tree_search/platypoos.py)\n\nReference: [Scale-free adaptive planning for deterministic dynamics \u0026 discounted rewards](http://researchers.lille.inria.fr/~valko/hp/publications/bartlett2019scale-free.pdf), Bartlett P., Gabillon V., Healey J., Valko M. (2019).\n\n\n## Safe planning\n\n### [`RVI` Robust Value Iteration](rl_agents/agents/dynamic_programming/robust_value_iteration.py)\n\nA list of possible [finite-mdp](https://github.com/eleurent/finite-mdp) models is provided in the agent configuration. The MDP ambiguity set is constrained to be rectangular: different models can be selected at every transition. The corresponding robust state-action value is computed so as to maximize the worst-case total reward.\n\nReferences:\n* [Robust Control of Markov Decision Processes with Uncertain Transition Matrices](https://people.eecs.berkeley.edu/~elghaoui/pdffiles/rmdp_erl.pdf), Nilim A., El Ghaoui L. (2005).\n* [Robust Dynamic Programming](http://www.corc.ieor.columbia.edu/reports/techreports/tr-2002-07.pdf), Iyengar G. (2005).\n* [Robust Markov Decision Processes](http://www.optimization-online.org/DB_FILE/2010/05/2610.pdf), Wiesemann W. et al. 
(2012).\n\n### [`DROP` Discrete Robust Optimistic Planning](rl_agents/agents/robust/robust.py)\n\nThe MDP ambiguity set is assumed to be finite, and is constructed from a list of modifiers to the true environment.\nThe corresponding robust value is approximately computed by [Deterministic Optimistic Planning](#opd-optimistic-planning-for-deterministic-systems) so as to maximize the worst-case total reward.\n\nReferences:\n* [Approximate Robust Control of Uncertain Dynamical Systems](https://arxiv.org/abs/1903.00220), Leurent E. et al. (2018).\n\n### [`IRP` Interval-based Robust Planning](rl_agents/agents/robust/robust.py)\n\nWe assume that the MDP is a parametrized dynamical system, whose parameter is uncertain and lies in a continuous ambiguity set. We use interval prediction to compute the set of states that can be reached at any time _t_, given that uncertainty, and leverage it to evaluate and improve a robust policy.\n\nIf the system is Linear Parameter-Varying (LPV) with polytopic uncertainty, a fast and stable interval predictor can be designed. Otherwise, sampling-based approaches can be used instead, with an increased computational load.\n\nReferences:\n* [Approximate Robust Control of Uncertain Dynamical Systems](https://arxiv.org/abs/1903.00220), Leurent E. et al. (2018).\n* [Interval Prediction for Continuous-Time Systems with Parametric Uncertainties](https://arxiv.org/abs/1904.04727), Leurent E. et al. (2019).\n\n## Value-based\n\n### [`DQN` Deep Q-Network](rl_agents/agents/deep_q_network)\n\nA neural-network model is used to estimate the state-action value function and produce a greedy optimal policy.\n\nImplemented variants:\n* Double DQN\n* Dueling architecture\n* N-step targets\n\nReferences:\n* [Playing Atari with Deep Reinforcement Learning](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf), Mnih V. et al. (2013).\n* [Deep Reinforcement Learning with Double Q-learning](https://arxiv.org/abs/1509.06461), van Hasselt H. et al. 
(2015).\n* [Dueling Network Architectures for Deep Reinforcement Learning](https://arxiv.org/abs/1511.06581), Wang Z. et al. (2015).\n\n### [`FTQ` Fitted-Q](rl_agents/agents/fitted_q)\n\nA Q-function model is trained by performing each step of Value Iteration as a supervised learning procedure applied to a batch\nof transitions covering most of the state-action space.\n\nReference: [Tree-Based Batch Mode Reinforcement Learning](http://www.jmlr.org/papers/volume6/ernst05a/ernst05a.pdf), Ernst D. et al. (2005).\n\n## Safe value-based\n\n### [`BFTQ` Budgeted Fitted-Q](rl_agents/agents/budgeted_ftq)\n\nAn adaptation of **`FTQ`** to the budgeted setting: we maximise the expected reward _r_ of a policy _π_ under the constraint that an expected cost _c_ remains under a given budget _β_.\nThe policy _π(a | s, β)_ is conditioned on this cost budget _β_, which can be changed online.\n\nTo that end, the Q-function model is trained to predict both the expected reward _Qr_ and the expected cost _Qc_ of the optimal constrained policy _π_. \n\nThis agent can only be used with environments that provide a cost signal in their `info` field:\n```\n\u003e\u003e\u003e obs, reward, done, info = env.step(action)\n\u003e\u003e\u003e info\n{'cost': 1.0}\n``` \n\nReference: [Budgeted Reinforcement Learning in Continuous State Space](https://arxiv.org/abs/1903.01004), Carrara N., Leurent E., Laroche R., Urvoy T., Maillard O-A., Pietquin O. 
(2019).\n\n# Citing\n\nIf you use this project in your work, please consider citing it with:\n```\n@misc{rl-agents,\n  author = {Leurent, Edouard},\n  title = {rl-agents: Implementations of Reinforcement Learning algorithms},\n  year = {2018},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https://github.com/eleurent/rl-agents}},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleurent%2Frl-agents","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Feleurent%2Frl-agents","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Feleurent%2Frl-agents/lists"}