{"id":24131594,"url":"https://github.com/deepbiolab/drl","last_synced_at":"2026-06-10T19:31:18.619Z","repository":{"id":271411846,"uuid":"913212789","full_name":"deepbiolab/drl","owner":"deepbiolab","description":"Implementation of deep reinforcement learning","archived":false,"fork":false,"pushed_at":"2025-02-16T04:48:50.000Z","size":32199,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-03-01T07:17:05.654Z","etag":null,"topics":["advantage-actor-critic","alphazero","cross-entropy-method","deep-deterministic-policy-gradient","dqn","dueling-ddqn","hill-climbing","mc-control","monte-carlo-methods","policy-based-method","policy-gradient","ppo","prioritized-dqn","q-learning","reinforce","sarsa","temporal-difference","tile-coding","value-based-methods"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/deepbiolab.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2025-01-07T08:49:18.000Z","updated_at":"2025-02-16T04:48:53.000Z","dependencies_parsed_at":"2025-02-02T20:33:11.830Z","dependency_job_id":null,"html_url":"https://github.com/deepbiolab/drl","commit_stats":null,"previous_names":["deepbiolab/drl","deepbiolab/drl-zero-to-hero"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/deepbiolab/drl","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepbiolab%2Fdrl","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepbiolab%2Fdrl/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepbiolab%2Fdrl/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepbiolab%2Fdrl/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/deepbiolab","download_url":"https://codeload.github.com/deepbiolab/drl/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/deepbiolab%2Fdrl/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":34168086,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-26T15:22:16.424Z","status":"online","status_checked_at":"2026-06-10T02:00:07.152Z","response_time":89,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["advantage-actor-critic","alphazero","cross-entropy-method","deep-deterministic-policy-gradient","dqn","dueling-ddqn","hill-climbing","mc-control","monte-carlo-methods","policy-based-method","policy-gradient","ppo","prioritized-dqn","q-learning","reinforce","sarsa","temporal-difference","tile-coding","value-based-methods"],"created_at":"2025-01-11T21:18:25.988Z","updated_at":"2026-06-10T19:31:18.607Z","avatar_url":"https://github.com/deepbiolab.png","language":"Jupyter Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"\n# Deep Reinforcement Learning (DRL) Implementation\n\n\u003cdiv style=\"text-align: center;\"\u003e\n    \u003cimg src=\"./assets/drl.png\" alt=\"Mountain Car Environment\" width=\"90%\"\u003e\n\u003c/div\u003e\n\nThis repository contains implementations of various deep reinforcement learning algorithms, focusing on fundamental concepts and practical applications.\n\n## Project Structure\n\n\u003e It is recommended to follow the material in the given order.\n\n### [Model Free Learning](./model-free-learning/introduction.md)\n\n#### Discrete State Problems\n\n##### Monte Carlo Methods\nImplementation of Monte Carlo (MC) algorithms using the Blackjack environment as an example:\n\n1. **[MC Prediction](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**\n   - First-visit MC prediction for estimating action-value function\n   - Policy evaluation with stochastic limit policy\n\n2. **[MC Control with Incremental Mean](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**\n   - GLIE (Greedy in the Limit with Infinite Exploration)\n   - Epsilon-greedy policy implementation\n   - Incremental mean updates\n\n3. **[MC Control with Constant-alpha](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**\n   - Fixed learning rate approach\n   - Enhanced control over update process\n\n##### Temporal Difference Methods\nImplementation of TD algorithms on both Blackjack and CliffWalking environments:\n\n1. **[SARSA (On-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**\n   - State-Action-Reward-State-Action\n   - On-policy learning with epsilon-greedy exploration\n   - Episode-based updates with TD(0)\n\n2. **[Q-Learning (Off-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**\n   - Also known as SARSA-Max\n   - Off-policy learning using maximum action values\n   - Optimal action-value function approximation\n\n3. **[Expected SARSA](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**\n   - Extension of SARSA using expected values\n   - More stable learning through action probability weighting\n   - Combines benefits of SARSA and Q-Learning\n\n\n#### Continuous State Problems\n##### Uniform Discretization\n\n1. **[Q-Learning (Off-Policy TD Control)](model-free-learning/continuous-state-problems/uniform-discretization/discretization_mountaincar.ipynb)**\n   - Q-Learning to the MountainCar environment using discretized state spaces\n   - State space discretization through uniform grid representation for continuous variables\n   - Exploration of the impact of discretization granularity on learning performance\n\n##### Tile Coding Discretization\n\n1. **[Q-Learning (Off-Policy TD Control) with Tile Coding](model-free-learning/continuous-state-problems/tiling-discretization/tiling_discretization_acrobot.ipynb)**\n   - Q-Learning applied to the Acrobot environment using tile coding for state space representation\n   - Tile coding as a method to efficiently represent continuous state spaces by overlapping feature grids\n\n\n### [Model Based Learning](./model-based-learning/introduction.md)\n\n#### Value Based Iteration\n\n##### Vanilla Deep Q Network\n\n1. **[Deep Q Network with Experience Replay (DQN)](./model-based-learning/value-based/vanilla-dqn/dqn_lunarlander.ipynb)**\n   - A neural network is used to approximate the Q-value function $Q(s, a)$.\n   - Breaks the temporal correlation of samples by randomly sampling from a replay buffer.\n   - Periodically updates the target network's parameters to reduce instability in target value estimation.\n\n##### Variants of Deep Q Network\n\n1. **[Double Deep Q Network with Experience Replay (DDQN)](./model-based-learning/value-based/variants-dqn/double_dqn_lunarlander.ipynb)**\n   - Addresses the overestimation bias in vanilla DQN by decoupling action selection and evaluation.\n   - This decoupling helps stabilize training and improves the accuracy of Q-value estimates.\n2. **[Prioritized Double Deep Q Network (Prioritized DDQN)](./model-based-learning/value-based/variants-dqn/prioritized_ddqn_lunarlander.ipynb)**  \n   - Enhances the efficiency of experience replay by prioritizing transitions with higher temporal-difference (TD) errors.  \n   - Combines the stability of Double DQN with prioritized sampling to focus on more informative experiences.\n3. **[Dueling Double Deep Q Network (Dueling DDQN)](./model-based-learning/value-based/variants-dqn/dueling_ddqn_lunarlander.ipynb)**\n   - Introduces a new architecture that separates the estimation of **state value** $V(s)$ and **advantage function** $A(s, a)$\n   - Improves learning efficiency by explicitly modeling the state value $V(s)$, which captures the overall \"desirability\" of actions \n   - Works particularly well in environments where some actions are redundant or where the state value $V(s)$ plays a dominant role in decision-making.\n\n4. **[Noisy Dueling Prioritized Double Deep Q-Network (Noisy DDQN)](./model-based-learning/value-based/variants-dqn/noisy_dueling_ddqn_lunarlander.ipynb)**\n   - Combines **Noisy Networks**, **Dueling Architecture**, **Prioritized Experience Replay**, and **Double Q-Learning** into a single framework.\n   - **Noisy Networks** replace ε-greedy exploration with parameterized noise, enabling more efficient exploration by learning stochastic policies.\n   - **Dueling Architecture** separates the estimation of **state value** $V(s)$ and **advantage function** $A(s, a)$, improving learning efficiency.\n   - **Prioritized Experience Replay** focuses on transitions with higher temporal-difference (TD) errors, enhancing sample efficiency.\n   - **Double Q-Learning** reduces overestimation bias by decoupling action selection from evaluation.\n   - This combination significantly improves convergence speed and stability, particularly in environments with sparse or noisy rewards.\n\n##### Asynchronous Deep Q Network\n\n1. **[Asynchronous One Step Deep Q Network without Experience Replay (AsyncDQN)](./model-based-learning/value-based/async-dqn/asynchronous_dqn_lunarlander.ipynb)**\n   - Eliminates the dependency on experience replay by using asynchronous parallel processes to interact with the environment and update the shared Q-network.\n   - Achieves significant speedup by leveraging multiple CPU cores, making it highly efficient even without GPU acceleration.\n   - Compared to Dueling DDQN (22 minutes), AsyncDQN completes training in just 4.29 minutes on CPU, achieving a 5x speedup.\n\n2. **[Asynchronous One Step Deep SARSA without Experience Replay (AsyncDSARSA)](./model-based-learning/value-based/async-dqn/asynchronous_one_step_deep_sarsa_lunarlander.py)**\n   - Utilizes same asynchronous parallel processes to update a shared Q-network without the need for experience replay.  \n   - Employs a one-step SARSA—on-policy update rule that leverages the next selected action to enhance stability and reduce overestimation (basically same as AsyncDQN).\n\n3. **[Asynchronous N-Step Deep Q Network without Experience Replay (AsyncNDQN)](./model-based-learning/value-based/async-dqn/asynchronous_n_step_dqn_lunarlander.ipynb)**\n   - Extends AsyncDQN by incorporating N-step returns, which balances the trade-off between bias (shorter N) and variance (longer N).\n   - N-step returns accelerate the propagation of rewards across states, enabling faster convergence compared to one-step updates.\n   - Like AsyncDQN, it eliminates the dependency on experience replay, using asynchronous parallel processes to update the shared Q-network.\n\n\n#### Policy Based Iteration\n\n##### Black Box Optimization\n\n1. **[Hill Climbing](./model-based-learning/policy-based/black-box-optimization/hill-climbing/hill_climbing.ipynb)**  \n   - A simple optimization technique that iteratively improves the policy by making small adjustments to the parameters.  \n   - Relies on evaluating the performance of the policy after each adjustment and keeping the changes that improve performance.  \n   - Works well in low-dimensional problems but can struggle with local optima and high-dimensional spaces.  \n\n2. **[Cross Entropy Method (CEM)](./model-based-learning/policy-based/black-box-optimization/cross-entropy/cross_entropy_method.ipynb)**  \n   - A probabilistic optimization algorithm that searches for the best policy by iteratively sampling and updating a distribution over policy parameters.  \n   - Particularly effective in high-dimensional or continuous action spaces due to its ability to focus on promising regions of the parameter space.  \n   - Often used as a baseline for policy optimization in reinforcement learning.\n\n##### Policy Gradient Methods\n1. **[REINFORCE](./model-based-learning/policy-based/policy-gradient-methods/vanilla-reinforce/reinforce_with_discrete_actions.ipynb)**\n   - A foundational policy gradient algorithm that directly optimizes the policy by maximizing the expected cumulative reward.\n   - Uses Monte Carlo sampling to estimate the policy gradient.\n   - Updates the policy parameters based on the gradient of the expected reward with respect to the policy.\n\n2. **[Improved REINFORCE](./model-based-learning/policy-based/policy-gradient-methods/improved-reinforce/improved_reinforce.ipynb)**\n   - Parallel collection of multiple trajectories and allows the policy gradient to be estimated by averaging across different trajectories, leading to more stable updates.\n   - Rewards are normalized to stabilize learning and ensure consistent gradient step sizes.\n   - Credit assignment is improved by considering only the future rewards for each action and reduces gradient noise without affecting the averaged gradient, leading to faster and more stable training.\n\n3. **[Proximal Policy Optimization (PPO)](./model-based-learning/policy-based/policy-gradient-methods/proximal-policy-optimization/ppo.ipynb)**\n   - Introduces a clipped surrogate objective to ensure stable updates by preventing large changes in the policy.\n   - Balances exploration and exploitation by limiting the policy ratio deviation within a trust region.\n   - Combines the simplicity of REINFORCE with the stability of Trust Region Policy Optimization (TRPO), making it efficient and robust for large-scale problems.\n\n\n#### Actor Critic Methods\n\n1. **[A2C](./model-based-learning/actor-critic/advantage-actor-critic/a2c.ipynb)**\n   - A synchronous version of the Advantage Actor-Critic (A3C) algorithm.\n   - Uses multiple parallel environments to collect trajectories and updates the policy in a synchronized manner.\n   - Combines the benefits of policy-based and value-based methods by using a shared network to estimate both the policy (actor) and the value function (critic).\n\n2. **[A3C](./model-based-learning/actor-critic/async-advantage-actor-critic/a3c.py)**\n   - An asynchronous version of the Advantage Actor-Critic algorithm.  \n   - Multiple agents interact with independent environments asynchronously, allowing faster updates and better exploration of the state space.  \n   - Each agent maintains its own local network, which is periodically synchronized with a global network. \n\n3. **[DDPG](./model-based-learning/actor-critic/deep-deterministic-policy-gradient/ddpg.ipynb)**\n   - A off-policy Actor-Critic algorithm designed for **continuous action spaces**.\n   - Combines the deterministic policy gradient with Q-learning to optimize a parameterized policy.\n   - Uses **two networks**:\n     - **Actor**: Learns a deterministic policy $\\mu(s | \\theta^\\mu)$ to select actions.\n     - **Critic**: Estimates the action-value function $Q(s, a | \\theta^Q)$.\n   - Employs **target networks** for both Actor and Critic to stabilize training by slowly updating the target parameters:\n   - Leverages **experience replay** to store transitions and break correlations between sequential data.\n   - Introduces **exploration noise** (e.g., Ornstein-Uhlenbeck process) to encourage exploration in continuous action spaces.\n\n\n\n## Environments Brief in This Project\n\n- **[Blackjack](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py)**: Classic card game environment for policy learning\n- **[CliffWalking](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/cliffwalking.py)**: Grid-world navigation task with negative rewards and cliff hazards\n- **[Taxi-v3](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/taxi.py)**: Grid-world transportation task where an agent learns to efficiently navigate, pick up and deliver passengers to designated locations while optimizing rewards.\n- **[MountainCar](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/mountain_car.py)**: Continuous control task where an underpowered car must learn to build momentum by moving back and forth to overcome a steep hill and reach the goal position.\n- **[Acrobot](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/acrobot.py)**: A two-link robotic arm environment where the goal is to swing the end of the second link above a target height by applying torque at the actuated joint. It challenges agents to solve nonlinear dynamics and coordinate the motion of linked components efficiently.\n- **[LunarLander](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/lunar_lander.py)**: A physics-based environment where an agent controls a lunar lander to safely land on a designated pad. The task involves managing fuel consumption, balancing thrust, and handling the dynamics of gravity and inertia.\n- **[PongDeterministic-v4](https://ale.farama.org/environments/pong/)**: A classic Atari environment where the agent learns to play Pong, a two-player game where the objective is to hit the ball past the opponent's paddle. The Deterministic-v4 variant ensures fixed frame-skipping, making the environment faster and more predictable for training. This environment is commonly used to benchmark reinforcement learning algorithms, especially for discrete action spaces.\n- **[Pendulum-v1](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/pendulum.py)**: A classic control task where the goal is to balance a pendulum by applying torque at the actuated joint. The environment challenges agents to solve nonlinear dynamics and maintain equilibrium in a continuous state and action space.\n\n\n## Requirements\n\nCreate (and activate) a new environment with `Python 3.10` and install [Pytorch](https://pytorch.org/get-started/locally/) with version `PyTorch 2.5.1`\n\n```bash\nconda create -n DRL python=3.10\nconda activate DRL\n```\n\n\n## Installation\n\n1. Clone the repository:\n```bash\ngit clone https://github.com/deepbiolab/drl.git\ncd drl\n```\n\n2. Install dependencies:\n```bash\npip install -r requirements.txt\n```\n\n## Usage\n\n### Exmaple: Monte Carlo Methods\n\nRun the Monte Carlo implementation:\n```bash\ncd monte-carlo-methods\npython monte_carlo.py\n```\nOr explore the detailed notebook:\n\n## Future Work\n\n- Comprehensive implementations of fundamental RL algorithms\n   - [x] [MC Control (Monte-Carlo Control)](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [MC Control with Incremental Mean](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [MC Control with Constant-alpha](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [SARSA](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [SARSA Max (Q-Learning)](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [Expected SARSA](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [Q-learning with Uniform Discretization](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [Q-learning with Tile Coding Discretization](http://incompleteideas.net/book/RLbook2020.pdf)\n   - [x] [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)\n   - [x] [DDQN](https://arxiv.org/pdf/1509.06461)\n   - [x] [Prioritized DDQN](https://arxiv.org/pdf/1511.05952)\n   - [x] [Dueling DDQN](https://arxiv.org/pdf/1511.06581)\n   - [x] [Async One Step DQN](https://arxiv.org/pdf/1602.01783)\n   - [x] [Async N Step DQN](https://arxiv.org/pdf/1602.01783)\n   - [x] [Async One Step SARSA](https://arxiv.org/pdf/1602.01783)\n   - [ ] [Distributional DQN](https://arxiv.org/pdf/1707.06887)\n   - [x] [Noisy DQN](https://arxiv.org/pdf/1706.10295)\n   - [ ] [Rainbow](https://arxiv.org/pdf/1710.02298)\n   - [x] [Hill Climbing](https://en.wikipedia.org/wiki/Hill_climbing)\n   - [x] [Cross Entropy Method](https://en.wikipedia.org/wiki/Cross-entropy_method)\n   - [x] [REINFORCE](https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)\n   - [x] [PPO](https://arxiv.org/pdf/1707.06347)\n   - [x] [A3C](https://arxiv.org/pdf/1602.01783)\n   - [x] [A2C](https://arxiv.org/pdf/1602.01783)\n   - [x] [DDPG](https://arxiv.org/pdf/1509.02971)\n\n    \n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepbiolab%2Fdrl","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fdeepbiolab%2Fdrl","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fdeepbiolab%2Fdrl/lists"}