https://github.com/deepbiolab/drl
Implementation of deep reinforcement learning
- Host: GitHub
- URL: https://github.com/deepbiolab/drl
- Owner: deepbiolab
- License: MIT
- Created: 2025-01-07T08:49:18.000Z (4 months ago)
- Default Branch: main
- Last Pushed: 2025-02-12T03:54:29.000Z (3 months ago)
- Last Synced: 2025-02-12T04:37:27.934Z (3 months ago)
- Topics: advantage-actor-critic, alphazero, cross-entropy-method, deep-deterministic-policy-gradient, dqn, dueling-ddqn, hill-climbing, mc-control, monte-carlo-methods, policy-based-method, policy-gradient, ppo, prioritized-dqn, q-learning, reinforce, sarsa, temporal-difference, tile-coding, value-based-methods
- Language: Jupyter Notebook
- Homepage:
- Size: 27.3 MB
- Stars: 0
- Watchers: 1
- Forks: 0
- Open Issues: 0
Metadata Files:
- Readme: README.md
- License: LICENSE
# Deep Reinforcement Learning (DRL) Implementation
This repository contains implementations of various deep reinforcement learning algorithms, focusing on fundamental concepts and practical applications.
## Project Structure
> It is recommended to follow the material in the given order.
### [Model Free Learning](./model-free-learning/introduction.md)
#### Discrete State Problems
##### Monte Carlo Methods
Implementation of Monte Carlo (MC) algorithms using the Blackjack environment as an example:

1. **[MC Prediction](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- First-visit MC prediction for estimating action-value function
- Policy evaluation with stochastic limit policy
2. **[MC Control with Incremental Mean](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- GLIE (Greedy in the Limit with Infinite Exploration)
- Epsilon-greedy policy implementation
- Incremental mean updates
3. **[MC Control with Constant-alpha](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- Fixed learning rate approach
- Enhanced control over update process
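For orientation, here is a minimal sketch of the constant-alpha, epsilon-greedy MC control loop described above, assuming the Gymnasium `Blackjack-v1` environment; the function names and hyperparameters are illustrative, not the notebook's exact code.

```python
import numpy as np
from collections import defaultdict
import gymnasium as gym

def epsilon_greedy(Q, state, n_actions, eps):
    """Pick the greedy action with probability 1 - eps, otherwise explore uniformly."""
    if np.random.rand() > eps:
        return int(np.argmax(Q[state]))
    return np.random.randint(n_actions)

def mc_control_constant_alpha(env, n_episodes=50_000, alpha=0.02, gamma=1.0):
    """Every-visit, constant-alpha MC control with an epsilon-greedy behavior policy."""
    n_actions = env.action_space.n
    Q = defaultdict(lambda: np.zeros(n_actions))
    for i in range(1, n_episodes + 1):
        eps = max(1.0 / i, 0.05)                      # GLIE-style decaying epsilon
        state, _ = env.reset()
        episode, done = [], False
        while not done:
            action = epsilon_greedy(Q, state, n_actions, eps)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated
        G = 0.0
        for state, action, reward in reversed(episode):  # walk the episode backwards
            G = gamma * G + reward
            Q[state][action] += alpha * (G - Q[state][action])  # constant-alpha update
    return Q

# Q = mc_control_constant_alpha(gym.make("Blackjack-v1"))
```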
##### Temporal Difference Methods

Implementation of TD algorithms on both Blackjack and CliffWalking environments:

1. **[SARSA (On-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- State-Action-Reward-State-Action
- On-policy learning with epsilon-greedy exploration
- Episode-based updates with TD(0)
2. **[Q-Learning (Off-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- Also known as SARSA-Max
- Off-policy learning using maximum action values
- Optimal action-value function approximation
3. **[Expected SARSA](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- Extension of SARSA using expected values
- More stable learning through action probability weighting
- Combines benefits of SARSA and Q-Learning
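The three TD(0) algorithms above differ only in the bootstrapped target they use. A small comparison sketch, assuming a tabular `Q` stored as a NumPy array and an epsilon-greedy behavior policy (names are illustrative):

```python
import numpy as np

def td_targets(Q, next_state, next_action, reward, gamma, eps):
    """Compare the bootstrapped targets used by SARSA, Q-Learning, and Expected SARSA."""
    n_actions = Q.shape[1]

    # SARSA: bootstrap from the action actually selected next (on-policy)
    sarsa_target = reward + gamma * Q[next_state, next_action]

    # Q-Learning (SARSA-Max): bootstrap from the greedy action (off-policy)
    q_learning_target = reward + gamma * np.max(Q[next_state])

    # Expected SARSA: bootstrap from the expectation under the eps-greedy policy
    probs = np.full(n_actions, eps / n_actions)
    probs[np.argmax(Q[next_state])] += 1.0 - eps
    expected_sarsa_target = reward + gamma * np.dot(probs, Q[next_state])

    return sarsa_target, q_learning_target, expected_sarsa_target

# In all three cases the tabular update is:
#   Q[state, action] += alpha * (target - Q[state, action])
```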
#### Continuous State Problems

##### Uniform Discretization

1. **[Q-Learning (Off-Policy TD Control)](model-free-learning/continuous-state-problems/uniform-discretization/discretization_mountaincar.ipynb)**
- Q-Learning applied to the MountainCar environment using a discretized state space
- State space discretization through uniform grid representation for continuous variables
- Exploration of the impact of discretization granularity on learning performance
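A sketch of the kind of uniform-grid helper this approach relies on (illustrative helper names; MountainCar's state bounds are position in [-1.2, 0.6] and velocity in [-0.07, 0.07]):

```python
import numpy as np

def create_uniform_grid(low, high, bins):
    """One array of split points per state dimension, excluding the outer edges."""
    return [np.linspace(l, h, b + 1)[1:-1] for l, h, b in zip(low, high, bins)]

def discretize(sample, grid):
    """Map a continuous sample to a tuple of bin indices usable as a Q-table key."""
    return tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))

# Example for MountainCar's 2-D state (position, velocity), 10 bins per dimension:
grid = create_uniform_grid(low=[-1.2, -0.07], high=[0.6, 0.07], bins=(10, 10))
print(discretize((-0.5, 0.01), grid))   # -> (3, 5)
```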
##### Tile Coding Discretization

1. **[Q-Learning (Off-Policy TD Control) with Tile Coding](model-free-learning/continuous-state-problems/tiling-discretization/tiling_discretization_acrobot.ipynb)**
- Q-Learning applied to the Acrobot environment using tile coding for state space representation
- Tile coding as a method to efficiently represent continuous state spaces by overlapping feature grids
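In case it helps to see the idea in code, a hedged sketch of overlapping-grid tile coding (illustrative helpers, not the notebook's exact implementation):

```python
import numpy as np

def build_tilings(low, high, bins=10, n_tilings=8):
    """Create several uniform grids, each shifted by a fraction of one tile width."""
    low, high = np.asarray(low, dtype=float), np.asarray(high, dtype=float)
    tile_width = (high - low) / bins
    tilings = []
    for t in range(n_tilings):
        offsets = tile_width * t / n_tilings          # shift each tiling a little further
        tilings.append([np.linspace(l, h, bins + 1)[1:-1] + o
                        for l, h, o in zip(low, high, offsets)])
    return tilings

def tile_encode(sample, tilings):
    """Encode a continuous sample as one tuple of bin indices per tiling."""
    return [tuple(int(np.digitize(s, grid)) for s, grid in zip(sample, tiling))
            for tiling in tilings]

# tilings = build_tilings(env.observation_space.low, env.observation_space.high)
# coords = tile_encode(state, tilings)  # Q-values are then summed or averaged over the tilings
```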
### [Model Based Learning](./model-based-learning/introduction.md)

#### Value Based Iteration
##### Vanilla Deep Q Network
1. **[Deep Q Network with Experience Replay (DQN)](./model-based-learning/value-based/vanilla-dqn/dqn_lunarlander.ipynb)**
- A neural network is used to approximate the Q-value function $Q(s, a)$.
- Breaks the temporal correlation of samples by randomly sampling from a replay buffer.
- Periodically updates the target network's parameters to reduce instability in target value estimation.
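A condensed sketch of the learning step these bullets describe, in PyTorch (tensor shapes and names are illustrative assumptions; the notebook's code may be organized differently):

```python
import torch
import torch.nn.functional as F

def dqn_update(q_net, target_net, optimizer, batch, gamma=0.99):
    """One DQN step: TD target from the frozen target network, MSE loss on Q(s, a)."""
    states, actions, rewards, next_states, dones = batch  # tensors sampled from a replay buffer

    # Q(s, a) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # Bootstrapped target: r + gamma * max_a' Q_target(s', a'), zeroed at terminal states
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * next_q * (1.0 - dones)

    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Periodically: target_net.load_state_dict(q_net.state_dict())  # hard target update
```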
##### Variants of Deep Q Network

1. **[Double Deep Q Network with Experience Replay (DDQN)](./model-based-learning/value-based/variants-dqn/double_dqn_lunarlander.ipynb)**
- Addresses the overestimation bias in vanilla DQN by decoupling action selection and evaluation.
- This decoupling helps stabilize training and improves the accuracy of Q-value estimates.
2. **[Prioritized Double Deep Q Network (Prioritized DDQN)](./model-based-learning/value-based/variants-dqn/prioritized_ddqn_lunarlander.ipynb)**
- Enhances the efficiency of experience replay by prioritizing transitions with higher temporal-difference (TD) errors.
- Combines the stability of Double DQN with prioritized sampling to focus on more informative experiences.
3. **[Dueling Double Deep Q Network (Dueling DDQN)](./model-based-learning/value-based/variants-dqn/dueling_ddqn_lunarlander.ipynb)**
- Introduces a new architecture that separates the estimation of **state value** $V(s)$ and **advantage function** $A(s, a)$
- Improves learning efficiency by explicitly modeling the state value $V(s)$, which captures the overall "desirability" of actions
- Works particularly well in environments where some actions are redundant or where the state value $V(s)$ plays a dominant role in decision-making (a minimal dueling head is sketched after this list).
4. **[Noisy Dueling Prioritized Double Deep Q-Network (Noisy DDQN)](./model-based-learning/value-based/variants-dqn/noisy_dueling_ddqn_lunarlander.ipynb)**
- Combines **Noisy Networks**, **Dueling Architecture**, **Prioritized Experience Replay**, and **Double Q-Learning** into a single framework.
- **Noisy Networks** replace ε-greedy exploration with parameterized noise, enabling more efficient exploration by learning stochastic policies.
- **Dueling Architecture** separates the estimation of **state value** $V(s)$ and **advantage function** $A(s, a)$, improving learning efficiency.
- **Prioritized Experience Replay** focuses on transitions with higher temporal-difference (TD) errors, enhancing sample efficiency.
- **Double Q-Learning** reduces overestimation bias by decoupling action selection from evaluation.
- This combination significantly improves convergence speed and stability, particularly in environments with sparse or noisy rewards.
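To make the dueling decomposition concrete, here is a minimal dueling head in PyTorch (layer sizes are illustrative, not the repository's exact architecture):

```python
import torch
import torch.nn as nn

class DuelingQNetwork(nn.Module):
    """Q(s, a) = V(s) + A(s, a) - mean_a A(s, a), computed from a shared feature trunk."""
    def __init__(self, state_size, action_size, hidden=64):
        super().__init__()
        self.feature = nn.Sequential(nn.Linear(state_size, hidden), nn.ReLU())
        self.value_stream = nn.Linear(hidden, 1)                 # V(s)
        self.advantage_stream = nn.Linear(hidden, action_size)   # A(s, a)

    def forward(self, state):
        x = self.feature(state)
        value = self.value_stream(x)
        advantage = self.advantage_stream(x)
        # Subtracting the mean advantage keeps the V/A decomposition identifiable
        return value + advantage - advantage.mean(dim=1, keepdim=True)

# net = DuelingQNetwork(state_size=8, action_size=4)   # e.g. LunarLander's 8-D state, 4 actions
```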
##### Asynchronous Deep Q Network

1. **[Asynchronous One Step Deep Q Network without Experience Replay (AsyncDQN)](./model-based-learning/value-based/async-dqn/asynchronous_dqn_lunarlander.ipynb)**
- Eliminates the dependency on experience replay by using asynchronous parallel processes to interact with the environment and update the shared Q-network.
- Achieves significant speedup by leveraging multiple CPU cores, making it highly efficient even without GPU acceleration.
- Compared to Dueling DDQN (22 minutes), AsyncDQN completes training in just 4.29 minutes on CPU, achieving a 5x speedup.
2. **[Asynchronous One Step Deep SARSA without Experience Replay (AsyncDSARSA)](./model-based-learning/value-based/async-dqn/asynchronous_one_step_deep_sarsa_lunarlander.py)**
- Utilizes same asynchronous parallel processes to update a shared Q-network without the need for experience replay.
- Employs a one-step, on-policy SARSA update rule that uses the next selected action to enhance stability and reduce overestimation (otherwise essentially the same as AsyncDQN).
3. **[Asynchronous N-Step Deep Q Network without Experience Replay (AsyncNDQN)](./model-based-learning/value-based/async-dqn/asynchronous_n_step_dqn_lunarlander.ipynb)**
- Extends AsyncDQN by incorporating N-step returns, which balances the trade-off between bias (shorter N) and variance (longer N).
- N-step returns accelerate the propagation of rewards across states, enabling faster convergence compared to one-step updates.
- Like AsyncDQN, it eliminates the dependency on experience replay, using asynchronous parallel processes to update the shared Q-network.
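The N-step return at the core of AsyncNDQN can be computed with a short backward pass over each worker's rollout; a sketch (illustrative names, assuming the bootstrap value comes from the value/target network):

```python
def n_step_returns(rewards, bootstrap_value, gamma=0.99):
    """Discounted N-step returns for one short rollout collected by a worker.

    rewards: [r_t, ..., r_{t+N-1}]; bootstrap_value: value estimate of the state
    reached after the rollout (0.0 if that state is terminal).
    """
    returns, G = [], bootstrap_value
    for r in reversed(rewards):
        G = r + gamma * G             # fold the bootstrap value back through the rollout
        returns.append(G)
    return list(reversed(returns))

# n_step_returns([1.0, 0.0, 2.0], bootstrap_value=0.5)  ->  [3.45, 2.47, 2.50] (approx.)
```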
#### Policy Based Iteration

##### Black Box Optimization
1. **[Hill Climbing](./model-based-learning/policy-based/black-box-optimization/hill-climbing/hill_climbing.ipynb)**
- A simple optimization technique that iteratively improves the policy by making small adjustments to the parameters.
- Relies on evaluating the performance of the policy after each adjustment and keeping the changes that improve performance.
- Works well in low-dimensional problems but can struggle with local optima and high-dimensional spaces.
2. **[Cross Entropy Method (CEM)](./model-based-learning/policy-based/black-box-optimization/cross-entropy/cross_entropy_method.ipynb)**
- A probabilistic optimization algorithm that searches for the best policy by iteratively sampling and updating a distribution over policy parameters.
- Particularly effective in high-dimensional or continuous action spaces due to its ability to focus on promising regions of the parameter space.
- Often used as a baseline for policy optimization in reinforcement learning.
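A compact sketch of the CEM loop described above, assuming a user-supplied `evaluate(weights)` function that runs one episode with the given policy parameters and returns its score (all names here are illustrative):

```python
import numpy as np

def cross_entropy_method(evaluate, dim, n_iters=50, pop_size=50, elite_frac=0.2, sigma=0.5):
    """Iteratively refit a Gaussian over policy parameters to its elite samples."""
    n_elite = int(pop_size * elite_frac)
    mean = np.zeros(dim)
    for _ in range(n_iters):
        # Sample a population of candidate parameter vectors around the current mean
        population = mean + sigma * np.random.randn(pop_size, dim)
        returns = np.array([evaluate(w) for w in population])
        # Keep the top-performing fraction and refit the sampling distribution to it
        elite = population[returns.argsort()[-n_elite:]]
        mean, sigma = elite.mean(axis=0), elite.std(axis=0) + 1e-3
    return mean
```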
##### Policy Gradient Methods

1. **[REINFORCE](./model-based-learning/policy-based/policy-gradient-methods/vanilla-reinforce/reinforce_with_discrete_actions.ipynb)**
- A foundational policy gradient algorithm that directly optimizes the policy by maximizing the expected cumulative reward.
- Uses Monte Carlo sampling to estimate the policy gradient.
- Updates the policy parameters based on the gradient of the expected reward with respect to the policy.
2. **[Improved REINFORCE](./model-based-learning/policy-based/policy-gradient-methods/improved-reinforce/improved_reinforce.ipynb)**
- Collects multiple trajectories in parallel, allowing the policy gradient to be estimated by averaging across trajectories for more stable updates.
- Rewards are normalized to stabilize learning and ensure consistent gradient step sizes.
- Credit assignment is improved by considering only the future rewards for each action, which reduces gradient noise without affecting the averaged gradient and leads to faster, more stable training.
3. **[Proximal Policy Optimization (PPO)](./model-based-learning/policy-based/policy-gradient-methods/proximal-policy-optimization/ppo.ipynb)**
- Introduces a clipped surrogate objective to ensure stable updates by preventing large changes in the policy.
- Balances exploration and exploitation by limiting the policy ratio deviation within a trust region.
- Combines the simplicity of REINFORCE with the stability of Trust Region Policy Optimization (TRPO), making it efficient and robust for large-scale problems.
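The clipped surrogate objective these bullets refer to fits in a few lines; a PyTorch sketch (tensor names are illustrative):

```python
import torch

def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO-Clip objective: limit how far the policy ratio can move a single update."""
    ratio = torch.exp(new_log_probs - old_log_probs)          # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (minimum) surrogate and negate it for gradient descent
    return -torch.min(unclipped, clipped).mean()
```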
#### Actor Critic Methods

1. **[A2C](./model-based-learning/actor-critic/advantage-actor-critic/a2c.ipynb)**
- A synchronous version of the Advantage Actor-Critic (A3C) algorithm.
- Uses multiple parallel environments to collect trajectories and updates the policy in a synchronized manner.
- Combines the benefits of policy-based and value-based methods by using a shared network to estimate both the policy (actor) and the value function (critic); the combined loss is sketched after this list.
2. **[A3C](./model-based-learning/actor-critic/async-advantage-actor-critic/a3c.py)**
- An asynchronous version of the Advantage Actor-Critic algorithm.
- Multiple agents interact with independent environments asynchronously, allowing faster updates and better exploration of the state space.
- Each agent maintains its own local network, which is periodically synchronized with a global network.
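As referenced in the A2C item above, here is a hedged sketch of the combined actor-critic loss computed from a batch of parallel rollouts (names and coefficients are illustrative, not the notebook's exact code):

```python
import torch

def a2c_loss(log_probs, values, returns, entropies,
             value_coef=0.5, entropy_coef=0.01):
    """Policy gradient on advantages + value regression - entropy bonus, in one scalar."""
    advantages = returns - values.detach()            # critic acts as a variance-reducing baseline
    policy_loss = -(log_probs * advantages).mean()    # actor term
    value_loss = (returns - values).pow(2).mean()     # critic term
    entropy_bonus = entropies.mean()                  # encourages exploration
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```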
## Environments Used in This Project

- **[Blackjack](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py)**: Classic card game environment for policy learning
- **[CliffWalking](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/cliffwalking.py)**: Grid-world navigation task with negative rewards and cliff hazards
- **[Taxi-v3](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/taxi.py)**: Grid-world transportation task where an agent learns to efficiently navigate, pick up and deliver passengers to designated locations while optimizing rewards.
- **[MountainCar](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/mountain_car.py)**: Continuous control task where an underpowered car must learn to build momentum by moving back and forth to overcome a steep hill and reach the goal position.
- **[Acrobot](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/acrobot.py)**: A two-link robotic arm environment where the goal is to swing the end of the second link above a target height by applying torque at the actuated joint. It challenges agents to solve nonlinear dynamics and coordinate the motion of linked components efficiently.
- **[LunarLander](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/lunar_lander.py)**: A physics-based environment where an agent controls a lunar lander to safely land on a designated pad. The task involves managing fuel consumption, balancing thrust, and handling the dynamics of gravity and inertia.
- **[PongDeterministic-v4](https://ale.farama.org/environments/pong/)**: A classic Atari environment where the agent learns to play Pong, a two-player game where the objective is to hit the ball past the opponent's paddle. The Deterministic-v4 variant uses fixed frame-skipping, making the environment faster and more predictable for training. This environment is commonly used to benchmark reinforcement learning algorithms, especially for discrete action spaces.
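All of these environments expose the same Gymnasium interaction loop that the notebooks build on; for example (assuming `gymnasium` is installed; exact environment IDs can differ between versions):

```python
import gymnasium as gym

env = gym.make("Acrobot-v1")
state, _ = env.reset(seed=0)
done = False
while not done:
    action = env.action_space.sample()   # a trained agent would choose the action here
    state, reward, terminated, truncated, _ = env.step(action)
    done = terminated or truncated
env.close()
```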
## Requirements

Create (and activate) a new environment with `Python 3.10`, then install [PyTorch](https://pytorch.org/get-started/locally/) `2.5.1`:
```bash
conda create -n DRL python=3.10
conda activate DRL
```

## Installation
1. Clone the repository:
```bash
git clone https://github.com/deepbiolab/drl.git
cd drl
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage
### Example: Monte Carlo Methods
Run the Monte Carlo implementation:
```bash
cd model-free-learning/discrete-state-problems/monte-carlo-methods
python monte_carlo.py
```
Or explore the detailed notebooks linked in the sections above.

## Future Work
- Comprehensive implementations of fundamental RL algorithms
- [x] [MC Control (Monte-Carlo Control)](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [MC Control with Incremental Mean](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [MC Control with Constant-alpha](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [SARSA](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [SARSA Max (Q-Learning)](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Expected SARSA](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Q-learning with Uniform Discretization](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Q-learning with Tile Coding Discretization](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
- [x] [DDQN](https://arxiv.org/pdf/1509.06461)
- [x] [Prioritized DDQN](https://arxiv.org/pdf/1511.05952)
- [x] [Dueling DDQN](https://arxiv.org/pdf/1511.06581)
- [x] [Async One Step DQN](https://arxiv.org/pdf/1602.01783)
- [x] [Async N Step DQN](https://arxiv.org/pdf/1602.01783)
- [x] [Async One Step SARSA](https://arxiv.org/pdf/1602.01783)
- [ ] [Distributional DQN](https://arxiv.org/pdf/1707.06887)
- [x] [Noisy DQN](https://arxiv.org/pdf/1706.10295)
- [ ] [Rainbow](https://arxiv.org/pdf/1710.02298)
- [x] [Hill Climbing](https://en.wikipedia.org/wiki/Hill_climbing)
- [x] [Cross Entropy Method](https://en.wikipedia.org/wiki/Cross-entropy_method)
- [x] [REINFORCE](https://people.cs.umass.edu/~barto/courses/cs687/williams92simple.pdf)
- [x] [PPO](https://arxiv.org/pdf/1707.06347)
- [x] [A3C](https://arxiv.org/pdf/1602.01783)
- [x] [A2C](https://arxiv.org/pdf/1602.01783)
- [ ] DDPG
- [ ] MCTS
- [ ] AlphaZero