
# Deep Reinforcement Learning (DRL) Implementation

This repository contains implementations of various deep reinforcement learning algorithms, focusing on fundamental concepts and practical applications.

## Project Structure

> It is recommended to follow the material in the given order.

### Model-Free Learning

#### Discrete-State Problems

##### Monte Carlo Methods
Implementation of Monte Carlo (MC) algorithms using the Blackjack environment as an example:

1. **[MC Prediction](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- First-visit MC prediction for estimating the action-value function
- Policy evaluation under a fixed stochastic policy

2. **[MC Control with Incremental Mean](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- GLIE (Greedy in the Limit with Infinite Exploration)
- Epsilon-greedy policy implementation
- Incremental mean updates

3. **[MC Control with Constant-alpha](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)**
- Fixed learning rate approach
- Enhanced control over the update process (see the sketch after this list)
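
For concreteness, the constant-alpha update at the heart of these notebooks looks roughly like the following. This is a minimal sketch against a Gymnasium-style environment; the function name and hyperparameter values are illustrative, not the repo's exact code:

```python
import numpy as np
from collections import defaultdict

def mc_control_constant_alpha(env, num_episodes, alpha=0.02, gamma=1.0, eps=0.1):
    """Every-visit constant-alpha MC control with an epsilon-greedy policy."""
    nA = env.action_space.n
    Q = defaultdict(lambda: np.zeros(nA))

    for _ in range(num_episodes):
        # Roll out one full episode under the current epsilon-greedy policy.
        episode, done = [], False
        state, _ = env.reset()
        while not done:
            if np.random.rand() < eps:
                action = np.random.randint(nA)      # explore
            else:
                action = int(np.argmax(Q[state]))   # exploit
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, action, reward))
            state, done = next_state, terminated or truncated

        # Walk the episode backwards, accumulating the return G, and nudge
        # each Q(s, a) toward G by a fixed fraction alpha of the error.
        G = 0.0
        for state, action, reward in reversed(episode):
            G = reward + gamma * G
            Q[state][action] += alpha * (G - Q[state][action])
    return Q
```

Replacing `alpha` with `1 / N(s, a)` recovers the incremental-mean variant; keeping it fixed weights recent episodes more heavily, which is what gives the constant-alpha variant its finer control over updates.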

##### Temporal Difference Methods
Implementation of TD algorithms on both the Blackjack and CliffWalking environments:

1. **[SARSA (On-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- State-Action-Reward-State-Action
- On-policy learning with epsilon-greedy exploration
- Per-step TD(0) updates within each episode

2. **[Q-Learning (Off-Policy TD Control)](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- Also known as SARSA-Max
- Off-policy learning using maximum action values
- Optimal action-value function approximation

3. **[Expected SARSA](model-free-learning/discrete-state-problems/temporal-difference-methods/temporal_difference_blackjack.ipynb)**
- Extension of SARSA using expected values
- More stable learning through action probability weighting
- Combines benefits of SARSA and Q-Learning (all three update rules are sketched below)
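
The three methods differ only in the bootstrap target used by the TD(0) update. A hedged sketch with a tabular `Q` indexed as `Q[state][action]` (function names are illustrative, not the repo's exact code):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # On-policy: bootstrap from the action actually taken next.
    target = r + gamma * Q[s_next][a_next]
    Q[s][a] += alpha * (target - Q[s][a])

def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # Off-policy ("SARSA-Max"): bootstrap from the greedy action.
    target = r + gamma * np.max(Q[s_next])
    Q[s][a] += alpha * (target - Q[s][a])

def expected_sarsa_update(Q, s, a, r, s_next, alpha, gamma, eps, nA):
    # Weight each next action by its epsilon-greedy probability.
    probs = np.full(nA, eps / nA)
    probs[int(np.argmax(Q[s_next]))] += 1.0 - eps
    target = r + gamma * float(np.dot(probs, Q[s_next]))
    Q[s][a] += alpha * (target - Q[s][a])
```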

#### Continuous-State Problems
##### Uniform Discretization

1. **[Q-Learning (Off-Policy TD Control)](model-free-learning/continuous-state-problems/uniform-discretization/discretization_mountaincar.ipynb)**
- Q-Learning applied to the MountainCar environment using a discretized state space
- State space discretization through uniform grid representation for continuous variables
- Exploration of the impact of discretization granularity on learning performance
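
A minimal sketch of the uniform grid idea, assuming NumPy and MountainCar's published state bounds (helper names are illustrative):

```python
import numpy as np

def create_uniform_grid(low, high, bins=(10, 10)):
    """One array of interior split points per state dimension."""
    return [np.linspace(low[d], high[d], bins[d] + 1)[1:-1]
            for d in range(len(bins))]

def discretize(sample, grid):
    """Map a continuous state to a tuple of integer bin indices."""
    return tuple(int(np.digitize(s, g)) for s, g in zip(sample, grid))

# MountainCar: position in [-1.2, 0.6], velocity in [-0.07, 0.07]
grid = create_uniform_grid(low=[-1.2, -0.07], high=[0.6, 0.07], bins=(20, 20))
state_idx = discretize([-0.5, 0.01], grid)   # -> (7, 11), usable as a Q-table key
```

Coarser grids learn faster but alias distinct states; finer grids are more precise but need far more experience, which is the trade-off the notebook explores.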

##### Tile Coding Discretization

1. **[Q-Learning (Off-Policy TD Control) with Tile Coding](model-free-learning/continuous-state-problems/tiling-discretization/tiling_discretization_acrobot.ipynb)**
- Q-Learning applied to the Acrobot environment using tile coding for state space representation
- Tile coding as a method to efficiently represent continuous state spaces by overlapping feature grids
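
Roughly, tile coding lays several uniform grids over the state space, each offset by a fraction of a tile width, and represents a state by the one active tile per grid. A hedged sketch of one common variant (names illustrative):

```python
import numpy as np

def create_tilings(low, high, bins=(10, 10), n_tilings=8):
    """n_tilings copies of a uniform grid, each shifted by a fraction of a tile."""
    low, high = np.asarray(low, float), np.asarray(high, float)
    width = (high - low) / np.asarray(bins)
    tilings = []
    for t in range(n_tilings):
        offset = width * t / n_tilings
        tilings.append([np.linspace(low[d], high[d], bins[d] + 1)[1:-1] + offset[d]
                        for d in range(len(bins))])
    return tilings

def tile_encode(sample, tilings):
    """Encode a continuous state as one active tile index per tiling."""
    return [tuple(int(np.digitize(s, g)) for s, g in zip(sample, tiling))
            for tiling in tilings]

# Q(s, a) is then the sum (or mean) of the learned weights of the active tiles,
# so a single update generalizes smoothly to nearby states.
```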

### Model-Based Learning

#### Value-Based Iteration

##### Vanilla Deep Q Network

1. **[Deep Q Network with Experience Replay (DQN)](./model-based-learning/value-iteration/vanilla-dqn/dqn_lunarlander.ipynb)**
- A neural network is used to approximate the Q-value function $Q(s, a)$.
- Breaks the temporal correlation of samples by randomly sampling from a replay buffer.
- Periodically updates the target network's parameters to reduce instability in target value estimation.
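
A condensed sketch of those two tricks together; the network width, hyperparameters, and variable names are placeholders, not the repo's exact settings:

```python
import random
from collections import deque
import numpy as np
import torch
import torch.nn as nn

# LunarLander: 8-dimensional observation, 4 discrete actions.
q_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net = nn.Sequential(nn.Linear(8, 64), nn.ReLU(), nn.Linear(64, 4))
target_net.load_state_dict(q_net.state_dict())   # start from identical weights
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
buffer = deque(maxlen=100_000)                   # replay buffer of (s, a, r, s', done)
gamma = 0.99

def train_step(batch_size=64):
    if len(buffer) < batch_size:
        return
    # Uniform random sampling breaks the temporal correlation of transitions.
    batch = random.sample(buffer, batch_size)
    s, a, r, s2, done = (torch.as_tensor(np.array(x), dtype=torch.float32)
                         for x in zip(*batch))
    q = q_net(s).gather(1, a.long().unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # Bootstrapping from a periodically synced copy keeps the target stable.
        target = r + gamma * target_net(s2).max(dim=1).values * (1 - done)
    loss = nn.functional.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Every C environment steps: target_net.load_state_dict(q_net.state_dict())
```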

##### Variants of Deep Q Network

1. **[Double Deep Q Network with Experience Replay (DDQN)](./model-based-learning/value-iteration/variants-dqn/double_dqn_lunarlander.ipynb)**
- Addresses the overestimation bias in vanilla DQN by decoupling action selection and evaluation.
- This decoupling helps stabilize training and improves the accuracy of Q-value estimates (see the target computation sketched after this list).
2. **[Prioritized Double Deep Q Network (Prioritized DDQN)](./model-based-learning/value-iteration/variants-dqn/prioritized_ddqn_lunarlander.ipynb)**
- Enhances the efficiency of experience replay by prioritizing transitions with higher temporal-difference (TD) errors.
- Combines the stability of Double DQN with prioritized sampling to focus on more informative experiences.
3. **[Dueling Double Deep Q Network (Dueling DDQN)](./model-based-learning/value-iteration/variants-dqn/dueling_ddqn_lunarlander.ipynb)**
- Introduces a new architecture that separates the estimation of the **state value** $V(s)$ and the **advantage function** $A(s, a)$.
- Improves learning efficiency by explicitly modeling $V(s)$, which captures the overall desirability of a state independent of the action chosen.
- Works particularly well in environments where some actions are redundant or where the state value $V(s)$ plays a dominant role in decision-making.
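
Hedged sketches of the three ideas above; helper names and default values are illustrative, not the repo's exact code:

```python
import numpy as np
import torch
import torch.nn as nn

def double_dqn_target(q_net, target_net, rewards, next_states, dones, gamma=0.99):
    """Double DQN: the online net *selects* the action, the target net *evaluates* it."""
    with torch.no_grad():
        next_actions = q_net(next_states).argmax(dim=1, keepdim=True)         # selection
        next_q = target_net(next_states).gather(1, next_actions).squeeze(1)   # evaluation
        return rewards + gamma * next_q * (1 - dones)

def priorities_to_probs(td_errors, alpha=0.6, eps=1e-5):
    """Prioritized replay: sample transitions with probability ~ |TD error|^alpha."""
    p = (np.abs(td_errors) + eps) ** alpha
    return p / p.sum()

class DuelingQNet(nn.Module):
    """Dueling head: Q(s, a) = V(s) + A(s, a) - mean_a' A(s, a')."""
    def __init__(self, obs_dim=8, n_actions=4, hidden=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)               # state value V(s)
        self.advantage = nn.Linear(hidden, n_actions)   # advantages A(s, a)

    def forward(self, x):
        h = self.body(x)
        v, a = self.value(h), self.advantage(h)
        # Subtracting the mean advantage makes V and A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)
```

(Full prioritized replay also applies importance-sampling weights to correct the bias that non-uniform sampling introduces; see the Prioritized DDQN paper linked under Future Work.)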

## Environments Used in This Project

- **[Blackjack](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/blackjack.py)**: Classic card game environment for policy learning
- **[CliffWalking](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/cliffwalking.py)**: Grid-world navigation task with negative rewards and cliff hazards
- **[Taxi-v3](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/toy_text/taxi.py)**: Grid-world transportation task where an agent learns to efficiently navigate, pick up and deliver passengers to designated locations while optimizing rewards.
- **[MountainCar](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/mountain_car.py)**: Continuous control task where an underpowered car must learn to build momentum by moving back and forth to overcome a steep hill and reach the goal position.
- **[Acrobot](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/classic_control/acrobot.py)**: A two-link robotic arm environment where the goal is to swing the end of the second link above a target height by applying torque at the actuated joint. It challenges agents to solve nonlinear dynamics and coordinate the motion of linked components efficiently.
- **[LunarLander](https://github.com/Farama-Foundation/Gymnasium/blob/main/gymnasium/envs/box2d/lunar_lander.py)**: A physics-based environment where an agent controls a lunar lander to safely land on a designated pad. The task involves managing fuel consumption, balancing thrust, and handling the dynamics of gravity and inertia.
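
All of these expose the same Gymnasium interaction loop. As a quick orientation, a random-policy rollout looks like this (assuming `gymnasium` is installed; environment IDs may vary slightly across releases):

```python
import gymnasium as gym

env = gym.make("MountainCar-v0")   # or "Blackjack-v1", "CliffWalking-v0", "Taxi-v3", "Acrobot-v1", ...
state, info = env.reset(seed=0)
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()   # placeholder for a learned policy
    state, reward, terminated, truncated, info = env.step(action)
    total_reward += reward
    done = terminated or truncated
env.close()
print(f"episode return: {total_reward}")
```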

## Requirements

Create (and activate) a new environment with `Python 3.10` and `PyTorch 2.5.1`:

- **Linux** or **Mac**:

```bash
conda create -n DRL python=3.10
conda activate DRL
```

## Installation

1. Clone the repository:
```bash
git clone https://github.com/deepbiolab/drl.git
cd drl
```

2. Install dependencies:
```bash
pip install -r requirements.txt
```

## Usage

### Example: Monte Carlo Methods

Run the Monte Carlo implementation:
```bash
cd model-free-learning/discrete-state-problems/monte-carlo-methods
python monte_carlo.py
```
Or explore the detailed notebook: [monte_carlo_blackjack.ipynb](model-free-learning/discrete-state-problems/monte-carlo-methods/monte_carlo_blackjack.ipynb)

## Future Work

Comprehensive implementations of fundamental RL algorithms:
- [x] [MC Control (Monte-Carlo Control)](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [MC Control with Incremental Mean](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [MC Control with Constant-alpha](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [SARSA](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [SARSA Max (Q-Learning)](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Expected SARSA](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Q-learning with Uniform Discretization](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [Q-learning with Tile Coding Discretization](http://incompleteideas.net/book/RLbook2020.pdf)
- [x] [DQN](https://storage.googleapis.com/deepmind-media/dqn/DQNNaturePaper.pdf)
- [x] [DDQN](https://arxiv.org/pdf/1509.06461)
- [x] [Prioritized DDQN](https://arxiv.org/pdf/1511.05952)
- [x] [Dueling DDQN](https://arxiv.org/pdf/1511.06581)
- [ ] [Rainbow](https://arxiv.org/pdf/1710.02298)
- [ ] Hill Climbing
- [ ] Cross Entropy Method
- [ ] REINFORCE
- [ ] A2C
- [ ] [A3C](https://arxiv.org/pdf/1602.01783)
- [ ] PPO
- [ ] DDPG
- [ ] MCTS, AlphaZero