https://github.com/marcometer/episodic-transformer-memory-ppo
Clean baseline implementation of PPO using an episodic TransformerXL memory
- Host: GitHub
- URL: https://github.com/marcometer/episodic-transformer-memory-ppo
- Owner: MarcoMeter
- License: mit
- Created: 2022-05-04T10:32:14.000Z (over 3 years ago)
- Default Branch: main
- Last Pushed: 2024-06-18T13:35:45.000Z (over 1 year ago)
- Last Synced: 2025-04-11T02:11:32.774Z (6 months ago)
- Topics: actor-critic, deep-reinforcement-learning, episodic-memory, gated-transformer-xl, gtrxl, memory-gym, on-policy, policy-gradient, pomdp, ppo, proximal-policy-optimization, pytorch, transformer, transformer-xl, trxl
- Language: Python
- Homepage:
- Size: 23.9 MB
- Stars: 172
- Watchers: 3
- Forks: 22
- Open Issues: 1
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# TransformerXL as Episodic Memory in Proximal Policy Optimization
This repository features a PyTorch-based implementation of PPO using TransformerXL (TrXL). It is intended as a clean baseline/reference implementation that shows how to successfully train memory-based agents using Transformers and PPO.
# Features
- Episodic Transformer Memory
  - TransformerXL (TrXL)
  - Gated TransformerXL (GTrXL)
- Environments
  - Proof-of-concept Memory Task (PocMemoryEnv)
  - CartPole
    - Masked velocity
  - Minigrid Memory
    - Visual Observation Space 3x84x84
    - Egocentric Agent View Size 3x3 (default 7x7)
    - Action Space: forward, rotate left, rotate right
  - [MemoryGym](https://github.com/MarcoMeter/drl-memory-gym)
    - Mortar Mayhem
    - Mystery Path
    - Searing Spotlights (WIP)
- Tensorboard
- Enjoy (watch a trained agent play)

# Citing this work
```bibtex
@article{pleines2023trxlppo,
title = {TransformerXL as Episodic Memory in Proximal Policy Optimization},
author = {Pleines, Marco and Pallasch, Matthias and Zimmer, Frank and Preuss, Mike},
journal= {Github Repository},
year = {2023},
url = {https://github.com/MarcoMeter/episodic-transformer-memory-ppo}
}
```

# Contents
- [Installation](#installation)
- [Train a model](#train-a-model)
- [Enjoy a model](#enjoy-a-model)
- [Episodic Transformer Memory Concept](#episodic-transformer-memory-concept)
- [Hyperparameters](#hyperparameters)
  - [Episodic Transformer Memory](#episodic-transformer-memory)
  - [General](#general)
  - [Schedules](#schedules)
- [Add Environment](#add-environment)
- [Tensorboard](#tensorboard)
- [Results](#results)

# Installation
Install [PyTorch](https://pytorch.org/get-started/locally/) 1.12.1 depending on your platform. We recommend using [Anaconda](https://www.anaconda.com/).
Create Anaconda environment:
```bash
conda create -n transformer-ppo python=3.7 --yes
conda activate transformer-ppo
```

CPU:
```bash
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cpuonly -c pytorch
```

CUDA:
```bash
conda install pytorch==1.12.1 torchvision==0.13.1 torchaudio==0.12.1 cudatoolkit=11.6 -c pytorch -c conda-forge
```

Install the remaining requirements and you are good to go:
```bash
pip install -r requirements.txt
```

# Train a model
Training is launched via `train.py`. `--config` specifies the path to the yaml config file that contains the hyperparameters. `--run-id` is used to distinguish training runs. After training, the model is saved to `./models/$run-id$.nn`.
```bash
python train.py --config configs/minigrid.yaml --run-id=my-trxl-training
```

# Enjoy a model
To watch an agent exploit its trained model, execute `enjoy.py`. Some pre-trained models can be found in `./models/`. The model to be enjoyed is specified via the `--model` flag.
```bash
python enjoy.py --model=models/mortar_mayhem_grid_trxl.nn
```

# Episodic Transformer Memory Concept

# Hyperparameters
#### Episodic Transformer Memory
| Hyperparameter | Description |
| --- | --- |
| num_blocks | Number of transformer blocks |
| embed_dim | Embedding size of every layer inside a transformer block |
| num_heads | Number of heads used in the transformer's multi-head attention mechanism |
| memory_length | Length of the sliding episodic memory window |
| positional_encoding | Relative and learned positional encodings can be used |
| layer_norm | Whether to apply layer normalization before or after every transformer component. Pre layer normalization refers to the identity map re-ordering. |
| gtrxl | Whether to use Gated TransformerXL |
| gtrxl_bias | Initial value for GTrXL's bias weight |
#### General
| Hyperparameter | Description |
| --- | --- |
| gamma | Discount factor |
| lamda | Regularization parameter used when calculating the Generalized Advantage Estimation (GAE) |
| updates | Number of cycles that the entire PPO algorithm is being executed |
| n_workers | Number of environments that are used to sample training data |
| worker_steps | Number of steps an agent samples data in each environment (batch_size = n_workers * worker_steps) |
| epochs | Number of times that the whole batch of data is used for optimization using PPO |
| n_mini_batch | Number of mini batches that are trained throughout one epoch |
| value_loss_coefficient | Multiplier of the value function loss to constrain it |
| hidden_layer_size | Number of hidden units in each linear hidden layer |
| max_grad_norm | Gradients are clipped by the specified max norm |
#### Schedules
These schedules can be used to polynomially decay the learning rate, the entropy bonus coefficient and the clip range.
| Hyperparameter | Description |
| --- | --- |
| learning_rate_schedule | The learning rate used by the AdamW optimizer |
| beta_schedule | Beta is the entropy bonus coefficient that is used to encourage exploration |
| clip_range_schedule | Strength of clipping losses done by the PPO algorithm |
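For orientation, the sketch below collects the hyperparameters described above into a single Python dict with illustrative values. The YAML files under `configs/` are authoritative; the key names, nesting and numbers shown here are assumptions, not the repository's exact schema.

```python
# Illustrative hyperparameter set (assumed values and nesting, not the exact
# schema of the YAML files in configs/).
config = {
    # General
    "gamma": 0.99,                  # discount factor
    "lamda": 0.95,                  # GAE regularization parameter
    "updates": 1000,                # number of PPO cycles
    "n_workers": 16,                # parallel environments
    "worker_steps": 256,            # steps per worker -> batch_size = 16 * 256 = 4096
    "epochs": 3,
    "n_mini_batch": 4,
    "value_loss_coefficient": 0.25,
    "hidden_layer_size": 384,
    "max_grad_norm": 0.5,
    # Episodic Transformer Memory
    "transformer": {
        "num_blocks": 3,
        "embed_dim": 384,
        "num_heads": 4,
        "memory_length": 64,                # sliding episodic memory window
        "positional_encoding": "relative",  # or "learned"
        "layer_norm": "pre",                # pre-LN = identity map re-ordering
        "gtrxl": False,                     # set True for Gated TransformerXL
        "gtrxl_bias": 0.0,
    },
    # Polynomially decayed schedules (initial -> final over max_decay_steps)
    "learning_rate_schedule": {"initial": 3.0e-4, "final": 1.0e-5, "power": 1.0, "max_decay_steps": 1000},
    "beta_schedule":          {"initial": 1.0e-3, "final": 1.0e-4, "power": 1.0, "max_decay_steps": 1000},
    "clip_range_schedule":    {"initial": 0.2,    "final": 0.1,    "power": 1.0, "max_decay_steps": 1000},
}
```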
# Add Environment
Follow these steps to train another environment:
1. Implement a wrapper of your desired environment. It needs the properties `observation_space`, `action_space` and `max_episode_steps`, as well as the functions `render()`, `reset()` and `step()` (a minimal sketch is shown below).
2. Extend the `create_env()` function in `utils.py` by adding another if-statement that queries the environment's "type".
3. Adjust the "type" and "name" keys inside the environment's yaml config.

Note that only environments with visual or vector observations are supported. The environment's action space can be either discrete or multi-discrete.
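A minimal wrapper sketch, assuming a classic Gym-style environment; the class name, the wrapped `CartPole-v1` environment and the pre-gymnasium step/reset API are illustrative assumptions, not the repository's own wrapper:

```python
import gym  # assumption: classic Gym API (reset -> obs, step -> obs, reward, done, info)

# Hypothetical wrapper following step 1 above; names other than the required
# properties and functions are illustrative.
class MyEnvWrapper:
    def __init__(self):
        self._env = gym.make("CartPole-v1")  # swap in your environment
        self.observation_space = self._env.observation_space
        self.action_space = self._env.action_space
        self.max_episode_steps = 500

    def reset(self):
        # Classic Gym API: reset() returns only the initial observation
        return self._env.reset()

    def step(self, action):
        # Classic Gym API: step() returns (obs, reward, done, info)
        return self._env.step(action)

    def render(self):
        return self._env.render()
```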
# Tensorboard
During training, tensorboard summaries are saved to `summaries/run-id/timestamp`.
Run `tensorboard --logdir=summaries` to watch the training statistics in your browser using the URL [http://localhost:6006/](http://localhost:6006/).
# Results
Every experiment is repeated on 5 random seeds. Each model checkpoint is evaluated on 50 unknown environment seeds, and each evaluation is repeated 5 times. Hence, one data point aggregates 1250 (5x5x50) episodes. Rliable is used to retrieve the interquartile mean and the bootstrapped confidence interval. Training is conducted using the more sophisticated DRL framework [neroRL](https://github.com/MarcoMeter/neroRL). The clean GRU-PPO baseline can be found [here](https://github.com/MarcoMeter/recurrent-ppo-truncated-bptt).
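As an illustration of this aggregation (not the repository's or neroRL's evaluation code), the interquartile mean over 5 training seeds x 50 evaluation seeds x 5 repetitions can be computed as a 25% trimmed mean; rliable additionally provides the bootstrapped confidence intervals:

```python
import numpy as np
from scipy.stats import trim_mean

# Illustrative only: aggregate 5 training seeds x 50 evaluation seeds x 5
# repetitions = 1250 episode returns into the interquartile mean (IQM),
# i.e. the 25% trimmed mean. Placeholder data stands in for real returns.
returns = np.random.default_rng(0).normal(loc=0.5, scale=0.1, size=(5, 50, 5))
iqm = trim_mean(returns.reshape(-1), proportiontocut=0.25)
print(f"IQM return: {iqm:.3f}")
```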
## Mystery Path Grid (Goal & Origin Hidden)

TrXL and GTrXL have identical performance. See [Issue #7](https://github.com/MarcoMeter/episodic-transformer-memory-ppo/issues/7).
## Mortar Mayhem Grid (10 commands)
