Ecosyste.ms: Awesome
An open API service indexing awesome lists of open source software.
https://github.com/ToruOwO/minimal-stable-PPO
A minimal and stable PPO.
- Host: GitHub
- URL: https://github.com/ToruOwO/minimal-stable-PPO
- Owner: ToruOwO
- Created: 2023-04-05T03:57:39.000Z (almost 2 years ago)
- Default Branch: main
- Last Pushed: 2024-02-09T22:19:32.000Z (11 months ago)
- Last Synced: 2024-02-09T23:27:24.702Z (11 months ago)
- Language: Python
- Size: 652 KB
- Stars: 79
- Watchers: 1
- Forks: 1
- Open Issues: 0
Metadata Files:
- Readme: README.md
Awesome Lists containing this project
- awesome-isaac-gym - Minimal Stable PPO
README
# minimal-stable-PPO
A minimal and stable Proximal Policy Optimization ([PPO](https://arxiv.org/pdf/1707.06347.pdf)), tested on IsaacGymEnvs.
## Requirements
* Python (tested on 3.7)
* PyTorch (tested on 1.8.1)
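For reference, one way to reproduce the tested versions (the exact PyTorch install command depends on your platform and CUDA setup, so treat this as a sketch):
```bash
# Sketch: install the tested PyTorch version; adjust for your CUDA setup.
pip install torch==1.8.1
```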
## Training on IsaacGymEnvs
Follow the instructions [here](https://github.com/NVIDIA-Omniverse/IsaacGymEnvs) to install Isaac Gym and the IsaacGymEnvs repo.
Optional instructions for cleaner code and dependencies:
* Under the [isaacgymenvs](https://github.com/NVIDIA-Omniverse/IsaacGymEnvs/tree/main/isaacgymenvs) directory, the `cfg` and `learning` subdirectories and the `train.py` file can be removed.
* The dependency on `rl-games` on [this](https://github.com/NVIDIA-Omniverse/IsaacGymEnvs/blob/main/setup.py#L22) line can be removed (see the sketch below).
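A rough sketch of this optional cleanup, run from the IsaacGymEnvs repo root (the editable reinstall at the end is an assumption and only needed if the package was installed with `pip install -e .`):
```bash
# Optional cleanup sketch, run from the IsaacGymEnvs repo root.
rm -rf isaacgymenvs/cfg isaacgymenvs/learning
rm isaacgymenvs/train.py
# After deleting the rl-games entry from install_requires in setup.py, reinstall:
pip install -e .
```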
### First example
To train a policy on Cartpole, run:
```bash
python train.py task=Cartpole
```
Cartpole should converge to an optimal policy within a few seconds of starting.

In the `configs` directory, we provide the main config file and template configs for the `Cartpole` and `AllegroHand` tasks. We use [Hydra](https://hydra.cc/docs/intro/) for config management, following IsaacGymEnvs.
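Since `train.py` appears to be a standard Hydra entry point (an assumption based on the Hydra usage described above), the fully composed config can be inspected without launching training via Hydra's built-in `--cfg` flag:
```bash
# Print the composed job config for a task without training (standard Hydra flag).
python train.py task=Cartpole --cfg job
```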
### Custom tasks
To train on additional tasks, follow the template configs to define `[new_task].yaml` under `configs/task` and `[new_task]PPO.yaml` under `configs/train`, then select the new task by name as shown below.
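For example, with hypothetical files `configs/task/MyNewTask.yaml` and `configs/train/MyNewTaskPPO.yaml` in place (the task name is made up for illustration), training would be launched as:
```bash
# Train on the newly added task (MyNewTask is a placeholder name).
python train.py task=MyNewTask
```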
### Results
Logging to TensorBoard and WandB is supported by default.
Our PPO results match those of IsaacGymEnvs' default RL implementation in terms of both training speed and performance.
#### Cartpole in 40 seconds
#### AllegroHand in 3 hours
## Key arguments and parameters
### Main config (`config.yaml`)
* `task=TASK` - Selects which task to use. Options correspond to the config for each environment in `configs/task`.
* `num_envs=NUM_ENVS` - Selects the number of environments to use (overriding the default number of environments set in the task config).
* `seed=SEED` - Sets a seed value for randomizations, and overrides the default seed set up in the task config.
* `device_id=DEVICE_ID` - Device used for physics simulation and the RL algorithm.
* `graphics_device_id=GRAPHICS_DEVICE_ID` - Which Vulkan graphics device ID to use for rendering. Defaults to 0. **Note** - this may be different from CUDA device ID, and does **not** follow PyTorch-like device syntax.
* `pipeline=PIPELINE` - Which API pipeline to use. Defaults to `gpu`, can also set to `cpu`. When using the `gpu` pipeline, all data stays on the GPU and everything runs as fast as possible. When using the `cpu` pipeline, simulation can run on either CPU or GPU, depending on the `sim_device` setting, but a copy of the data is always made on the CPU at every step.
* `test=TEST` - If set to `True`, only runs inference on the policy and does not do any training.
* `checkpoint=CHECKPOINT_PATH` - Path to a checkpoint to load for training or testing.
* `headless=HEADLESS` - Whether to run in headless mode.
* `output_name=OUTPUT_NAME` - Sets the output folder name.
* `wandb_mode=WANDB_MODE` - Options for using WandB.
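Several of these options can be combined in a single command; the values below are purely illustrative and the checkpoint path is a placeholder:
```bash
# Train AllegroHand with illustrative overrides for the options above.
python train.py task=AllegroHand num_envs=8192 seed=42 headless=True
# Run inference only, loading a saved checkpoint (path is a placeholder).
python train.py task=AllegroHand test=True checkpoint=path/to/checkpoint.pth
```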
### RL config (`train/[task_name]PPO.yaml`)
The main configs to experiment with are:
* `train.network.mlp.units`
* `train.ppo.gamma`
* `train.ppo.tau`
* `train.ppo.learning_rate`
* `train.ppo.lr_schedule`
* `train.ppo.kl_threshold` (only relevant when `lr_schedule == 'kl'`)
* `train.ppo.e_clip`
* `train.ppo.horizon_length`
* `train.ppo.minibatch_size`
* `train.ppo.max_agent_steps`

We recommend the default values for the other configs, but of course, RL is RL :)
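Assuming these nested keys are exposed as standard Hydra command-line overrides (which the config layout above suggests), a tuning run might look like the following, with illustrative values:
```bash
# Example hyperparameter overrides (values are illustrative, not recommendations).
python train.py task=AllegroHand \
    train.ppo.learning_rate=5e-4 \
    train.ppo.horizon_length=16 \
    train.ppo.minibatch_size=16384
```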
Here are some helpful guides to tuning PPO hyperparameters:
* [The 37 Implementation Details of Proximal Policy Optimization](https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/)
* [Engstrom L, Ilyas A, Santurkar S, Tsipras D, Janoos F, Rudolph L, Madry A. Implementation matters in deep policy gradients: A case study on PPO and TRPO. International Conference on Learning Representations, 2020](https://arxiv.org/abs/2005.12729)
* [Andrychowicz M, Raichuk A, Stańczyk P, Orsini M, Girgin S, Marinier R, Hussenot L, Geist M, Pietquin O, Michalski M, Gelly S. What matters in on-policy reinforcement learning? A large-scale empirical study. International Conference on Learning Representations, 2021](https://arxiv.org/abs/2006.05990)
* [Duan Y, Chen X, Houthooft R, Schulman J, Abbeel P. Benchmarking deep reinforcement learning for continuous control. International Conference on Machine Learning, 2016 (pp. 1329-1338). PMLR](https://arxiv.org/abs/1604.06778)
I also documented a few general takeaways in [this](https://twitter.com/ToruO_O/status/1643469546765029376?cxt=HHwWgMDQ4Yy2484tAAAA) tweet.
## Wait, doesn't IsaacGymEnvs already provide RL training scripts?
Yes, [rl_games](https://github.com/Denys88/rl_games/tree/master/rl_games) has great performance but can be hard to use.
If all you're looking for is a **simple, clean, performant** PPO that is easy to modify and extend, try this repo :))) And feel free to give feedback to make this better!
## Citation
Please use the following BibTeX entry if you find this repo helpful and would like to cite it:
```
@misc{minimal-stable-PPO,
  author = {Lin, Toru},
  title = {A minimal and stable PPO},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/ToruOwO/minimal-stable-PPO}},
}
```
## Acknowledgement
Shout-out to [hora](https://github.com/HaozhiQi/hora) and [rl_games](https://github.com/Denys88/rl_games/), which this code implementation referenced!