{"id":13935817,"url":"https://github.com/adik993/ppo-pytorch","last_synced_at":"2025-07-19T21:30:44.043Z","repository":{"id":92393788,"uuid":"165447640","full_name":"adik993/ppo-pytorch","owner":"adik993","description":"Proximal Policy Optimization(PPO) with Intrinsic Curiosity Module(ICM)","archived":false,"fork":false,"pushed_at":"2019-01-12T23:55:11.000Z","size":1472,"stargazers_count":133,"open_issues_count":3,"forks_count":27,"subscribers_count":4,"default_branch":"master","last_synced_at":"2024-11-27T03:34:49.568Z","etag":null,"topics":["cartpole-v1","deep-learning","generalized-advantage-estimation","icm","intrinsic-curiosity-module","mountaincar-v0","pendulum-v0","ppo","proximal-policy-optimization","pytorch","reinforcement-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/adik993.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-01-12T23:45:12.000Z","updated_at":"2024-11-12T08:16:09.000Z","dependencies_parsed_at":"2023-05-17T03:00:09.456Z","dependency_job_id":null,"html_url":"https://github.com/adik993/ppo-pytorch","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/adik993/ppo-pytorch","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adik993%2Fppo-pytorch","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adik993%2Fppo-pytorch/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adik993%2Fppo-pytorch/releases","manifes
ts_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adik993%2Fppo-pytorch/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/adik993","download_url":"https://codeload.github.com/adik993/ppo-pytorch/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/adik993%2Fppo-pytorch/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":266019657,"owners_count":23864916,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["cartpole-v1","deep-learning","generalized-advantage-estimation","icm","intrinsic-curiosity-module","mountaincar-v0","pendulum-v0","ppo","proximal-policy-optimization","pytorch","reinforcement-learning"],"created_at":"2024-08-07T23:02:07.170Z","updated_at":"2025-07-19T21:30:44.034Z","avatar_url":"https://github.com/adik993.png","language":"Python","funding_links":[],"categories":["Python"],"sub_categories":[],"readme":"# Proximal Policy Optimization(PPO) in PyTorch\n\nThis repository contains an implementation of the reinforcement learning algorithm Proximal Policy Optimization (PPO).\nIt also implements the Intrinsic Curiosity Module (ICM).\n\n|  CartPole-v1 (PPO)                         |  MountainCar-v0 (PPO + ICM)                      |  Pendulum-v0 (PPO + ICM)                   |\n|:------------------------------------------:|:------------------------------------------------:|:------------------------------------------:|\n| ![CartPole-V1](assets/CartPole-V1-PPO.gif) | ![MountainCar-v0](assets/MountainCar-v0-PPO.gif) | 
![Pendulum-v0](assets/Pendulum-v0-PPO.gif) |\n\n## What is PPO\n\nPPO is an on-policy policy gradient algorithm built with stability in mind. It optimizes a clipped surrogate objective\nto keep the new policy close to the previous one.\n\nBecause it is an on-policy algorithm, it uses the gathered experience to update the policy and then discards it (there\nis no replay buffer). As a result it does well in environments with dense rewards like `CartPole-v1`, where the\nreward arrives immediately, but it struggles to learn a policy for environments with sparse rewards like\n`MountainCar-v0`, where the agent is rewarded only for reaching the top, which is a rare event. Off-policy\nalgorithms like DQN solve sparse-reward problems much more easily, because they can store these\nrare events in the replay buffer and reuse them many times during training.\n\nTo make learning sparse-reward problems easier, we introduce the concept of curiosity.\n\n## What is curiosity\n\nCuriosity is the idea of giving the agent an additional reward, called the intrinsic reward, on top of the reward\nfrom the environment itself, called the extrinsic reward. There are many ways to define curiosity, but this\nproject uses the Intrinsic Curiosity Module (ICM). Its authors define curiosity as a measure of the surprise an\nencountered state brings to the agent. They achieve that by encoding states into latent vectors and then\ntraining two models: the forward model, which predicts the encoded next state given the encoded state and the action, and the\ninverse model, which given the encoded state and the encoded next state predicts the action that must have been taken to\ntransition from one to the other. The intrinsic reward is calculated as the distance between the actual encoded next\nstate vector and the forward model's prediction of it. 
One may wonder what the inverse model is for if it's\nnot used to calculate the reward. The authors explain it with the example of an agent exploring the environment\nand seeing a tree with leaves moving in the wind. The leaves are out of the agent's control, but it would still be\ncurious about them. To avoid this, the inverse model was introduced: it makes sure the agent is curious only about states it\ncan control.\n\n## How to run\n\nFirst install all dependencies listed in `requirements.txt`. Then run one of the following, or use them\nas an example to run the algorithm on any other environment:\n * CartPole-v1 `python run_cartpole.py`\n * MountainCar-v0 `python run_mountain_car.py`\n * Pendulum-v0 `python run_pendulum.py`\n\n## Implementation details\n\nThe agent(`PPO`) explores(`Runner`) multiple environments at once(`MultiEnv`) for a specified number of steps.\nIf `Curiosity` is plugged in, the reward is augmented with the intrinsic reward from the curiosity module. If\n`normalize_state` or `normalize_reward` is enabled, normalization is performed(`Normalizer`) on the states and\nrewards respectively. Then the discounted reward(`Reward`) and discounted advantage(`Advantage`) are calculated\nfrom the gathered rewards. That data is split into `n_mini_batches` and used to perform `n_optimization_epochs` of\ntraining with the Adam optimizer using `learning_rate`. Most of the classes accept a `Reporter` argument, which can be used\nto plug in the `TensorBoardReporter` that publishes data to TensorBoard for live tracking of the learning progress.\n\n![tensorboard](assets/tensorboard.png)\n\n## Normalize or not\n\nNormalization may help on some complicated continuous problems like `Pendulum-v0`, but may hurt performance on\nsimple discrete environments like `CartPole-v1`.\n\n## TODO\n\n- [ ] Early stopping\n- [ ] Model saving\n- [ ] CNN\n- [ ] LSTM\n\n## References\n\n1. 
[Proximal Policy Optimization Algorithms](https://arxiv.org/abs/1707.06347)\n2. [Curiosity-driven Exploration by Self-supervised Prediction](https://arxiv.org/abs/1705.05363)\n3. [High-Dimensional Continuous Control Using Generalized Advantage Estimation](https://arxiv.org/abs/1506.02438)","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadik993%2Fppo-pytorch","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fadik993%2Fppo-pytorch","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fadik993%2Fppo-pytorch/lists"}