{"id":20685223,"url":"https://github.com/vachanvy/reinforcement-learning","last_synced_at":"2025-04-09T22:19:53.611Z","repository":{"id":261790715,"uuid":"852753863","full_name":"VachanVY/Reinforcement-Learning","owner":"VachanVY","description":"PyTorch implementations of algorithms from \"Reinforcement Learning: An Introduction by Sutton and Barto\", along with various RL research papers.","archived":false,"fork":false,"pushed_at":"2025-03-12T08:22:07.000Z","size":16550,"stargazers_count":80,"open_issues_count":0,"forks_count":3,"subscribers_count":5,"default_branch":"main","last_synced_at":"2025-04-02T20:11:12.115Z","etag":null,"topics":["actor-critic-algorithm","actor-critic-pytorch","artificial-intelligence","ddpg-algorithm","deep-deterministic-policy-gradient","deep-reinforcement-learning","dqn","dqn-pytorch","policy-gradient","policy-gradient-with-baseline","ppo-algorithm","proximal-policy-optimization","pytorch","reinforcement-learning","reinforcement-learning-an-introduction","rl-book","soft-actor-critic-continuous","sutton-barto-book"],"latest_commit_sha":null,"homepage":"https://tinyurl.com/rlzero2hero","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/VachanVY.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-09-05T11:15:57.000Z","updated_at":"2025-03-28T20:55:52.000Z","dependencies_parsed_at":"2025-01-19T10:19:36.381Z","dependency_job_id":"22185050-a1c4-4776-acce-8a3e810efe38","html_url":"https://github.com/VachanVY/Reinforcement-Learning","commit_stats":null,"previous_names":["vachanvy/reinforcemen
t-learning"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VachanVY%2FReinforcement-Learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VachanVY%2FReinforcement-Learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VachanVY%2FReinforcement-Learning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/VachanVY%2FReinforcement-Learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/VachanVY","download_url":"https://codeload.github.com/VachanVY/Reinforcement-Learning/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":248119670,"owners_count":21050814,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["actor-critic-algorithm","actor-critic-pytorch","artificial-intelligence","ddpg-algorithm","deep-deterministic-policy-gradient","deep-reinforcement-learning","dqn","dqn-pytorch","policy-gradient","policy-gradient-with-baseline","ppo-algorithm","proximal-policy-optimization","pytorch","reinforcement-learning","reinforcement-learning-an-introduction","rl-book","soft-actor-critic-continuous","sutton-barto-book"],"created_at":"2024-11-16T22:26:24.959Z","updated_at":"2025-04-09T22:19:53.592Z","avatar_url":"https://github.com/VachanVY.png","language":"Python","funding_links":[],"categories":["📚 Other resources and classic RL"],"sub_categories":[],"readme":"# Reinforcement-Learning\n\n---\n\n## Algorithms from 
Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. Barto\n\n| **Algorithms**                          | **Environment (Name \u0026 Goal)**               | **Environment GIF**                           | **Plots**               |\n|----------------------------------------|---------------------------------------------|-----------------------------------------------|-------------------------------------------|\n| [Policy Iteration](#policy-iteration)  | **Frozen Lake**: The player makes moves until they reach the goal or fall in a hole. The lake is slippery (unless disabled), so the player may sometimes move perpendicular to the intended direction.               | ![pol](images/frozen_lake_policy_iteration.gif)  ![pol](images/frozen_slippery_lake_policy_iteration.gif)   | - |\n| [Value Iteration](#value-iteration)    | **Taxi-v3**: The taxi starts at a random location within the grid. The passenger starts at one of the designated pick-up locations. The passenger also has a randomly assigned destination (one of the four designated locations).                
| ![Gridworld](images/Taxi-v3_value_iteration.gif) ![Gridworld](images/Taxi-v3_value_iteration1.gif) ![Gridworld](images/Taxi-v3_value_iteration2.gif)     | - |\n| [Monte Carlo Exploring Starts](#monte-carlo-exploring-starts) | **Blackjack-v1**: a card game where the goal is to beat the dealer by obtaining cards that sum to closer to 21 (without going over 21) than the dealer's cards        | ![Blackjack](images/mc_blackjack_value_iteration1.gif)  | ![Graph](images/blackjack_actions.png) ![Graph](images/blackjack_qvals.png) |\n| [Sarsa](#sarsa)                        | **CliffWalking-v0**: Reach goal without falling  | ![CliffWalking](images/cliff_walking_sarsa.gif)           | ![Graph](images/cliff_walking_gamma0.99_alpha0.1_epsilon0.1.png)  Sarsa: Orange       |\n| [Q-learning](#q-learning)              | **CliffWalking-v0**: Reach goal without falling  | ![CliffWalking](images/cliff_walking_qlearning.gif)      | ![Graph](images/cliff_walking_gamma0.99_alpha0.1_epsilon0.1.png)  Q-learning: Blue    |\n| [Expected Sarsa](#expected-sarsa)      | **CliffWalking-v0**: Reach goal without falling  | ![CliffWalking](images/cliff_walking_expected_sarsa.gif)  | ![Graph](images/cliff_walking_gamma0.99_alpha0.1_epsilon0.1.png)  Expected Sarsa: Green |\n| [Double Q-learning](#double-q-learning)          | **CliffWalking-v0**: Reach goal without falling  | ![CliffWalking](images/cliff_walking_double_qlearning.gif)  | ![Graph](images/cliff_walking_gamma0.99_alpha0.1_epsilon0.1.png) Double Q-learning: Red |\n| n-step Bootstrapping **(TODO)**        | -                                           | -                                             | -                                         |\n| [Dyna-Q]() | **ShortcutMazeEnv** (*custom made env*): Reach the goal dodging obstacles | ![maze0](images/shortcut_maze_before_Dyna-Q_with_25_planning_steps.gif) ![maze](images/shortcut_maze_after_Dyna-Q_with_25_planning_steps.gif) | ![compare by 
steps](images/dyna_q_num_planning_steps_zoomed.png)                                         | \n| [Prioritized Sweeping]() | **ShortcutMazeEnv** (*custom made env*): Reach the goal dodging obstacles | ![maze0](images/shortcut_maze_prioritized_sweeping_maze_env.gif) | ![steps](images/prioritized_sweeping_maze_env_num_training_curves.png) ![sum rewards](images/prioritized_sweeping_maze_env_sum_rewards.png)                                        | \n| [Monte-Carlo Policy-Gradient](#monte-carlo-policy-gradient) | **CartPole-v1**: goal is to balance the pole by applying forces in the left and right direction on the cart.                 | ![CartPole](images/actor_critic_cartpole.gif)               | ![Graph](images/monte_carlo_policy_gradient_cartpole_train_graph.png)         |\n| [REINFORCE with Baseline](#reinforce-with-baseline) | **CartPole-v1**: goal is to balance the pole by applying forces in the left and right direction on the cart.                 | ![CartPole](images/to/reinforce_baseline.gif)  | - |\n| [One-Step Actor-Critic](#one-step-actor-critic) | **CartPole-v1**: goal is to balance the pole by applying forces in the left and right direction on the cart.                 
| ![CartPole](images/actor_critic_cartpole.gif)        | ![Graph](images/actor_critic_cartpole_rewards.png) |\n| Policy Gradient on Continuous Actions **(TODO)** | -                                    | -                                             | -                                         |\n| On-policy Control with Approximation **(TODO)** | -                                   | -                                             | -                                         |\n| Off-policy Methods with Approximation **(TODO)** | -                                   | -                                             | -                                         |\n| Eligibility Traces **(TODO)**          | -                                           | -                                             | -                                         |\n\n---\n\n## Deep Reinforcement Learning: Paper Implementations\n\n| **Year** | **Paper**                                                       | **Environment (Name \u0026 Goal)**               | **Environment GIF**                           | **Plots**               |\n|----------|-----------------------------------------------------------------|---------------------------------------------|-----------------------------------------------|-------------------------|\n| 2013     | [Playing Atari with Deep Reinforcement Learning](#)             | **ALE/Pong-v5** - You control the right paddle, you compete against the left paddle controlled by the computer. You each try to keep deflecting the ball away from your goal and into your opponent’s goal.   
| \u003cimg src=\"images/dqn_pong.gif\" width=\"200\"\u003e                 | \u003cimg src=\"images/loss_dqn.png\" width=\"1000\"\u003e \u003cimg src=\"images/Sum_of_Reward.svg\" width=\"1000\"\u003e \u003cimg src=\"images/Steps_per_Episode.svg\" width=\"1000\"\u003e |\n| 2014     | [Deep Deterministic Policy Gradient (DDPG)](#)                  | **Pendulum-v1** - The pendulum starts in a random position and the goal is to apply torque on the free end to swing it into an upright position, with its center of gravity right above the fixed point.     | ![Pendulum](images/ddpg_pendulum.gif)                | ![Plot](images/ddpg_on_Pendulum-v1.png) |\n| 2015, 2016     | [Deep Reinforcement Learning with Double Q-Learning + Prioritized Experience Replay](#)         | -                   | -                   | - |\n| 2017     | [Proximal Policy Optimization (PPO)](#)                         | **LunarLander-v3**: This environment is a classic rocket trajectory optimization problem. According to Pontryagin’s maximum principle, it is optimal to fire the engine at full throttle or turn it off                 | ![opaos](images/lunar_lander.gif)                 | ![Plot](images/LunarLander-v3_rewards.png) |\n| 2018     | [Soft Actor-Critic (SAC)](#)                                    | **InvertedDoublePendulum-v5**: The cart can be pushed left or right, and the goal is to balance the second pole on top of the first pole, which is in turn on top of the cart, by applying continuous forces to the cart. 
| Constant Alpha: ![Plot](images/sac_inverteddoublependulum-v5_.gif) Learnable Alpha (**TODO**: add an explanation for adaptive alpha loss): ![Plot](images/sac_adaptive_alpha_inverteddoublependulum-v5_.gif) | Constant Alpha: ![Plot](images/sac_rewards_InvertedDoublePendulum-v5.png) Learnable Alpha: ![Plot](images/sac_rewards_adaptive_alpha_InvertedDoublePendulum-v5.png) |\n| 2017     | [Mastering the Game of Go without Human Knowledge](#)           | Go - Win against self-played adversary       | -                   |  -   |\n| 2017     | [AlphaZero](#)                                                  | Chess - Beat traditional engines             | -             |  -   |\n| 2020     | [Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model](#)  | Multiple Environments (Planning with Models)| -                | -   |\n| 20xx     | [AlphaFold](#)                                                  | Protein Folding - Predict protein structures| -     | -   |\n\n\n\n---\n\n## Reinforcement Learning: An Introduction by Richard S. Sutton and Andrew G. 
Barto\n\u003e The links below don't point anywhere yet; the code needs refactoring before links can be added. For now, go to the repo directly👆\n* [Dynamic Programming]()\n  * [Policy Iteration - Policy Evaluation \u0026 Policy Iteration]()\n  * [Value Iteration]()\n* [Monte-Carlo Methods]()\n  * [Monte Carlo Exploring Starts]()\n* [Temporal-Difference (Tabular)]()\n  * [Sarsa]()\n  * [Q-learning]()\n  * [Expected Sarsa]()\n  * Double Q-learning **(TODO)**\n* n-step Bootstrapping (**TODO**)\n* Planning and Learning with Tabular Methods (**TODO**)\n* [On-policy Prediction with Approximation]()\n  * Covered in the [Papers]() section, where we use function approximators like neural networks for RL\n* On-policy Control with Approximation (**TODO**)\n* Off-policy Methods with Approximation (**TODO**)\n* Eligibility Traces (**TODO**)\n* [Policy Gradient Methods]()\n  * [Monte-Carlo Policy-Gradient]()\n  * [REINFORCE with Baseline]()\n  * [One-Step Actor-Critic]()\n  * Policy Gradient on Continuous Actions (**TODO**)\n\n---\n## Reinforcement Learning: Paper Implementations\n\u003e The links below don't point anywhere yet; the code needs refactoring before links can be added. For now, go to the repo directly👆\n* [2013: Playing Atari with Deep Reinforcement Learning]()\n* Prioritized DDQN || 2015: Deep Reinforcement Learning with Double Q-learning **+** 2016 Prioritized Experience Replay || **(TODO)**\n* [2017: Proximal Policy Optimization (PPO)]()\n* [2014: Deep Deterministic Policy Gradient]()\n* 2018: Soft Actor-Critic **(TODO)**\n* AlphaGo, AlphaZero, AlphaFold, etc: **(TODO)**\n  * 2017: Mastering the Game of Go without Human Knowledge\n  * 2017: AlphaZero\n  * 2020: Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model\n  * 20xx: AlphaFold\n\n\n\u003e 👇 Not Updated... Go to the repo files directly. 
To see the agent act in the env, see the .gif files in the `images` folder\n\n\u003cdiv style=\"display: flex; justify-content: space-around; align-items: center; flex-wrap: wrap;\"\u003e\n  \u003cimg src=\"images/dqn_pong.gif\" alt=\"DQN Pong\" width=\"300\"\u003e\n  \u003cimg src=\"images/cliff_walking_qlearning.gif\" alt=\"Q-Learning Cliff Walking\" width=\"300\"\u003e\n  \u003cimg src=\"images/cliff_walking_sarsa.gif\" alt=\"SARSA Cliff Walking\" width=\"300\"\u003e\n  \u003cimg src=\"images/cliff_walking_expected_sarsa.gif\" alt=\"Expected SARSA Cliff Walking\" width=\"300\"\u003e\n\u003c/div\u003e\n\n\n## Playing Atari with Deep Reinforcement Learning\n```bash\npip install -r requirements.txt # INSTALL REQUIRED LIBRARIES\npython dqn.py # TO TRAIN FROM SCRATCH\npython dqn_play.py # TO GENERATE .gif OF AI AGENT PLAYING PONG\n```\n\n![Alt Text](images/dqn_pong.gif)\n![](images/sum_reward_tensorboard.png)\n![](images/steps_per_ep_dqn.png)\n![](images/loss_dqn.png)\n\n## (Important Clips from the paper)\n* The\nmodel is a convolutional neural network, trained with a variant of Q-learning,\nwhose input is raw pixels and whose output is a value function estimating future\nrewards. We apply our method to seven Atari 2600 games from the Arcade Learning\nEnvironment, with no adjustment of the architecture or learning algorithm\n\n### Introduction\n* Most successful RL applications\nthat operate on these domains have relied on hand-crafted features combined with linear value\nfunctions or policy representations. Clearly, the performance of such systems heavily relies on the\nquality of the feature representation.\n* Firstly, most successful deep learning applications to date have required large amounts of hand labelled\ntraining data\n* RL algorithms, on the other hand, must be able to learn from a scalar reward\nsignal that is frequently sparse, noisy and delayed. 
The delay between actions and resulting rewards,\nwhich can be thousands of timesteps long, seems particularly daunting when compared to the direct\nassociation between inputs and targets found in supervised learning\n* Another issue is that most deep\nlearning algorithms assume the data samples to be independent, while in reinforcement learning one\ntypically encounters sequences of highly correlated states\n* Furthermore, in RL the data distribution\nchanges as the algorithm learns new behaviours, which can be problematic for deep learning\nmethods that assume a fixed underlying distribution\n\n* The network is\ntrained with a variant of the Q-learning [26] algorithm, with stochastic gradient descent to update\nthe weights. To alleviate the problems of correlated data and non-stationary distributions, we use\nan experience replay mechanism [13] which randomly samples previous transitions, and thereby\nsmooths the training distribution over many past behaviors\n* The network was not provided\nwith any game-specific information or hand-designed visual features, and was not privy to the\ninternal state of the emulator; it learned from nothing but the video input, the reward and terminal\nsignals, and the set of possible actions, just as a human player would. Furthermore the network architecture\nand all hyperparameters used for training were kept constant across the games\n\n### Background\n* ![image](https://github.com/user-attachments/assets/20f449cc-9595-4f4e-9128-59599b55549d)\n* ![image](https://github.com/user-attachments/assets/0e714b40-8993-481d-a57c-b077a099d837)\n\n### Deep Reinforcement Learning\n* Experience replay where we store the agent’s experiences at each time-step, $e_t = (s_t, a_t, r_t, s_{t+1})$,\nin a data-set $D = e_1, \\ldots, e_N$, pooled over many episodes into a replay memory. During the inner\nloop of the algorithm, we apply Q-learning updates, or minibatch updates, to samples of experience,\n$e \\sim D$, drawn at random from the pool of stored samples. 
After performing experience replay,\nthe agent selects and executes an action according to an eps-greedy policy\n* ![image](https://github.com/user-attachments/assets/8434e1f9-70e8-4b5d-9d3d-d093a742b9e2)\n* Advantages:\n  * First, each step of experience is potentially used in many weight updates, which allows for greater data efficiency.\n  * Second, learning directly from consecutive samples is inefficient, due to the strong correlations\nbetween the samples; randomizing the samples breaks these correlations and therefore reduces the\nvariance of the updates\n  * Third, when learning on-policy the current parameters determine the next\ndata sample that the parameters are trained on. For example, if the maximizing action is to move left\nthen the training samples will be dominated by samples from the left-hand side; if the maximizing\naction then switches to the right then the training distribution will also switch. It is easy to see how\nunwanted feedback loops may arise and the parameters could get stuck in a poor local minimum, or\neven diverge catastrophically. Note that when learning by experience replay, it is necessary to learn off-policy\n(because our current parameters are different to those used to generate the sample), which motivates\nthe choice of Q-learning\n\n* In practice, our algorithm only stores the last N experience tuples in the replay memory, and samples\nuniformly at random from D when performing updates. This approach is in some respects limited\nsince the memory buffer does not differentiate important transitions and always overwrites with\nrecent transitions due to the finite memory size N. Similarly, the uniform sampling gives equal\nimportance to all transitions in the replay memory. 
A more sophisticated sampling strategy might\nemphasize transitions from which we can learn the most, similar to prioritized sweeping\n\n![image](https://github.com/user-attachments/assets/4ace2761-7e8c-4fcd-9691-ac6cdfce0124)\n![image](https://github.com/user-attachments/assets/8a03abfc-1bd0-4ed9-ae5e-1e12e8d44cb7)\n\n### Experiments\n* ![image](https://github.com/user-attachments/assets/56ec5a1b-8976-4de8-9a48-904af341c71d)\n\n## Q-learning, Sarsa and Expected Sarsa on Cliff-Walking\n```bash\npython qlearning_sarsa_expectedsarsa_on_cliff_walking.py --animation True --experiments True\n```\n\u003cdiv style=\"display: flex; justify-content: space-around; align-items: center; flex-wrap: wrap;\"\u003e\n  \u003cimg src=\"images/cliff_walking_qlearning.gif\" alt=\"Q-Learning Cliff Walking\" width=\"300\"\u003e\n  \u003cimg src=\"images/cliff_walking_sarsa.gif\" alt=\"SARSA Cliff Walking\" width=\"300\"\u003e\n  \u003cimg src=\"images/cliff_walking_expected_sarsa.gif\" alt=\"Expected SARSA Cliff Walking\" width=\"300\"\u003e\n\u003c/div\u003e\n\n* Q-learning learns the optimal policy, which travels right along the cliff edge. With an epsilon-greedy policy, the agent therefore occasionally falls off the cliff during **TRAINING** (not during inference), which is why its `sum of rewards` during training is lower than that of Sarsa and Expected Sarsa\n\u003cdiv style=\"display: flex; justify-content: space-around; align-items: center; flex-wrap: wrap;\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.99_alpha0.1_epsilon0.1.png\" alt=\"Gamma 0.99, Alpha 0.1, Epsilon 0.1\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.95_alpha0.1_epsilon0.1.png\" alt=\"Gamma 0.95, Alpha 0.1, Epsilon 0.1\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.8_alpha0.1_epsilon0.1.png\" alt=\"Gamma 0.8, Alpha 0.1, Epsilon 0.1\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.99_alpha0.01_epsilon0.png\" alt=\"Gamma 0.99, Alpha 
0.01, Epsilon 0\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.75_alpha0.1_epsilon0.png\" alt=\"Gamma 0.75, Alpha 0.1, Epsilon 0\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.75_alpha0.1_epsilon0.1.png\" alt=\"Gamma 0.75, Alpha 0.1, Epsilon 0.1\" width=\"200\"\u003e\n  \u003cimg src=\"images/cliff_walking_gamma0.9_alpha0.1_epsilon0.1.png\" alt=\"Gamma 0.9, Alpha 0.1, Epsilon 0.1\" width=\"200\"\u003e\n\u003c/div\u003e\n\n\n![image](https://github.com/user-attachments/assets/57246c91-0eeb-44c4-ac28-db057afef960)\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvachanvy%2Freinforcement-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fvachanvy%2Freinforcement-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fvachanvy%2Freinforcement-learning/lists"}