{"id":15017247,"url":"https://github.com/mtrazzi/rl-book-challenge","last_synced_at":"2025-08-20T04:32:44.604Z","repository":{"id":47537997,"uuid":"235986019","full_name":"mtrazzi/rl-book-challenge","owner":"mtrazzi","description":"self-studying the Sutton \u0026 Barto the hard way","archived":false,"fork":false,"pushed_at":"2021-11-27T11:15:43.000Z","size":14217,"stargazers_count":190,"open_issues_count":1,"forks_count":32,"subscribers_count":8,"default_branch":"master","last_synced_at":"2024-12-09T15:51:09.046Z","etag":null,"topics":["anki","matplotlib","reinforcement-learning","rl-algorithms","rl-book","sutton-barto-book"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/mtrazzi.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2020-01-24T10:57:42.000Z","updated_at":"2024-12-07T11:56:04.000Z","dependencies_parsed_at":"2022-09-14T17:22:02.754Z","dependency_job_id":null,"html_url":"https://github.com/mtrazzi/rl-book-challenge","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtrazzi%2Frl-book-challenge","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtrazzi%2Frl-book-challenge/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtrazzi%2Frl-book-challenge/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/mtrazzi%2Frl-book-challenge/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/mtrazzi","download_url":"https://codeload.github.
com/mtrazzi/rl-book-challenge/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":230394228,"owners_count":18218707,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["anki","matplotlib","reinforcement-learning","rl-algorithms","rl-book","sutton-barto-book"],"created_at":"2024-09-24T19:50:07.180Z","updated_at":"2024-12-19T07:06:55.302Z","avatar_url":"https://github.com/mtrazzi.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# In this repo\n\n1. Python replication of all the plots from [Reinforcement Learning: An Introduction](http://incompleteideas.net/book/RLbook2018trimmed.pdf)\n2. Solution for all of the exercises\n3. Anki flashcards summary of the book\n\n## 1. Replicate all the figures\n\nTo reproduce a figure, say figure 2.2, do:\n\n```bash\ncd chapter2\npython figures.py 2.2\n```\n\n### Chapter 2\n1. [Figure 2.2: Average performance of epsilon-greedy action-value methods on the 10-armed testbed](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/fig2.2.png)\n2. [Figure 2.3: Optimistic initial action-value estimates](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/fig2.3.png)\n3. [Figure 2.4: Average performance of UCB action selection on the 10-armed testbed](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/fig2.4.png)\n4. 
[Figure 2.5: Average performance of the gradient bandit algorithm](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/fig2.5.png)\n5. [Figure 2.6: A parameter study of the various bandit algorithms](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/fig2.6.png)\n\n### Chapter 4\n1. Figure 4.2: Jack’s car rental problem ([value function](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/fig4.2.png), [policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/fig4.2_policy.png))\n2. Figure 4.3: The solution to the gambler’s problem ([value function](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/fig4.3.png), [policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/fig4.3_policy.png))\n\n### Chapter 5\n1. [Figure 5.1: Approximate state-value functions for the blackjack policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.1.png)\n2. [Figure 5.2: The optimal policy and state-value function for blackjack found by Monte Carlo ES](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.2.png)\n3. [Figure 5.3: Weighted importance sampling](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.3.png)\n4. [Figure 5.4: Ordinary importance sampling with surprisingly unstable estimates](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.4.png)\n5. Figure 5.5: A couple of right turns for the racetrack task ([1](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.5_left.png), [2](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.5_right_1.png), [3](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/fig5.5_right_2.png))\n\n### Chapter 6\n1. 
[Figure 6.1: Changes recommended in the driving home example by Monte Carlo methods (left)\nand TD methods (right)](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/fig6.1.png)\n2. [Example 6.2: Random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/example6.2.png) ([comparison](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/example6.2_comparison.png))\n3. [Figure 6.2: Performance of TD(0) and constant MC under batch training on the random walk task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/fig6.2.png)\n4. [Example 6.5: Windy Gridworld](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/example6.5.png)\n5. [Example 6.6: Cliff Walking](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/example6.6.png)\n6. [Figure 6.3: Interim and asymptotic performance of TD control methods](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/fig6.3.png) ([comparison](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/fig6.3_comparison.png))\n7. [Figure 6.5: Comparison of Q-learning and Double Q-learning](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/fig6.5.png)\n\n### Chapter 7\n1. [Figure 7.2: Performance of n-step TD methods on 19-state random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/fig7.2.png) ([comparison](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/fig7.2_comparison.png))\n2. [Figure 7.4: Gridworld example of the speedup of policy learning due to the use of n-step\nmethods](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/fig7.4.png)\n\n### Chapter 8\n1. 
[Figure 8.2: Average learning curves for Dyna-Q agents varying in their number of planning steps](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.2.png)\n2. [Figure 8.3: Policies found by planning and nonplanning Dyna-Q agents](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.3.png)\n3. [Figure 8.4: Average performance of Dyna agents on a blocking task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.4.png)\n4. [Figure 8.5: Average performance of Dyna agents on a shortcut task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.5.png)\n5. [Example 8.4: Prioritized sweeping significantly shortens learning time on the Dyna maze task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/example8.4.png)\n6. [Figure 8.7: Comparison of efficiency of expected and sample updates](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.7.png)\n7. [Figure 8.8: Relative efficiency of different update distributions](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/fig8.8.png)\n\n### Chapter 9\n1. [Figure 9.1: Gradient Monte Carlo algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.1.png)\n2. [Figure 9.2: Semi-gradient n-steps TD algorithm on the 1000-state random walk task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.2.png)\n3. [Figure 9.5: Fourier basis vs polynomials on the 1000-state random walk task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.5.png) ([comparison](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.5_comparison.png))\n4. [Figure 9.10: State aggregation vs. 
Tile coding on 1000-state random walk task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.10.png) ([comparison](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter9/plots/fig9.10.png))\n\n### Chapter 10\n1. Figure 10.1: The cost-to-go function for Mountain Car task in one run ([428 steps](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.1_428_steps.png); [12](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.1_12_episodes.png), [104](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.1_104_episodes.png), [1000](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.1_1000_episodes.png), [9000](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.1_9000_episodes.png) episodes)\n2. [Figure 10.2: Learning curves for semi-gradient Sarsa on Mountain Car task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.2.png)\n3. [Figure 10.3: One-step vs multi-step performance of semi-gradient Sarsa on the Mountain Car task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.3.png)\n4. [Figure 10.4: Effect of the alpha and n on early performance of n-step semi-gradient Sarsa](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.4.png)\n5. [Figure 10.5: Differential semi-gradient Sarsa on the access-control queuing task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter10/plots/fig10.5.png)\n\n### Chapter 11\n1. [Figure 11.2: Demonstration of instability on Baird’s counterexample](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter11/plots/fig11.2.png)\n2. 
[Figure 11.5: The behavior of the TDC algorithm on Baird’s counterexample](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter11/plots/fig11.5.png)\n3. [Figure 11.6: The behavior of the ETD algorithm in expectation on Baird’s counterexample](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter11/plots/fig11.6.png)\n\n### Chapter 12\n1. [Figure 12.3: Off-line λ-return algorithm on 19-state random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter12/plots/fig12.3.png)\n2. [Figure 12.6: TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter12/plots/fig12.6.png)\n3. [Figure 12.8: True online TD(λ) algorithm on 19-state random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter12/plots/fig12.8.png)\n4. [Figure 12.10: Sarsa(λ) with replacing traces on Mountain Car](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter12/plots/fig12.10.png)\n5. [Figure 12.11: Summary comparison of Sarsa(λ) algorithms on Mountain Car](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter12/plots/fig12.11.png)\n\n### Chapter 13\n1. [Figure 13.1: REINFORCE on the short-corridor grid world](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter13/plots/fig13.1.png)\n2. [Figure 13.2: REINFORCE with baseline on the short-corridor grid-world](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter13/plots/fig13.2.png)\n\n## 2. Solution to all of the exercises ([text answers](https://github.com/mtrazzi/rl-book-challenge/tree/master/exercises.txt))\n\nTo reproduce the results of an exercise, say exercise 2.5 do:\n\n```bash\ncd chapter2\npython figures.py ex2.5\n```\n\n### Chapter 2\n\n1. 
[Exercise 2.5: Difficulties that sample-average methods have for nonstationary problems](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/ex2.5.png)\n\n2. [Exercise 2.11: Figure analogous to Figure 2.6 for the nonstationary case](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter2/plots/ex2.11.png)\n\n### Chapter 4\n\n1. Exercise 4.7: Modified Jack's car rental problem ([value function](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.7.png), [policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.7_policy.png))\n\n2. Exercise 4.9: Gambler’s problem with ph = 0.25 ([value function](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.9_ph_025.png), [policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.9_ph_025_policy.png)) and ph = 0.55 ([value function](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.9_ph_055.png), [policy](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter4/plots/ex4.9_ph_055_policy.png))\n\n### Chapter 5\n\n1. Exercise 5.14: Modified MC Control on the racetrack ([1](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/ex5.14_right_1.png), [2](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter5/plots/ex5.14_right_2.png))\n\n### Chapter 6\n\n1. [Exercise 6.4: Wider range of alpha values](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/ex6.4.png)\n2. [Exercise 6.5: High alpha, effect of initialization](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/ex6.5.png)\n3. [Exercise 6.9: Windy Gridworld with King’s Moves](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/ex6.9.png)\n4. 
[Exercise 6.10: Stochastic Wind](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/ex6.10.png)\n5. [Exercise 6.13: Double Expected Sarsa vs. Expected Sarsa](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter6/plots/ex6.13.png)\n\n### Chapter 7\n\n1. [Exercise 7.2: Sum of TD errors vs. n-step TD on the 19-state random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/ex7.2.png)\n2. [Exercise 7.3: 19 states vs. 5 states, left-side outcome of -1](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/ex7.3.png)\n3. [Exercise 7.7: Off-policy action-value prediction on a not-so-random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/ex7.7.png)\n4. [Exercise 7.10: Off-policy action-value prediction on a not-so-random walk](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter7/plots/ex7.10.png)\n\n### Chapter 8\n1. [Exercise 8.1: n-step Sarsa on the maze task](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/ex8.1.png)\n2. [Exercise 8.4: Gridworld experiment to test the exploration bonus](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter8/plots/ex8.4.png)\n\n### Chapter 11\n\n1. [Exercise 11.3: One-step semi-gradient Q-learning on Baird’s counterexample](https://raw.githubusercontent.com/mtrazzi/rl-book-challenge/master/chapter11/plots/ex11.3.png)\n\n## 3. [Anki flashcards](https://drive.google.com/file/d/1-KSrOr5G6M9G-SjJ0weqnycJvX2rspi1/view?usp=sharing) (cf. 
[this blog](http://augmentingcognition.com/ltm.html))\n\n## Appendix\n\n### Dependencies\n\n```bash\nnumpy\nmatplotlib\nseaborn\n```\n\n### Credits\n\nAll of the code and answers are mine, except for Mountain Car's [tile coding](https://github.com/mtrazzi/rl-book-challenge/blob/master/chapter10/tiles_sutton.py) (URL in the book).\n\nThis README is inspired by [ShangtongZhang's repo](https://github.com/ShangtongZhang/reinforcement-learning-an-introduction).\n\n### Design choices\n\n1. All of the chapters are self-contained.\n2. The environments use a gym-like API with methods:\n\n```python\ns = env.reset()\ns_p, r, d, dict = env.step(a)\n```\n\n### How long did it take\n\nThe entire thing (plots, exercises, and Anki cards, including reviewing them) took about 400h of focused work.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtrazzi%2Frl-book-challenge","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fmtrazzi%2Frl-book-challenge","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fmtrazzi%2Frl-book-challenge/lists"}