{"id":28496491,"url":"https://github.com/linesd/tabular-methods","last_synced_at":"2025-10-08T20:05:08.645Z","repository":{"id":294941301,"uuid":"210219253","full_name":"linesd/tabular-methods","owner":"linesd","description":"Tabular methods for reinforcement learning","archived":false,"fork":false,"pushed_at":"2020-07-03T15:23:04.000Z","size":1588,"stargazers_count":38,"open_issues_count":1,"forks_count":8,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-08-10T00:08:52.732Z","etag":null,"topics":["algorithm","cliffwalking","gridworld","gridworld-cliff","gridworld-environment","policy-evaluation","policy-iteration","q-learning","q-learning-algorithm","q-learning-vs-sarsa","reinforcement-learning","reinforcement-learning-agent","reinforcement-learning-algorithms","sarsa","sarsa-algorithm","sarsa-learning","tabular-environments","tabular-methods","tabular-q-learning","value-iteration"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/linesd.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2019-09-22T21:53:10.000Z","updated_at":"2025-05-21T11:45:36.000Z","dependencies_parsed_at":"2025-05-26T04:45:28.799Z","dependency_job_id":null,"html_url":"https://github.com/linesd/tabular-methods","commit_stats":null,"previous_names":["linesd/tabular-methods"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/linesd/tabular-methods","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linesd%2Ftabular-metho
ds","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linesd%2Ftabular-methods/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linesd%2Ftabular-methods/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linesd%2Ftabular-methods/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/linesd","download_url":"https://codeload.github.com/linesd/tabular-methods/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/linesd%2Ftabular-methods/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":279000704,"owners_count":26082819,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","status":"online","status_checked_at":"2025-10-08T02:00:06.501Z","response_time":56,"last_error":null,"robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":true,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["algorithm","cliffwalking","gridworld","gridworld-cliff","gridworld-environment","policy-evaluation","policy-iteration","q-learning","q-learning-algorithm","q-learning-vs-sarsa","reinforcement-learning","reinforcement-learning-agent","reinforcement-learning-algorithms","sarsa","sarsa-algorithm","sarsa-learning","tabular-environments","tabular-methods","tabular-q-learning","value-iteration"],"created_at":"2025-06-08T12:30:28.522Z","updated_at":"2025-10-08T20:05:08.630Z","avatar_url":"https://github.com/linesd.png","language":"Python","readme":"#
 tabular-methods\n\n\n[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://github.com/linesd/tabular-methods/blob/master/LICENSE) \n[![Python 3.5+](https://img.shields.io/badge/python-3.5+-blue.svg)](https://www.python.org/downloads/release/python-360/)\n\nThis repository is a Python implementation of tabular methods for reinforcement learning, focusing on the dynamic programming and temporal difference methods presented in [Reinforcement Learning, An Introduction](http://incompleteideas.net/book/the-book-2nd.html). The following algorithms are implemented:\n\n1. **Value Iteration:** see page 67 of [Reinforcement Learning, An Introduction](http://incompleteideas.net/book/bookdraft2017nov5.pdf)\n2. **Policy Iteration:** see page 64 of [Reinforcement Learning, An Introduction](http://incompleteideas.net/book/bookdraft2017nov5.pdf)\n3. **SARSA, on-policy TD control:** see page 105 of [Reinforcement Learning, An Introduction](http://incompleteideas.net/book/bookdraft2017nov5.pdf)\n4. **Q-Learning, off-policy TD control:** see page 107 of [Reinforcement Learning, An Introduction](http://incompleteideas.net/book/bookdraft2017nov5.pdf)\n\n**Notes:**\n- Tested for Python \u003e= 3.5\n\n**Table of Contents:**\n1. [Install](#install)\n2. [Examples](#examples)\n    1. [Create Grid World](#create-grid-world)\n    2. [Dynamic Programming (Value Iteration \u0026 Policy Iteration)](#dynamic-programming)\n    3. [Temporal Difference (SARSA and Q-Learning)](#temporal-difference)\n3. [Test](#testing)\n\n## Install\n```\n# clone the repo and install dependencies\ngit clone https://github.com/linesd/tabular-methods.git\ncd tabular-methods\npip install -r requirements.txt\n```\n\n## Examples\n### Create Grid World\nThis section describes the example found in `examples/example_plot_gridworld.py`, which illustrates all the functionality of the `GridWorld` class found in `env/grid_world.py`. 
It shows how to:\n\n- Define the grid world size by specifying the number of rows and columns.\n- Add a single start state.\n- Add multiple goal states.\n- Add obstructions such as walls, bad states and restart states.\n- Define the rewards for the different types of states.\n- Define the transition probabilities for the world.\n\nThe grid world is instantiated with the number of rows, number of columns, start state and goal states:\n```\n# specify world parameters\nnum_rows = 10\nnum_cols = 10\nstart_state = np.array([[0, 4]]) # shape (1, 2)\ngoal_states = np.array([[0, 9], \n                        [2, 2], \n                        [8, 7]]) # shape (n, 2)\n\ngw = GridWorld(num_rows=num_rows,\n               num_cols=num_cols,\n               start_state=start_state,\n               goal_states=goal_states)\n```\n\nAdd obstructed states, bad states and restart states:\n\n- Obstructed states: walls that prohibit the agent from entering that state.\n- Bad states: states that incur a greater penalty than a normal step.\n- Restart states: states that incur a high penalty and transition the agent back to the start state (but do not end the episode).\n\n```\nobstructions = np.array([[0,7],[1,1],[1,2],[1,3],[1,7],[2,1],[2,3],\n                         [2,7],[3,1],[3,3],[3,5],[4,3],[4,5],[4,7],\n                         [5,3],[5,7],[5,9],[6,3],[6,9],[7,1],[7,6],\n                         [7,7],[7,8],[7,9],[8,1],[8,5],[8,6],[9,1]]) # shape (n, 2)\nbad_states = np.array([[1,9],\n                       [4,2],\n                       [4,4],\n                       [7,5],\n                       [9,9]])      # shape (n, 2)\nrestart_states = np.array([[3,7],\n                           [8,2]])  # shape (n, 2)\n\ngw.add_obstructions(obstructed_states=obstructions,\n                    bad_states=bad_states,\n                    restart_states=restart_states)\n```\nDefine the rewards for the different state types:\n\n```\ngw.add_rewards(step_reward=-1,\n               goal_reward=10,\n               bad_state_reward=-6,\n               restart_state_reward=-100)\n```\nAdd transition probabilities to the grid world.\n\n`p_good_transition` is the probability that the agent successfully executes the intended action. The action is executed incorrectly with probability 1 - p_good_transition, and in this case the agent transitions to the left of the intended direction with probability (1 - p_good_transition) * bias and to the right with probability (1 - p_good_transition) * (1 - bias).\n\n```\ngw.add_transition_probability(p_good_transition=0.7,\n                              bias=0.5)\n```\n\nFinally, add a discount to the world and create the model.\n\n```\ngw.add_discount(discount=0.9)\nmodel = gw.create_gridworld()\n```\n\nThe created grid world can be viewed with the `plot_gridworld` function in `utils/plots`.\n\n```\nplot_gridworld(model, title=\"Test world\")\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"doc/imgs/unsolved_gridworld.png\" width=500\u003e\n\u003c/p\u003e\n\n### Dynamic Programming\n#### Value Iteration \u0026 Policy Iteration\n\nHere the created grid world is solved with the dynamic programming method value iteration (from `examples/example_value_iteration.py`). 
See also `examples/example_policy_iteration.py` for the equivalent solution via policy iteration.\n\nApply value iteration to the grid world:\n\n```\n# solve with value iteration\nvalue_function, policy = value_iteration(model, maxiter=100)\n\n# plot the results\nplot_gridworld(model, value_function=value_function, policy=policy, title=\"Value iteration\")\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"doc/imgs/value_iteration.png\" width=500\u003e\n\u003c/p\u003e\n\n### Temporal Difference\n#### SARSA \u0026 Q-Learning\n\nThis example describes the code found in `examples/example_sarsa.py` and `examples/example_qlearning.py`, which use SARSA and Q-Learning to replicate the solution to the classic **cliff walk** environment on page 108 of [Sutton's book](http://incompleteideas.net/book/bookdraft2017nov5.pdf).\n\nThe cliff walk environment is created with the code:\n```\n# specify world parameters\nnum_rows = 4\nnum_cols = 12\nrestart_states = np.array([[3,1],[3,2],[3,3],[3,4],[3,5],\n                           [3,6],[3,7],[3,8],[3,9],[3,10]])\nstart_state = np.array([[3,0]])\ngoal_states = np.array([[3,11]])\n\n# create model\ngw = GridWorld(num_rows=num_rows,\n               num_cols=num_cols,\n               start_state=start_state,\n               goal_states=goal_states)\ngw.add_obstructions(restart_states=restart_states)\ngw.add_rewards(step_reward=-1,\n               goal_reward=10,\n               restart_state_reward=-100)\ngw.add_transition_probability(p_good_transition=1,\n                              bias=0)\ngw.add_discount(discount=0.9)\nmodel = gw.create_gridworld()\n\n# plot the world\nplot_gridworld(model, title=\"Cliff Walk\")\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"doc/imgs/unsolved_cliffworld.png\" width=500\u003e\n\u003c/p\u003e\n\nSolve the cliff walk with the on-policy temporal difference control method **SARSA** and plot the results. 
\nSARSA returns three values: the q_function, the policy and the state_counts. Here the policy and the state_counts are passed to `plot_gridworld` so that the path most frequently used by the agent is shown. However, the q_function can be passed instead to show the q_function values on the plot, as was done in the dynamic programming examples.\n\n```\n# solve with SARSA\nq_function, pi, state_counts = sarsa(model, alpha=0.1, epsilon=0.2, maxiter=100, maxeps=100000)\n\n# plot the results\nplot_gridworld(model, policy=pi, state_counts=state_counts, title=\"SARSA\")\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"doc/imgs/sarsa_cliffworld.png\" width=500\u003e\n\u003c/p\u003e\n\nSolve the cliff walk with the off-policy temporal difference control method **Q-Learning** and plot the results.\n\n```\n# solve with Q-Learning\nq_function, pi, state_counts = qlearning(model, alpha=0.9, epsilon=0.2, maxiter=100, maxeps=10000)\n\n# plot the results\nplot_gridworld(model, policy=pi, state_counts=state_counts, title=\"Q-Learning\")\n```\n\u003cp align=\"center\"\u003e\n  \u003cimg src=\"doc/imgs/qlearning_cliffworld.png\" width=500\u003e\n\u003c/p\u003e\n\nFrom the plots, it is clear that the SARSA agent learns a conservative solution to the cliff walk and shows a preference for the path furthest away from the cliff edge. In contrast, the Q-Learning agent learns the riskier path along the cliff edge.\n\n## Testing\n\nTesting is set up with [pytest](https://docs.pytest.org) (requires installation). Should you want to check version compatibility or make changes, you can verify that the original tabular-methods functionality remains unaffected by executing `pytest -v` in the **test** directory. 
You should see the following:\n\n![pytest_results](doc/imgs/pytest_results.png)\n","funding_links":[],"categories":[],"sub_categories":[],"project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinesd%2Ftabular-methods","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Flinesd%2Ftabular-methods","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Flinesd%2Ftabular-methods/lists"}