{"id":21989713,"url":"https://github.com/iandanforth/sai-assignment","last_synced_at":"2026-04-29T11:03:00.098Z","repository":{"id":142059650,"uuid":"177102489","full_name":"iandanforth/sai-assignment","owner":"iandanforth","description":"Q-Learners","archived":false,"fork":false,"pushed_at":"2019-03-22T08:38:00.000Z","size":88,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2026-04-03T04:16:48.313Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/iandanforth.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2019-03-22T08:36:30.000Z","updated_at":"2019-03-22T08:38:02.000Z","dependencies_parsed_at":null,"dependency_job_id":"09ee104a-4a9a-4534-9aff-d99371bb1b5a","html_url":"https://github.com/iandanforth/sai-assignment","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/iandanforth/sai-assignment","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iandanforth%2Fsai-assignment","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iandanforth%2Fsai-assignment/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iandanforth%2Fsai-assignment/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iandanforth%2Fsai-assignment/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/iandanforth","downloa
d_url":"https://codeload.github.com/iandanforth/sai-assignment/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/iandanforth%2Fsai-assignment/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":32422533,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-04-29T06:29:02.080Z","status":"ssl_error","status_checked_at":"2026-04-29T06:29:00.631Z","response_time":110,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-29T19:32:20.622Z","updated_at":"2026-04-29T11:03:00.082Z","avatar_url":"https://github.com/iandanforth.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Coding Assignment\n\n### Scripts\n\nFiles `part1.py` and `part2.py` implement the prompts and produce relevant charts \nwhich will open in your default browser.\n\n## Installation\n\n### Requirements\n\n - Python 2.7\n\n### Setup\n\nI recommend creating a virtual environment of your choice (pipenv, venv, conda) before installation\nhowever that is not required. 
e.g.\n\n```\npip install pipenv\npipenv --python 2.7\npipenv shell\n```\n\n`pip install -U -r requirements.txt`\n\n#### Tests + Coverage\n\n`pytest --cov`\n\n## Discussion\n\n### Part 1\n\nThe initial task was to implement a Q-Learning agent to solve a gridworld maze which changes after\n1000 steps.\n\nTo implement this I modified the [CliffWalkingEnv](https://github.com/openai/gym/blob/master/gym/envs/toy_text/cliffwalking.py) \nfrom the OpenAI Gym toytext suite to use a\nwall rather than a cliff. Like CliffWalking the Gridworld environment is deterministic.\n\nNext I implemented a QAgent class to manage the agent policy, select actions, and perform learning\nupdates.\n\nFinally I instrumented the main script with relevant visualizations of learning progress using\nplotly.\n\n#### Usage\n\n`python part1.py` to run.\n\n`python part1.py --help` for all options.\n\nThis will train a `QAgent` until it accumulates 200 reward (by default). A random version \nof the QAgent will also be run as a baseline.\n\nThe script will generate four charts.\n\n - A visualization of the greedy version of the policy after 1000 steps\n - A graph of the cumulative reward over the first 1000 steps\n - A visualization of the greedy version of the policy after all training steps\n - A graph of the cumulative reward after all training_steps\n\nThe final chart should look like this:\n\u003cp align=\"center\"\u003e\u003cimg width=\"80%\" src=\"images/baseline-training.png\" /\u003e\u003c/p\u003e\n\n#### Notes\n\nThere is an interesting edge case here. If, prior to the wall update, the agent has not\naccumulated sufficient reward, the Q-table will not be populated with any values\nbelow the wall. This makes it *easier* for the agent to learn following the wall update. It\ndoesn't have to unlearn anything as the portion of the Q-table below the wall is still\nessentially random. 
You can observe this with:\n\n`python part1.py --seed-type bad`\n\nIf an agent receives sufficient reward prior to the wall update, then it has to overcome prior\nknowledge, and this takes substantially more time. Judging from the prompt this is the\npreferred case.\n\nIn Part 2 we delay the wall shift to 2000 steps to make it much more likely that agents \nhave learned sufficiently before demonstrating relearning.\n\n### Part 2\n\nThe second task is to extend the previous training to support parallel and asynchronous updates \nto the policy. An optional component was to use multiprocessing rather than threading and\nthat is done here.\n\nThe implementation draws heavily from [SubprocVecEnv](https://github.com/hill-a/stable-baselines/blob/master/stable_baselines/common/vec_env/subproc_vec_env.py) \nin stable-baselines. While that library \nallows for easy parallel training, it is synchronous across environments and requires Python 3.\n\nI back-ported and adapted several key pieces of that library to provide a vector of\n`AsyncQAgents` which each manage their own env and are trained asynchronously. Each agent updates\nthe shared `multiprocessing.Array` object (Q-Table) every `async_update` steps.\n\n#### Usage\n\n`python part2.py` to run.\n\n`python part2.py --help` for all options.\n\nThis will train [2, 4, 6, 8] agents on the gridworld. 
`wall_shift` is set to 2000 as noted above\nto allow for sufficient early learning.\n\nThe script will generate nine charts.\n\nFor each `worker_count` it will generate:\n\n - A visualization of the greedy version of the policy after all training steps\n - A graph of the cumulative reward after all training_steps, for each worker and on average\n\nFinally it will generate:\n\n - A graph of the average cumulative reward across the previous runs to compare the impact\n   of increasing `worker_count`.\n\nThe final chart should look like this:\n\u003cp align=\"center\"\u003e\u003cimg width=\"80%\" src=\"images/async-training.png\" /\u003e\u003c/p\u003e\n\n#### Notes\n\nHyperparameters are not tuned to produce the fastest training runs. Instead they are set to\nmore clearly visualize the influence of `worker_count` on training.\n\n### Part 3\n\nThe final task is to implement a deep Q network (DQN) in the style of the original Nature paper\nand train it on the Space Invaders gym environment.\n\n`part3.py` is the start of this implementation but it is incomplete due to time constraints.\nPlease review the code to see the extension of work from Part 2. Generally the approach \nfollows the same pattern:\n\nA shared policy (now with separate lock) is passed to multiple agents each with their own\ngym environment.\n\n#### Usage\n\nAs this is incomplete code there are additional requirements to get it running.\n\n - Python 3.6+\n - [pytorch](https://pytorch.org/get-started/locally/)\n - [stable-baselines](https://github.com/hill-a/stable-baselines)\n\n`python part3.py` - WARNING: This will consume all available CPU.\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiandanforth%2Fsai-assignment","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fiandanforth%2Fsai-assignment","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fiandanforth%2Fsai-assignment/lists"}