{"id":26005438,"url":"https://github.com/wb-az/reinforcement-learning","last_synced_at":"2026-05-16T09:34:44.004Z","repository":{"id":169710798,"uuid":"577082640","full_name":"Wb-az/Reinforcement-Learning","owner":"Wb-az","description":"Reinforcement Learning  and Deeep reinforcement Learning","archived":false,"fork":false,"pushed_at":"2023-06-05T15:18:05.000Z","size":714,"stargazers_count":0,"open_issues_count":0,"forks_count":1,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-03-05T21:04:40.394Z","etag":null,"topics":["bipedalwalker-v3","custom-environment","deep-learning","deep-reinforcement-learning","epsilon-greedy","learning-policies","lunarlandercontinuous-v2","policy-iteration-algorithm","python","pytorch-implementation","q-learning-algorithm","soft-actor-critic","soft-actor-critic-continuous"],"latest_commit_sha":null,"homepage":"","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/Wb-az.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-12-11T22:51:28.000Z","updated_at":"2023-05-21T20:33:38.000Z","dependencies_parsed_at":null,"dependency_job_id":"da52f144-c14b-464e-bf45-1c6bfdef0e52","html_url":"https://github.com/Wb-az/Reinforcement-Learning","commit_stats":null,"previous_names":["wb-az/reinforcement-learning"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/Wb-az/Reinforcement-Learning","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wb-az%2FReinforcement-Learning","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHu
b/repositories/Wb-az%2FReinforcement-Learning/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wb-az%2FReinforcement-Learning/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wb-az%2FReinforcement-Learning/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/Wb-az","download_url":"https://codeload.github.com/Wb-az/Reinforcement-Learning/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/Wb-az%2FReinforcement-Learning/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33096974,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-16T04:41:52.686Z","status":"ssl_error","status_checked_at":"2026-05-16T04:41:52.009Z","response_time":115,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["bipedalwalker-v3","custom-environment","deep-learning","deep-reinforcement-learning","epsilon-greedy","learning-policies","lunarlandercontinuous-v2","policy-iteration-algorithm","python","pytorch-implementation","q-learning-algorithm","soft-actor-critic","soft-actor-critic-continuous"],"created_at":"2025-03-05T20:56:53.305Z","updated_at":"2026-05-16T09:34:43.998Z","avatar_url":"https://github.com/Wb-az.png","language":"Jupyter 
Notebook","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Reinforcement Learning and Deep Reinforcement Learning\n\n\n**\u003ch2\u003e\u003cdiv\u003e1. Introduction\u003c/div\u003e\u003c/h2\u003e** \n\nThis repository contains three reinforcement learning tasks. \n1. A customised phoneme environment and arbitrary policies\n2. A Q-learning agent that trains on the phoneme environment\n3. An advanced implementation of the Deep Reinforcement Learning algorithm Soft Actor-Critic to train on continuous gym environments: LunarLander-v2 and BipedalWalker-v3\n\nThe scope and results of each task are summarised below.\n\n**\u003ch3\u003e\u003cdiv\u003e2. Task 1\u003c/div\u003e\u003c/h3\u003e**\n\nFor this task, a custom phonetic environment was built. The agent's mission is to identify words with one of \nthe phonetic sounds of the IPA English alphabet:  **ʊ**, **ʌ**, **uː** whilst avoiding hitting \nenvironment boundaries and movable obstacles. The agent was trained with three arbitrary policies: random, \nbiased, and combined. This task illustrates the learning process and the impact of environment design, including the size of the grid,\nthe reward, the number of phonemes in the grid and the moving obstacles, on the agent's performance \n(Figure 1).\n\n**\u003ch4\u003e\u003cdiv\u003e 2.1 Environment\u003c/div\u003e\u003c/h4\u003e**\n\nThe phoneme environment is a configurable N x M array of integers representing objects.\nAll objects except the wall are placed randomly in the environment. Each object is represented as follows:\n\n- 0: empty cell\n- 1: moving obstacle\n- 2: 'ʊ' word\n- 3: 'ʌ' word\n- 4: 'uː' word\n- 5: Agent\n- 6: Goal\n- 7: Boundaries/walls\n\nThe words are drawn randomly from the phoneme list. The grid can be adapted to collect the three \nsounds or any of their combinations with a minimal change in the reward and policy functions. For a more advanced task, each word with the same sound can be encoded with a number. 
In this work, the mission is to collect/learn the phonetic sound 'ʊ'.\n\n- The available area to place objects is the total grid area minus the boundary area:\n\na = M x N - (2 x (M + N) - 4)\n\n- The total numbers of words and movable obstacles on the grid are given by floor division of the number of empty \n  cells (refer to the notebook in the associated documents for the full description)\n\n  \n- There is only one goal (G) and one learner (A)\n\n**\u003ch4\u003e\u003cdiv\u003e 2.2 Actions\u003c/div\u003e\u003c/h4\u003e**\n\nThe actions available at each time step are:\n- up\n- down\n- left \n- right\n- grab \n\nAfter taking an action, the agent receives a reward and transitions to a new state. The environment then sends a signal indicating whether the game is over. \n\n\n**\u003ch4\u003e\u003cdiv\u003e 2.3 Observations\u003c/div\u003e\u003c/h4\u003e**\n\nThe observation of the environment is a dictionary that contains:\n- relative coordinates to all words in the grid\n- relative coordinates to the goal \n- relative coordinates to the obstacles\n- a neighbourhood 3x3 array with the encoded values \n- a counter indicating the words left\n- relative distance to the obstacles\n- the current location of the agent\n\n\n**\u003ch4\u003e\u003cdiv\u003e 2.4 Policies\u003c/div\u003e\u003c/h4\u003e**\n- Goal-oriented \"Biased policy\" - grabs a sound only when the agent is at the same position as a word with the \n  defined sound, and \n  searches for the Goal.\n- Random policy - takes actions randomly if there is no sound at the same location.\n- Combined policy - explores with probability p = epsilon, otherwise follows the biased policy.\n\n\n**\u003ch4\u003e\u003cdiv\u003e2.5 Rewards\u003c/div\u003e\u003c/h4\u003e**\n\n- -1 per time step\n- -20 for hitting a moving obstacle \n- -10 for grabbing in an empty cell or hitting a wall\n- -10 for grabbing a word with the 'ʊ' sound\n- -20 for grabbing ʌ_pos and uː\n- 100 if grabbing the correct sound\n\n-  reaching the goal if all ʊ were collected: area x phonemes 
collected\n-  reaching the goal with ʊ left: area x (total phonemes - phonemes collected)\n\n#### Associated file\n1. ```t1_phoneme_environment.ipynb``` - the Jupyter notebook with the class environment, policies, comparison and visualisation of the stats\n\n\n\n\u003cimg src=\"results/env_7x7.png\" alt=\"Phonetic Env\" height=\"200\"/\u003e __A__  \u003cimg src=\"results/env_comparison_50_ep.png\" alt=\"Env Comp.\" height=\"220\"/\u003e __B__\n\nFigure 1. __A__. Configurable phonetic environment of size 7 x 7. __B__.\nComparison of the policies at different environment configurations after 50 epochs\nof training.\n\n\n### Task 2\n\nIn this task, the agent follows the Q-learning algorithm (an off-policy algorithm) for the learning \nprocess. The reward per episode improves markedly in comparison with Task 1. The effects of the environment size, epsilon and alpha (the learning rate) on the learning process were also compared.\n\n\u003cfigure\u003e \u003cimg src=\"results/ql_comparison_30000_10x10.png\" alt=\"Phonetic Env\" height=\"250\"\u003e  \n\u003cfigcaption\u003e  Figure 2. Comparison of the Q-agent performance in an environment of size 10 x \n10 with learning rates and epsilons of 0.1, 0.5 and 1.0 and 30,000 training epochs.\n\u003c/figcaption\u003e\n\u003c/figure\u003e\n\n#### Associated files\n1. ```phonemes.py``` - the class environment\n2. ```plotting.py``` - a function to visualise the statistics from training\n3. ```t2_qlearning.ipynb``` - the Jupyter notebook of the Q-learning implementation.\n\n\n\n### Task 3 \n\nThis task uses an advanced algorithm, Soft Actor-Critic (SAC), which combines policy-based and value-based methods. The agent learns both the policy and the value function. 
Two continuous gym environments were used in this task:\n- LunarLander-v2\n- BipedalWalker-v3\n\nThe LunarLander-v2 Continuous environment is complete when the agent reaches a reward \u003e= 200.\nThe BipedalWalker-v3 Continuous environment is complete after 100 consecutive episodes with an average reward \u003e= 300.\n\nThe best results are summarised in Table 1.\n\nTable 1. Training results of the continuous gym environments LunarLander-v2 and BipedalWalker-v3 \nwith the Soft Actor-Critic algorithm. The actor and critics had three hidden layers with 256 \nhidden units. The batch size was set to 256.\n\n|Exp   | Environment      | Memory | Learning rate \u003cbr\u003e actor / critic|  Tau  | Reward \u003cbr\u003e scale | Exploration |  Episodes  | Steps \u003cbr\u003e to learn |\n|:----:|:-----------------|:------:|:------------------:|:-----:|:-----------------:|:-----------:|:--------:|:----------------------------:|\n|Exp-01| LunarLander-v2   |  5e5   |   0.0003 / 0.0003  | 0.005 |         1         |    1000     | 500      |      104877\u003csup\u003e1\u003c/sup\u003e       |\n|Exp-02| LunarLander-v2   |  5e5   |   0.0005 / 0.0003  | 0.01  |         0.5       |    1000     | 500      |      111323\u003csup\u003e1\u003c/sup\u003e       |\n|Exp-03| LunarLander-v2   |  5e5   |   0.0005 / 0.0003  | 0.05  |         1         |    1000     | 500      |      82458\u003csup\u003e1\u003c/sup\u003e        |\n|Exp-04| BipedalWalker-v3 |  5e5   |   0.0001 / 0.0001  | 0.01  |         1         |    1000     | 600      |      348007                    |\n|Exp-05| BipedalWalker-v3 |  1e5   |   0.0001 / 0.00005 | 0.01  |         1         |    1000     | 500      |      364085                    |\n|Exp-06| BipedalWalker-v3 |  5e5   |   0.0001 / 0.0002  | 0.05  |         1         |    1000     | 700      |      455524\u003csup\u003e2\u003c/sup\u003e       |\n\n\u003csup\u003e1\u003c/sup\u003e Solved the environment in the learning 
steps.\n\n\u003csup\u003e2\u003c/sup\u003e Solved the environment in the listed steps to learn, out of a total of 617406 steps.\n\n\n\u003cimg src=\"results/lunar/sum of rewards per episode_500_0.0003.png\" alt=\"exp_01\" width=\"325\"/\u003e__A__\u003cimg src=\"results/lunar02/sum of rewards per episode_500_0.0003.png\" alt=\"exp_02\" width=\"325\"/\u003e__B__\u003cimg src=\"results/lunar03/sum of rewards per episode_500_0.0005.png\" alt=\"exp_03\" width=\"325\"/\u003e__C__\n\n  \u003cp align=\"center\"\u003e\n      Figure 3. LunarLander-v2 Continuous training graphs. A. Experiment Exp-01. B. Experiment Exp-02. C. Experiment Exp-03.\n  \u003c/p\u003e\n\n\u003cimg src=\"results/bipedal/sum of rewards per episode_600_0.0001.png\" alt=\"exp_04\" width=\"325\"/\u003e__A__\u003cimg src=\"results/bipedal02/sum of rewards per episode_600_0.0001.png\" alt=\"exp_05\" width=\"325\"/\u003e__B__\u003cimg src=\"results/bipedal03/sum of rewards per episode_700_0.0002.png\" alt=\"exp_06\" width=\"325\"/\u003e__C__ \n\n \u003cp align=\"center\"\u003e\n      Figure 4. BipedalWalker-v3 Continuous training graphs. A. Experiment Exp-04. B. Experiment Exp-05. C. Experiment Exp-06.\n \u003c/p\u003e\n\nhttps://user-images.githubusercontent.com/120340996/224035224-b9781120-5825-4484-8113-3957487f448c.mp4\n\n\n\nhttps://github.com/Wb-az/Reinforcement-Learning/assets/120340996/a1a4970d-bf76-445a-a70c-ae7b81d28e35\n\n\n\n\n\n\n\n#### Associated files (six)\n1. ```utils``` - this folder contains four .py files:\n-\t```networks_architecture.py``` - contains the policy, value function and critic approximators\n-\t```memory.py``` - a method to save the agent transitions\n-\t```sac.py``` - the implementation of the SAC algorithm\n-\t```plotting``` - a function to visualise the statistics from training\n\n2. ```main.py``` - trains the agent and evaluates its performance \n3. 
```t3_sac_main.ipynb``` - the Jupyter notebook version of the main script, designed to run on Google Colab GPUs \n\n\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwb-az%2Freinforcement-learning","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fwb-az%2Freinforcement-learning","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fwb-az%2Freinforcement-learning/lists"}