{"id":18584089,"url":"https://github.com/e-candeloro/reinforcement-learning-maze-solver","last_synced_at":"2026-04-28T13:32:46.451Z","repository":{"id":114095429,"uuid":"497131776","full_name":"e-candeloro/Reinforcement-Learning-Maze-Solver","owner":"e-candeloro","description":"A Python script that executes a RL algorithm (Temporal Difference/Q-Learning) that trains an agent inside a labyrinth to find the exit with the least number of steps possible","archived":false,"fork":false,"pushed_at":"2022-05-29T09:31:24.000Z","size":253,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"master","last_synced_at":"2025-05-16T05:37:02.764Z","etag":null,"topics":["maze-solver","python","qlearning","qlearning-algorithm","reinforcement-learning","reinforcement-learning-algorithms","rl","temporal-differencing-learning"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/e-candeloro.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2022-05-27T20:36:53.000Z","updated_at":"2022-05-30T15:40:00.000Z","dependencies_parsed_at":"2023-06-12T17:42:58.961Z","dependency_job_id":null,"html_url":"https://github.com/e-candeloro/Reinforcement-Learning-Maze-Solver","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/e-candeloro/Reinforcement-Learning-Maze-Solver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FReinforcement-Learning-Maze-Solver","tags_url":"https
://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FReinforcement-Learning-Maze-Solver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FReinforcement-Learning-Maze-Solver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FReinforcement-Learning-Maze-Solver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/e-candeloro","download_url":"https://codeload.github.com/e-candeloro/Reinforcement-Learning-Maze-Solver/tar.gz/refs/heads/master","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/e-candeloro%2FReinforcement-Learning-Maze-Solver/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":263339941,"owners_count":23451518,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["maze-solver","python","qlearning","qlearning-algorithm","reinforcement-learning","reinforcement-learning-algorithms","rl","temporal-differencing-learning"],"created_at":"2024-11-07T00:26:07.983Z","updated_at":"2026-04-28T13:32:41.420Z","avatar_url":"https://github.com/e-candeloro.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# Agent navigation in a maze using TD/Q-Learning Reinforcement Learning\n\nThis is the second homework done for the course of Automated Decision Making (2021-2022) at the University of Modena and Reggio Emilia.\nIt consists of a Python script that executes a RL algorithm (Temporal Difference/Q-Learning) that trains an agent inside a labyrinth to find the 
exit with the least number of steps possible.\n\n## Setup\nThe default maze for the agent to escape is shown in the image below.\n\n![maze](images/maze.jpg)\n\n### Considerations\n- The maze is represented with a number matrix, with 0 for the white squares and 1 for the walls (special values are used for the start and finish positions).\n- At each step the agent occupies one of the white squares.\n- At each step the agent has 4 possible actions: up, down, left,\nright.\n- The state is the current position (square).\n- The reward is -1 if the action is feasible (recall that we want\nto minimize the total length of the path), while it is -30 if the\naction is infeasible (it moves toward a black square).\n- The training uses one-step temporal difference\nlearning, TD(0), to learn the q(s, a) function.\n- The learned q(s, a) is used for the tests.\n\n### The TD(0) or Q-Learning algorithm (pseudocode)\n\n![Q-learning](images/Q_learning_alg.jpg)\n\n## SCRIPT \u0026 ALGORITHM DESCRIPTION\n\nIn the main script, two classes are present: the **Environment** class and the **TDAgent** class. These two classes are used\nin the **main()** function to run the TD Reinforcement Learning algorithm. The **main()** function takes the\nfollowing parameters (some of which are used by the two classes):\n\n- **num episodes**: number of episodes for updating the policy.\n- **α**: weight given to the TD error, to update the policy online.\n- **γ**: discount factor for the future state-action reward.\n- **ε**: probability (normalized by 100) of taking a random action during training using the ε-greedy approach.\n- **import_maze_csv**: when True, imports an example maze from a .csv file.\n- **show_training**: when True, shows the training steps that the agent makes, for every episode until the end of all the\nepisodes. 
It is useful to visualize the improvement of the agent between the episodes.\n\n## Training\n\nAfter the instantiation of the Environment and TDAgent objects, a training phase is started.\nFirst the policy **Q(S,A)** is initialized with random values, then a **for loop cycles over each episode**; at the start of\neach episode the state of the agent is set to the initial one, S0.\n\nThe following **pseudocode loop (for each episode)** is implemented:\n1. Given the state **S**, select an action **A** using an ε-greedy policy **Qε(S,A)**.\n2. Given the selected action A, perform the action in the environment and compute the reward R, the next state **S′**\nand the **is_over** flag (which indicates whether the agent has reached the exit of the maze).\n3. Update the policy matrix using the ***TD(0)*** update:\n   \n   **Q(S,A) ← Q(S,A) + α[R + γ max_a Q(S′,a) − Q(S,A)]**\n4. Repeat until the episode terminates (the agent reaches the maze exit).\n\n## Testing and Results\n***The agent is able to autonomously travel from start to finish in the fewest moves thanks to the learned greedy policy.***\n\nAt the end of the training, to show the results, a version of the maze with the learned policy is printed. \nAfter pressing\nENTER, a small simulation is executed to show how the agent follows the learned policy to reach the end of the maze. \nA simple loop (where the greedy policy Q(S,A) is exploited) is used for this purpose.\n\n\nThe algorithm was also tested with a custom\nmaze from the labyrinth.csv file. 
\n\nThe images below show the maze used, the agent in action and the learned policy.\n\n![Agent start](images/agent_start.jpg) ![Agent move](images/agent_moves.jpg) ![Agent Policy](images/learned_policy.jpg)\n\n## Important Notes\nMore comments and explanations are present in the H2_CANDELORO_python.py script.\nThe whole readme was originally written in the H2_CANDELORO_description.pdf file.\n\nThe code was adapted\nand reformatted from a previous exercise done for the course of Machine Learning and Deep Learning (2020-2021) at the\nUniversity of Modena and Reggio Emilia.\n\nThe original project and code can be found at [this link](https://drive.google.com/drive/folders/1btN4CHqwsDtXdGXHTj7CMlHoKsS7ob2Z?usp=sharing).\n\nAll credit for the code snippets used goes to the original author(s).\n\n\n\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-candeloro%2Freinforcement-learning-maze-solver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fe-candeloro%2Freinforcement-learning-maze-solver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fe-candeloro%2Freinforcement-learning-maze-solver/lists"}