{"id":25620620,"url":"https://github.com/leogaudin/learn2slither","last_synced_at":"2026-05-18T15:35:47.908Z","repository":{"id":276147012,"uuid":"925321923","full_name":"leogaudin/Learn2Slither","owner":"leogaudin","description":"42 · An introduction guide to reinforcement learning, teaching a snake how to behave in an environment through trial and error.","archived":false,"fork":false,"pushed_at":"2025-03-13T12:37:49.000Z","size":494,"stargazers_count":0,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-07-18T13:49:45.351Z","etag":null,"topics":["42","ai","algorithm","artificial-intelligence","deep-reinforcement-learning","game","python","reinforcement-learning","snake","snake-ai"],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/leogaudin.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null,"zenodo":null}},"created_at":"2025-01-31T16:52:46.000Z","updated_at":"2025-03-13T12:37:53.000Z","dependencies_parsed_at":"2025-02-06T15:35:17.153Z","dependency_job_id":"9310295e-082a-4e09-a9e9-4179f0fe88f2","html_url":"https://github.com/leogaudin/Learn2Slither","commit_stats":null,"previous_names":["leogaudin/learn2slither"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/leogaudin/Learn2Slither","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2FLearn2Slither","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2FLearn2Slither/tags","releases_url":"https://rep
os.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2FLearn2Slither/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2FLearn2Slither/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/leogaudin","download_url":"https://codeload.github.com/leogaudin/Learn2Slither/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/leogaudin%2FLearn2Slither/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":33183079,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-05-18T09:27:30.708Z","status":"ssl_error","status_checked_at":"2026-05-18T09:27:28.300Z","response_time":71,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.5:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["42","ai","algorithm","artificial-intelligence","deep-reinforcement-learning","game","python","reinforcement-learning","snake","snake-ai"],"created_at":"2025-02-22T07:19:35.526Z","updated_at":"2026-05-18T15:35:47.879Z","avatar_url":"https://github.com/leogaudin.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"\u003cdiv align='center'\u003e\n\t\u003ch1\u003e🐍 Learn2Slither\u003c/h1\u003e\n\t\u003cimg src=\"https://img.shields.io/badge/-122%2F100-success?logo=42\u0026logoColor=fff\" /\u003e\n\u003c/div\u003e\n\n\u003e ⚠️ This 
tutorial assumes you have done [multilayer-perceptron](https://github.com/leogaudin/multilayer-perceptron) and [dslr](https://github.com/leogaudin/dslr).\n\n## Table of Contents\n\n- [Introduction](#introduction) 👋\n\n- [Q-Learning](#q-learning) 🧠\n\n- [The Snake game](#the-snake-game) 🐍\n\n- [The model, and how to handle PyTorch](#the-model-and-how-to-handle-pytorch) 🤖\n    - [The update rule](#the-update-rule) 🔄\n    - [Replay memory](#replay-memory) 💭\n    - [PyTorch shenanigans](#pytorch-shenanigans) 🤯\n\n- [Training the model](#training-the-model) 🚀\n    - [Hyperparameters](#hyperparameters) 🎛\n        - [`gamma`](#gamma)\n        - [`epsilon_init`, `epsilon_min`, `epsilon_decay`](#epsilon_init-epsilon_min-epsilon_decay)\n        - [`lr`](#lr)\n        - [`max_memory`](#max_memory)\n        - [`batch_size`](#batch_size)\n\n    - [Rewards](#rewards) 🎁\n        - [Tip if your snake starts to go in circles](#tip-if-your-snake-starts-to-go-in-circles) 🔄\n\n- [About this repository](#about-this-repository) 📚\n\n- [Resources](#resources) 📖\n\n## Introduction\n\nLearn2Slither introduces a new concept in our machine learning journey: **reinforcement learning**.\n\nReinforcement learning is used to teach an agent how to behave in an environment by performing actions and observing the rewards it gets from them.\n\nIt is appropriate **for problems where it is not possible to have a dataset of examples to learn from**, but where it is possible to interact with the environment and learn from the feedback it provides.\n\nIn this guide, we will use a specific type of reinforcement learning called **Deep Q-Learning** to teach an agent how to play the game of Snake.\n\n## Q-Learning\n\nQ-Learning is a reinforcement learning algorithm that associates a \"quality\" with each action in a given state.\n\nIn Snake, an example could be \"*If I have a wall in front of me, the quality of the action 'go forward' is very low*\". 
Because you die.\n\nThe objective of Q-Learning is basically this: given a state, it must output the best action to take.\n\nOnce again, we have an obscure function to represent this:\n\n$$\nQ(s, a) = r + \\gamma \\max_{a'} Q(s', a')\n$$\n\nLet's demystify this:\n\n- $Q(s, a)$ is the quality of the action $a$ in the state $s$.\n- $r$ is the reward of taking action $a$ in the state $s$.\n- $\\gamma$ is the discount factor (basically, how much we care about the future).\n- $s'$ is the next state.\n- $a'$ is the next action.\n- $\\max_{a'} Q(s', a')$ is the maximum predicted quality of the next action in the next state (basically, asking our model the best quality attainable in the next state, the best we can hope for).\n\nSo, we know we have to use Q-Learning, and that this algorithm requires inputting a state and outputting actions, so we need to:\n\n1. Code the game of Snake in a way that we can:\n    1. get the state of the game at each step\n    2. take an action to go to the next step.\n\n2. Create a model/agent that:\n    1. takes the state of the game\n    2. outputs the best action to take\n    3. applies Q-Learning at each step to tune its parameters.\n\n## The Snake game\n\nWatch [this video](https://www.youtube.com/watch?v=L8ypSXwyBds) to understand how to code the game of Snake with PyGame, as this guide will not cover it extensively.\n\nHowever, an important takeaway is that our action space will be **straight, left, right**. 
Going backward always results in death, so that action can be ignored for better training and simplicity.\n\nThe observation space in the video is not really applicable to this project, as the subject clearly states that the snake can only see things straight, left, right, and behind, starting from its head.\n\nThat means **giving the snake the exact relative position of the apples is not possible**.\n\nDon't worry, you will get ideas on how to represent the state of the game later on.\n\nHowever, one crucial thing is what you return every time a step is played. Basically, your `play_step` should take an action and return:\n\n- What **state** the game **was** in.\n- What **action** was taken.\n- What **reward** was given **for that action**.\n- What **state** the game is in **after that action**.\n- If the game is **done**.\n\nYou should be able to call it as follows:\n\n```python\nstate, action, reward, next_state, done = game.play_step(action)\n```\n\n## The model, and how to handle PyTorch\n\nYou should now have a functioning game, and you should be able to get the state of the game at each step.\n\nAt the time of writing, the state used in this repository is an array of:\n\n- **How much the snake is moving** (if it is going in circles or exploring).\n- The **last move** the snake made (straight, left, right).\n- The **danger** right next to the snake (if there is a wall or the snake's body).\n- If there is a **green apple** in the snake's vision.\n- If there is a **red apple** in the snake's vision.\n\nNow we are going to use PyTorch to create a model, similar to the one we used in [`multilayer-perceptron`](https://github.com/leogaudin/multilayer-perceptron), but this time we don't have to recode the whole framework.\n\nWe will use a simple neural network with 4 layers: an input layer, 2 hidden layers, and an output layer.\n\nThe model class will look as simple as this:\n\n```python\nimport torch\nimport torch.nn as nn\nimport torch.nn.functional as F\n\n\nclass 
DQN(nn.Module):\n    def __init__(\n        self,\n        n_observations,\n        n_actions,\n    ):\n        super().__init__()\n        self.layer1 = nn.Linear(n_observations, 42, dtype=torch.float32)\n        self.layer2 = nn.Linear(42, 42, dtype=torch.float32)\n        self.layer3 = nn.Linear(42, n_actions, dtype=torch.float32)\n\n    def forward(self, x: torch.Tensor) -\u003e torch.Tensor:\n        x = F.relu(self.layer1(x))\n        x = F.relu(self.layer2(x))\n        return self.layer3(x)\n```\n\nThe `forward` method is the one that will be called when we input a state to the model.\n\nAs you can see, the logic is nothing new: a simple feedforward network, with all the information we want to pass in our state as an input, and the actions we can take as an output.\n\nWhat will differ from the previous project is how to calculate the loss and update the model.\n\n### The update rule\n\nLet's take our Q-Learning formula from earlier:\n\n$$\nQ(s, a) = r + \\gamma \\max_{a'} Q(s', a')\n$$\n\nAs you may have understood earlier, this gives us the **maximum quality we can hope for given the action we just took**.\n\nDuring the game, this will allow us to update our model.\n\nIf the game is done, there is no next state to consider, so this update rule simply becomes:\n\n$$\nQ(s, a) = r\n$$\n\nLet's take an example to see how we can implement it:\n\n- Given a state $s$, your model outputs the following Q-values for \"go straight\", \"go left\", and \"go right\" respectively: $[0.1, 0.2, 0.3]$.\n\n- \"go right\" is the maximum value, so you take it, and get a reward of $1$.\n\n- The next state is $s'$; you give it to your model and get the following Q-values: $[0.2, 0.3, 0.4]$. 
So given the action \"go right\", you can hope for a maximum quality of $0.4$ in the future.\n\n- The quality of the action \"go right\" in the state $s$ is then updated to:\n\n$$\nQ(s, \\text{\"go right\"}) = 1 + \\gamma \\times 0.4\n$$\n\nNow that's great, but how do you calculate the loss given this information?\n\nWell, you simply assign the quality you just calculated to the Q-value of the action you took for that state:\n\n```python\nimport torch\nfrom torch.nn import MSELoss\n\n# Q-values predicted by the model for the current state and the next state.\nprediction = torch.tensor([0.1, 0.2, 0.3])\nnext_state_prediction = torch.tensor([0.2, 0.3, 0.4])\ntarget = prediction.clone()\n\naction = 2  # \"go right\"\nreward = 1\ngamma = 0.9\n\n# Update rule: r + gamma * max Q(s', a'). The reward is only counted once.\nnew_q = reward + gamma * next_state_prediction.max()\n\n# Only the Q-value of the action that was taken changes in the target.\ntarget[action] = new_q\n\nloss = MSELoss()(prediction, target)\n```\n\n\u003e ⚠ The code above is a simplification to illustrate the concept.\n\n### Replay memory\n\nIn practice, you will update your model at each step, but a game is not only defined by actions taken one step at a time.\n\nBecause of that, you will also need to store the transitions you made during the game in a **replay memory**.\n\nUsing the example from above, you will append this set of information to a list after each move:\n\n```python\nstate, action, reward, next_state, done = game.play_step(action)\n\nreplay_memory.append((state, action, reward, next_state, done))\n```\n\nThis memory can be represented as a matrix of shape $(\\text{nTransitions}, 5)$, where each row is a transition.\n\nThat will allow you to use the same function to train your model, whether it is on one transition or on a batch of transitions.\n\n\u003e 💡 A single step can simply be represented as a $(1, 5)$ matrix with a bit of manipulation.\n\nEvery time the game is done, you will sample a batch of transitions from the replay memory, and update your model with it.\n\n```python\nbatch = random.sample(replay_memory, batch_size)\nstates, actions, rewards, next_states, dones = zip(*batch)\n\nprediction = model(states)\ntarget = prediction.clone()\n\nfor i in 
range(len(dones)):\n    if dones[i]:\n        # Terminal transition: there is no next state, so the target is the reward.\n        target[i][actions[i]] = rewards[i]\n    else:\n        # Best Q-value the model predicts for the next state.\n        max_future_q = max(model(next_states[i]))\n        # Update rule: r + gamma * max Q(s', a').\n        target[i][actions[i]] = rewards[i] + gamma * max_future_q\n\nloss = MSELoss()(prediction, target)\n```\n\n\u003e 💡 As you may have noticed, the batches are not transitions in order, but rather random samples. This might sound counterintuitive, but random sampling breaks the correlation between consecutive transitions, which helps the model avoid overfitting to the most recent sequences.\n\n### PyTorch shenanigans\n\nIf you are coming from `multilayer-perceptron`, you might get **confused** by how PyTorch works, especially **when it comes to backpropagation**.\n\nIf we take the code above, the backpropagation would basically be:\n\n```python\n# self.optimizer = torch.optim.Adam(self.model.parameters(), lr=0.001)\n# ...\n# loss = MSELoss()(prediction, target)\n\nself.optimizer.zero_grad()  # Clear the gradients left over from the previous update.\nloss.backward()  # Compute the gradient of the loss for every parameter.\nself.optimizer.step()  # Update the parameters using those gradients.\n```\n\nThat is weird, right? We only call some `backward` method on the loss, and then we call `step` on the optimizer.\n\nOne could think that we would need to pass the gradient of the loss to the optimizer, and then tell it to perform backpropagation, but PyTorch handles this for us.\n\nEvery time we perform an operation with a tensor, **PyTorch keeps track of the operations and the gradients**, so when we call `backward` on the loss, PyTorch computes the gradient of every parameter of the model, and `step` uses those gradients to update the parameters.\n\n\u003e ⚠ That is also why you should be consistent with your crucial operations, for example not switching to NumPy to perform some operations, as PyTorch will not be able to track the gradients.\n\n## Training the model\n\nNow, you have a model, you have a game, and you have a way to update the model.\n\nYou can now start to figure out how to train the model.\n\n\u003e ⚠ You should separate the logic of training and the logic of playing the game. 
For instance, with `Agent` and `Game` classes.\n\nThe play logic will look like this:\n\n```python\n# Play one step and collect the resulting transition.\nstate, action, reward, next_state, done = game.play_step(action)\n\n# Store the transition, then train on it immediately.\nself.memory.append((state, action, reward, next_state, done))\nself.train_short_memory(state, action, reward, next_state, done)\n\nif done:\n    # At the end of a game, train on a batch sampled from the replay memory.\n    self.train_long_memory()\n    game.reset()\n```\n\n### Hyperparameters\n\nYou will need to tune some hyperparameters to get the best results:\n\n- `gamma`\n- `epsilon_init`, `epsilon_min`, `epsilon_decay`\n- `lr`\n- `max_memory`\n- `batch_size`\n\nThe guidelines given here might vary for your implementation, and the **best way to tune them is trial and error**. However, we will try to stay as general as possible.\n\n#### `gamma`\n\nThe **discount factor** is a crucial hyperparameter in reinforcement learning, as it will determine how much you care about the future.\n\nA **high discount factor** (close to `1`) will make the agent **care more about the future**, while a **low discount factor** (close to `0`) will make the agent **care more about the immediate reward**.\n\n#### `epsilon_init`, `epsilon_min`, `epsilon_decay`\n\n`epsilon` is the **exploration rate**, and is pretty straightforward:\n\nEvery time the agent has to take an action, it will choose a random action with a probability of `epsilon`.\n\n1. Generate a random number between 0 and 1.\n2. If the number is less than `epsilon`, take a random action.\n3. 
Otherwise, take the action with the highest Q-value predicted by the model.\n\nThe exploration rate will start at `epsilon_init`, and will decay at each step until it reaches `epsilon_min`.\n\nIt is generally **a good idea to start with a very high exploration rate**, like `0.9`.\n\nFurthermore, you might want to keep `epsilon_min` a bit higher than `0` to keep some exploration in the model, even if it means performing worse during training.\n\nFor example, your model might stagnate and frequently hit walls because of a random action taken, but during evaluation, `epsilon` will be `0` and the model will perform better.\n\n#### `lr`\n\nThe **learning rate** controls how large each update of the model's parameters is. You should already know that.\n\nThis one is particularly hard to choose, so you might want to try different values, anywhere between `0.0001` and `0.1`.\n\n#### `max_memory`\n\nThe **maximum size of the replay memory** will determine how much the model can learn from the past.\n\nIn a game like Snake, where the best action depends mostly on the current state rather than on a long history, you can keep this value low if you want to save memory.\n\n#### `batch_size`\n\nThe **size of the batch** determines how many transitions are sampled from the replay memory for each update of the model.\n\nGenerally, a bigger batch gives the model more transitions to learn from at each update, but makes the training slower.\n\nIf you are short on memory, you might want to keep this value low, but never below 32.\n\nYou can check out this [discussion](https://ai.stackexchange.com/questions/23254/is-there-a-logical-method-of-deducing-an-optimal-batch-size-when-training-a-deep).\n\n### Rewards\n\nIn this project, the subject gives indications about the rewards you should give to the agent:\n\n\u003e - If the snake eats a red apple: a negative reward.\n\u003e - If the snake eats a green apple: a positive reward.\n\u003e - If the snake eats nothing: a smaller negative reward.\n\u003e - If the snake is Game over (hit a wall, hit itself, null 
length): a bigger negative reward.\n\nThe relative magnitudes of the rewards are important. **If they are too low** for something we want the agent to do, **it will not care about it**.\n\nAn example of rewards could be:\n\n- If the snake eats a red apple: `-25`\n- If the snake eats a green apple: `25`\n- If the snake eats nothing: `-2.5`\n- If the snake is Game over: `-100`\n\nHowever, once again, the best way to tune them is trial and error.\n\n#### Tip if your snake starts to go in circles\n\nA frequent problem with the Snake game is that the snake will start to go in circles, because it constitutes a safe way to avoid the large penalty for dying.\n\nThis might happen if the reward for eating a green apple is too low, though that is not the only possible cause.\n\nOne trick for this is first to pass an indication of how much the snake is moving in the state, and then adapt the \"eat nothing\" reward to this.\n\n\u003e 💡 Simply penalizing it will have little to no effect: the agent also needs to receive this indication as an input to be able to exploit it.\n\nFor this, you can use the **standard deviation**.\n\nIf the standard deviation of the snake's position is low, it means it is going in circles.\n\nIf the standard deviation is high, it means it is exploring.\n\nLet's take our base reward $-2.5$, and make it inversely proportional to the cube of the standard deviation.\n\n$$\n\\text{eatNothingReward} =  \\frac{-2.5}{\\text{std}^3}\n$$\n\nWhere $\\text{std}$ is the mean of the standard deviations of the $x$ and $y$ positions of the snake.\n\nFor a standard deviation of $0.5$, the reward will be $-20$, and for a standard deviation of $2$, the reward will be $-0.3125$.\n\nThis makes it less attractive for the snake to repeat the same patterns.\n\n## About this repository\n\nThe models available in this repository were trained using the following hyperparameters:\n\n| Parameter | final_1000.pth |\n| --- | --- |\n| `game_width` | 800 |\n| `game_height` | 800 |\n| `block_size` | 80 |\n| `training_best_score` | 17 |\n| 
`training_mean_score` | 3.139 |\n| `testing_best_score` | 43 |\n| `testing_mean_score` | 21.903 |\n| `network` | 13-42-42-3 |\n| `gamma` | 0.95 |\n| `epsilon_init` | 0.9 |\n| `epsilon_min` | 0.2 |\n| `epsilon_decay` | 0.995 |\n| `lr` | 0.01 |\n| `max_memory` | 1000000 |\n| `batch_size` | 1024 |\n| `alive_reward` | -2.5 |\n| `death_reward` | -100 |\n| `green_apple_reward` | 25 |\n| `red_apple_reward` | -25 |\n\n## Resources\n\n- [📺 YouTube − Reinforcement Learning: Crash Course AI #9](https://www.youtube.com/watch?v=nIgIv4IfJ6s)\n- [📺 YouTube − Reinforcement Learning from scratch](https://www.youtube.com/watch?v=vXtfdGphr3c)\n- [📺 YouTube − Neural Network Learns to Play Snake](https://www.youtube.com/watch?v=zIkBYwdkuTk)\n- [📺 YouTube − Python + PyTorch + Pygame Reinforcement Learning – Train an AI to Play Snake](https://www.youtube.com/watch?v=L8ypSXwyBds)\n- [📖 PyTorch − Reinforcement Learning (DQN) Tutorial](https://pytorch.org/tutorials/intermediate/reinforcement_q_learning.html)\n- [📖 HuggingFace − The Deep Q-Learning Algorithm](https://huggingface.co/learn/deep-rl-course/unit3/deep-q-algorithm)\n- [📖 arXiv − Accelerated Methods for Deep Reinforcement Learning](https://arxiv.org/pdf/1803.02811): information about batch sizes.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleogaudin%2Flearn2slither","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fleogaudin%2Flearn2slither","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fleogaudin%2Flearn2slither/lists"}