{"id":16836328,"url":"https://github.com/aleju/mario-ai","last_synced_at":"2025-04-05T08:08:47.933Z","repository":{"id":44758390,"uuid":"57999007","full_name":"aleju/mario-ai","owner":"aleju","description":"Playing Mario with Deep Reinforcement Learning","archived":false,"fork":false,"pushed_at":"2016-05-26T22:37:21.000Z","size":538,"stargazers_count":693,"open_issues_count":3,"forks_count":142,"subscribers_count":48,"default_branch":"master","last_synced_at":"2025-03-29T07:09:35.868Z","etag":null,"topics":["agent","deep-learning","deep-reinforcement-learning","machine-learning","mario","reward","torch"],"latest_commit_sha":null,"homepage":"","language":"Lua","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/aleju.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null}},"created_at":"2016-05-03T20:14:18.000Z","updated_at":"2025-03-28T13:05:19.000Z","dependencies_parsed_at":"2022-09-23T05:04:20.023Z","dependency_job_id":null,"html_url":"https://github.com/aleju/mario-ai","commit_stats":null,"previous_names":[],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fmario-ai","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fmario-ai/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fmario-ai/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/aleju%2Fmario-ai/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/aleju","download_url":"https://codeload.github.com/aleju/mario-ai/tar.gz/refs/heads/master","host":{"name":"GitHub","url":"https://
github.com","kind":"github","repositories_count":247305935,"owners_count":20917208,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["agent","deep-learning","deep-reinforcement-learning","machine-learning","mario","reward","torch"],"created_at":"2024-10-13T12:13:07.625Z","updated_at":"2025-04-05T08:08:47.914Z","avatar_url":"https://github.com/aleju.png","language":"Lua","funding_links":[],"categories":["Model Zoo","Lua"],"sub_categories":["Reinforcement Learning"],"readme":"# About\n\nThis project contains code to train a model that automatically plays the first level of Super Mario World using only raw pixels as the input (no hand-engineered features).\nThe used technique is deep Q-learning, as described in the [Atari paper](http://arxiv.org/abs/1312.5602) ([Summary](https://github.com/aleju/papers/blob/master/neural-nets/Playing_Atari_with_Deep_Reinforcement_Learning.md)), combined with a [Spatial Transformer](https://arxiv.org/abs/1506.02025).\n\n# Video\n\n[![Model playing SMW](images/youtube.png?raw=true)](https://www.youtube.com/watch?v=L4KBBAwF_bE)\n\n# Methodology\n\n## Basics, replay memory\n\nThe training method is deep Q-learning with a replay memory, i.e. 
the model observes sequences of screens,\nsaves them into its memory and later trains on them, where \"training\" means that it learns to accurately predict the expected action reward values\n(\"action\" means \"press button X\") based on the collected memories.\nThe replay memory has by default a size of 250k entries.\nWhen it starts to get full, new entries replace older ones.\nFor the training batches, examples are chosen randomly (uniform distribution) and rewards of memories are reestimated based on what the network has learned so far.\n\n## Inputs, outputs, actions\n\nEach example's input has the following structure:\n* The last T actions, each as two-hot-vectors. (Two, because the model can choose two buttons: One arrow-button and one of A/B/X/Y.)\n* The last T screenshots, each downscaled to size 32x32 (grayscale, slightly cropped).\n* The last screenshot, at size 64x64 (grayscale, slightly cropped).\n\nT is currently set to 4 (note that this includes the last state of the sequence). 
Screens are captured every 5th frame.\nEach example's output consists of the action reward values of the chosen action (received direct reward + discounted Q-value of the next state).\nThe model can choose two actions per state: One arrow button (up, down, right, left) and one of the other control buttons (A, B, X, Y).\nThis is different from the Atari-model, in which the agent could only pick one button at a time.\n(Without this change, the agent would be unable to make many of the jumps, which require keeping the A button pressed while moving to the right.)\nAs the reward function is constructed in such a way that it is almost never 0, exactly two of each example's output values are expected to be non-zero.\n\n## Reward function\n\nThe agent gets the following rewards:\n* X-Difference reward: `+0.5` if the agent moved to the right, `+1.0` if it moved *fast* to the right (8 pixels or more compared to the last game state), `-1.0` if it moved to the left and `-1.5` if it moved *fast* to the left (-8 pixels or more).\n* Level finished: `+2.0` while the level-finished-animation is playing.\n* Death: `-3.0` while the death animation is playing.\n\nThe `gamma` (discount for expected/indirect rewards) is set to `0.9`.\n\nTraining the model only on score increases (like in the Atari paper) would most likely not work, because enemies respawn when their spawning location moves outside of the screen, so the agent could just kill them again and again, each time increasing its score.\n\n## Error function\n\nA selective MSE is used to train the agent. 
That is, for each example gradients are calculated just like they would be for an MSE.\nHowever, the gradients of all action values are set to 0 if their target reward was 0.\nThat's because each example contains only the received reward for one pair of chosen buttons (arrow button, other button).\nOther pairs of actions would have been possible, but the agent didn't choose them and so the reward for them is unclear.\nTheir reward values (per example) are set to 0, not because they were truly 0, but because we don't know what reward the agent would have received if it had chosen them.\nBackpropagating gradients for them (i.e. if the agent predicts a value unequal to 0) is therefore not reasonable.\n\nThis implementation can afford to differentiate between the chosen and not chosen buttons (in the target vector) based on the reward being unequal to 0, because the received reward of a chosen button is (here) almost never exactly 0 (due to the construction of the reward function).\nOther implementations might need to take more care with this step.\n\n## Policy\n\nThe policy is an epsilon-greedy one, which starts at epsilon=0.8 and anneals that down to 0.1 at the 400k-th chosen action.\nWhenever the policy calls for a random action, the agent flips a coin (i.e. 50:50 chance) and either randomizes one of its two (arrows, other buttons) actions or it randomizes both of them.\n\n## Model architecture\n\nThe model consists of three branches:\n* Action history: Lists previously chosen actions. Added so that the network can e.g. learn that it should release the A-button on the ground sometimes (keeping it pressed non-stop will prevent Mario from jumping). Also added so that the network can learn to keep A pressed for long/high jumps.\n  * This branch just uses one linear hidden layer.\n* Screenshot history: Lists the screenshots of the state chain (including the last state). All screenshots are downscaled to 32x32 (grayscale). 
This branch is intended to let the network spot movements.\n  * This branch uses a few strided convolutional layers.\n  * Some RNN-architecture might be better here.\n* Last screenshot: This branch receives the last state's screenshot in 64x64 (grayscale). It is intended to let the network make in-depth decisions based on the current state.\n  * It has one sub-branch that applies convolutions to the whole image.\n  * It has one sub-branch that applies convolutions to an area of interest, using a Spatial Transformer to extract that area.\n\nAt the end of the branches, everything is merged into one vector and fed through a hidden layer before reaching the output neurons. These output neurons predict the expected reward per pressed button.\n\nOverview of the network:\n\n![Q architecture](images/Q11.png?raw=true \"Q architecture\")\n\nThe Spatial Transformer requires a localization network, which is shown below:\n\n![Localization net architecture](images/localization_net2.png?raw=true \"Localization net architecture\")\n\nTogether, both networks have about 6.6M parameters.\n\n\n# Limitations\n\nThe agent is trained only on the first level (first to the right in the overworld at the start).\nOther levels contain significantly more obstacles that the agent can hardly deal with. Some of these are:\n* Jumping puzzles. The agent will usually just jump to the right and straight into its death.\n* Huge cannon balls. To get past them you have to jump on them or duck under them (big Mario) or walk under them (small Mario). Jumping on top of them is rather hard even for a human novice player. Ducking or walking under them is very hard for the agent due to the epsilon-greedy policy, which will randomly make Mario jump and then instantly die.\n* High walls/tubes. The agent has to *keep* A pressed to get over them. Again, hard to learn and runs contrary to epsilon-greedy.\n* Horizontal tubes. 
These are sometimes located at the end of areas and you are supposed to walk into them to get to the next area. The agent has a tendency to instead jump on them (because it loves to jump) and then keep walking to the right, hitting the wall.\n\nThe first level has hardly any of these difficulties and therefore lends itself to DQN, which is why it is used here.\nTraining on any level and then testing on another one is also rather difficult, because each level seems to introduce new things, like new and quite different enemies or new mechanics (climbing, new items, objects that squeeze you to death, etc.).\n\n\n# Usage\n\n## Basic requirements\n\n* Ubuntu.\n* Quite some time. This is not an easy install.\n* Around 2GB of disk space for the network and replay memory.\n* An NVIDIA GPU with 4+ GB of memory.\n* CUDA. Version 7 or newer should do.\n* CUDNN. Version 4 or newer should do.\n\n\n## Install procedure\n\n* Make sure that you have lua 5.1 installed. I had problems with 5.2 in torch.\n* Make sure that you have gcc 4.9 or higher installed. The emulator will compile happily with gcc \u003c4.9, but then sometimes throw errors when you actually use it.\n* Install torch.\n  * Follow the steps from [torch.ch](http://torch.ch/docs/getting-started.html#_)\n  * Make sure that the following packages are installed (`luarocks install packageName`): `nn`, `cudnn`, `paths`, `image`, `display`. display is usually not part of torch.\n* Install the spatial transformer module for torch:\n  * Clone the stnbhwd repository to some directory: `git clone https://github.com/qassemoquab/stnbhwd.git`\n  * Switch to that directory: `cd stnbhwd`\n  * Compile the module: `luarocks make stnbhwd-scm-1.rockspec`\n* Install SQLite3\n  * `sudo apt-get install sqlite3 libsqlite3-dev`\n  * `luarocks install lsqlite3`\n* Compile the emulator:\n  * Download the source code of [lsnes rr2 beta23](http://tasvideos.org/Lsnes.html). 
**Not version rr1!** (Note that emulators other than lsnes will likely not work with the code in this repository.)\n  * Extract the emulator source code and open the created directory.\n  * Open `source/src/library/lua.cpp` and insert the following code under `namespace {`:\n    ```\n    #ifndef LUA_OK\n    #define LUA_OK 0\n    #endif\n\n    #ifdef LUA_ERRGCMM\n    \tREGISTER_LONG_CONSTANT(\"LUA_ERRGCMM\", LUA_ERRGCMM, CONST_PERSISTENT | CONST_CS);\n    #endif\n    ```\n    This makes the emulator run in lua 5.1. Newer versions (than beta23) of lsnes rr2 might not need this.\n  * Open `source/include/core/controller.hpp` and change the function `do_button_action` from private to public. Simply cut the line `void do_button_action(const std::string\u0026 name, short newstate, int mode);` in the `private:` block and paste it into the `public:` block.\n  * Open `source/src/lua/input.cpp` and before `lua::functions LUA_input_fns(...` (at the end of the file) insert:\n    ```\n    \tint do_button_action(lua::state\u0026 L, lua::parameters\u0026 P)\n    \t{\n    \t\tauto\u0026 core = CORE();\n\n    \t\tstd::string name;\n    \t\tshort newstate;\n    \t\tint mode;\n\n    \t\tP(name, newstate, mode);\n    \t\tcore.buttons-\u003edo_button_action(name, newstate, mode);\n    \t\treturn 1;\n    \t}\n    ```\n    This function is necessary to actually press buttons from custom lua scripts. All of the emulator's default lua functions for that would just never work, because `core.lua2-\u003einput_controllerdata` apparently never gets set (which btw will let these functions silently fail, i.e. without any error).\n  * Again in `source/src/lua/input.cpp`, at the block `lua::functions LUA_input_fns(...`, add `do_button_action` to the lua commands that can be called from lua scripts loaded in the emulator. 
To do that, change the line `{\"controller_info\", controller_info},` to `{\"controller_info\", controller_info}, {\"do_button_action\", do_button_action},`.\n  * Switch back to `source/`.\n  * Compile the emulator with `make`.\n    * You might encounter problems during this step that will require lots of googling to solve. No better way here.\n    * If you encounter problems with portaudio, deactivate it in the file `options.build`.\n    * If you encounter problems with something like libwxgtk, then install package `libwxgtk3.0-dev` and not version 2.8-dev, as that package's official page might tell you to do.\n  * From `source/` execute `sudo cp lsnes /usr/bin/ \u0026\u0026 sudo chown root:root /usr/bin/lsnes`. After that, you can start lsnes by simply typing `lsnes` in a console window.\n* Now create a ramdisk. That will be used to save screenshots from the game (in order to get the pixel values). Do the following:\n  * `sudo mkdir /media/ramdisk`\n  * `sudo chmod 777 /media/ramdisk`\n  * `sudo mount -t tmpfs -o size=128M none /media/ramdisk \u0026\u0026 mkdir /media/ramdisk/mario-ai-screenshots`\n  * Note: You can choose a different path. Then you will have to change `SCREENSHOT_FILEPATH` in `config.lua`.\n  * Note: You don't *have* to use a ramdisk, but your hard drive will probably not like the constant wear from lots of screenshots being saved.\n\n\n## Training\n\n* Clone this repository via `git clone https://github.com/aleju/mario-ai.git`.\n* `cd` into the created directory.\n* Download a Super Mario World (USA) ROM.\n* Start lsnes (from the repository directory) by using `lsnes` in a terminal window.\n* In the emulator, go to `Configure -\u003e Settings -\u003e Advanced` and set the lua memory limit to 1024MB. (Only has to be done once.)\n* Configure your controller buttons (`Configure -\u003e Settings -\u003e Controller`). Play until the overworld pops up. There, move to the right and start that level. 
Play that level a bit and save a handful of states via the emulator's `File -\u003e Save -\u003e State` to the subdirectory `states/train`. The names don't matter, but they have to end in `.lsmv`. (Try to spread the states over the whole level.)\n* Start the display server by opening a command window and using `th -ldisplay.start`. If that doesn't work, you probably haven't installed display yet; use `luarocks install display`.\n* Open the display server output by opening `http://localhost:8000/` in your browser.\n* Now start the training via `Tools -\u003e Run Lua script...` and select `train.lua`.\n* Expected training time: Maybe 10 hours, less with good hardware. (About 0.5M actions.)\n* You can stop the training via `Tools -\u003e Reset Lua VM`.\n* If you want to restart the training from scratch (e.g. for a second run), you will have to delete the files in `learned/`. Note that you can keep the replay memory (`memory.sqlite`) and train a new network with it.\n\nYou can test the model using `test.lua`. Don't expect it to play amazingly well. The agent will still die a lot, even more so if you ended the training on a bad set of parameters.\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faleju%2Fmario-ai","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Faleju%2Fmario-ai","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Faleju%2Fmario-ai/lists"}