# Learning 2048

I design a neural network that uses reinforcement learning to select ideal moves
in the popular game 2048.

## Baseline testing

We create a baseline model that chooses a direction at random. Scores tend to
fluctuate a lot with this model, presumably because a game can end very quickly
or, just as likely, drag on for a long time.
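
A random baseline like this needs only the game rules and a random move picker. The sketch below is illustrative, not this repository's code: the helper names (`slide_row_left`, `move`, `spawn_tile`, `play_random_game`) are hypothetical, it assumes the standard 2048 spawn rule (a 2 with probability 0.9, otherwise a 4), and it has the agent pick uniformly among the currently legal directions so that every turn changes the board.

```python
import random
import statistics

SIZE = 4
DIRECTIONS = ("up", "down", "left", "right")


def slide_row_left(row):
    """Slide and merge one row toward index 0; return (new_row, points)."""
    tiles = [v for v in row if v != 0]
    merged, points, i = [], 0, 0
    while i < len(tiles):
        if i + 1 < len(tiles) and tiles[i] == tiles[i + 1]:
            merged.append(tiles[i] * 2)  # each tile merges at most once per move
            points += tiles[i] * 2
            i += 2
        else:
            merged.append(tiles[i])
            i += 1
    return merged + [0] * (SIZE - len(merged)), points


def transpose(board):
    return [list(col) for col in zip(*board)]


def move(board, direction):
    """Return (new_board, points, changed) without mutating `board`."""
    vertical = direction in ("up", "down")
    reverse = direction in ("down", "right")
    rows = transpose(board) if vertical else [row[:] for row in board]
    new_rows, points = [], 0
    for row in rows:
        slid, gained = slide_row_left(row[::-1] if reverse else row)
        new_rows.append(slid[::-1] if reverse else slid)
        points += gained
    new_board = transpose(new_rows) if vertical else new_rows
    return new_board, points, new_board != board


def spawn_tile(board, rng):
    """Place a 2 (90%) or a 4 (10%) on a random empty square, in place."""
    empty = [(r, c) for r in range(SIZE) for c in range(SIZE) if board[r][c] == 0]
    r, c = rng.choice(empty)
    board[r][c] = 4 if rng.random() < 0.1 else 2


def play_random_game(rng):
    """Play until no legal move remains; return the final score."""
    board = [[0] * SIZE for _ in range(SIZE)]
    spawn_tile(board, rng)
    spawn_tile(board, rng)
    score = 0
    while True:
        legal = [d for d in DIRECTIONS if move(board, d)[2]]
        if not legal:
            return score
        board, gained, _ = move(board, rng.choice(legal))
        score += gained
        spawn_tile(board, rng)  # a legal move always leaves an empty square


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [play_random_game(rng) for _ in range(100)]
    print(f"mean {statistics.mean(scores):.0f}, "
          f"stdev {statistics.stdev(scores):.0f}, "
          f"min {min(scores)}, max {max(scores)}")
```

Comparing the mean against the standard deviation over many seeded games gives a quick sense of how widely the baseline scores spread.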

## Evaluation Techniques

It's useful to keep an eye on the distribution of moves: we want to notice
whether or not the model is converging to prefer the same move every time. I
compare the move distribution of the untrained model with its distribution
after training to see whether its behavior has changed.

In these preliminary stages, I often set an arbitrary reward for choosing one
particular direction. This is a much simpler behavior to learn than the actual
game, so we can tune the model's efficiency and examine how it responds to this
simple reward system before moving to the much more complex real one.

We also deviate from the standard 2048 ruleset by penalizing attempts to make
an illegal move. The intent is that the learner will gradually learn which moves
are legal and which are not. This is also a lower-order behavior for the model
to learn, so the rate at which the model selects illegal moves is another way to
gauge how the network is developing.

## Model Architecture

My initial design is a simple feedforward network with a single hidden layer. I
take the raw numerical values of the board and flatten that 4x4 matrix into an
input vector of size 16. The output of the network is a vector of four values
corresponding to the four possible moves: up, down, left, and right. I select
the move that the model has scored most highly and enact it in the game.

Sometimes a move is impossible, which happens when no tile can move in that
direction. In that case, we select the next best-rated move. If there are no
legal moves, the game is over.

### Improvements

This initial architecture is probably not very good. Improving on it calls for
a few key changes.

We should prefer convolutional layers to linear ones, since they are better
equipped to handle patterns that shift around the board: a convolution
recognizes the same local arrangement of tiles wherever it appears.
However, I can't eschew linear layers entirely, since the placement of tile
groups on the board is very important to higher-level strategy.

The input encoding is far too simple. At minimum, we should use a labeling
system (i.e. $\log_2 x$, which maps tiles 2, 4, 8 to 1, 2, 3) that eliminates
the order-of-magnitude differences between tile values. One-hot encoding of the
inputs is another option, perhaps combined with some unusual convolutional
kernel shapes.

It's hard to say whether these flaws will limit the model's performance on the
baseline tasks, or whether fixing them can wait for later development.

This feedforward network has 5380 parameters (with a 256-unit hidden layer):
- layer 1: 4352 (16 × 256 weights + 256 biases)
- layer 2: 1028 (256 × 4 weights + 4 biases)

## Convolutional Architecture

I design a new convolutional architecture to approach the problem. Instead of
representing tiles with numeric labels, I one-hot encode each tile as one of the
11 possible powers of two. This gives an input tensor of shape 11x4x4, on which
we perform 2-dimensional convolutions with a 3x3 kernel. With a stride of 1 and
padding, we leave the 4x4 image size unchanged through both layers of
convolution and pooling.

After 100 epochs of training, we see a significant increase in performance on
the one-directional task (where the reward comes from choosing one particular
direction). The model learns to favor two directions: the rewarded one and one
other. This shows that the model not only learns to select the highest-scoring
direction, but also learns to extend the game so that it can make more moves in
that direction. Learning to extend the game is a huge step forward for the
model. The "real" scoring behavior of 2048 is quite sparse, but these results
give me reason to believe that this model can improve.

## SmartCNN

Because tiles only slide and merge along rows and columns, diagonal
relationships aren't very important in 2048, so we might see better performance
by focusing exclusively on the salient vertical and horizontal relationships.
This model treats the board state more as a network of connected nodes than as
a coherent grid.

My feature set is similar to that of the previous architecture, but with one
added channel capturing the boolean "emptiness" of the board: a mask marking
which squares are occupied and which are empty. This makes it easier for the
model to see where the next tile may appear. This model also performs a
normalization step that rotates the board so the highest-value corner tile sits
in the upper left-hand corner. This gives the model some invariance with respect
to *which* corner it accumulates tiles in. Because the game is rotationally
symmetric, we can expect the model to learn patterns more quickly under
normalized conditions.

This CNN uses two separate convolutional layers, each with four output features:
one with a (1x2) kernel to capture relationships between horizontally adjacent
tiles, and one with a (2x1) kernel to capture relationships between vertically
adjacent tiles. With these smaller kernels, I no longer pad the board, which
eliminates another weakness of the previous model (padding surrounds the board
with artificial cells that aren't part of the game). Without padding, each layer
sees 12 adjacent pairs (4x3 horizontal, 3x4 vertical), so each produces a (12x4)
matrix. I concatenate these two matrices into a 96-value feature vector, which
is processed by two linear layers.
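
The SmartCNN input pipeline and forward pass can be sketched in NumPy to make the shapes concrete. This is a hypothetical illustration, not the repository's implementation: the names (`normalize`, `to_original_action`, `encode`, `SmartCNNSketch`), the channel layout (channels 0 to 10 for tiles 2 to 2048, channel 11 for the emptiness mask), the ReLU activations, the random untrained weights, and the 256-unit hidden layer (taken from `hidden_size` under Some Results) are all assumptions. Because the network sees a rotated board, the move it picks has to be rotated back before it is played; `to_original_action` is an assumed helper for that step.

```python
import numpy as np

N_POWERS = 11                  # assumed channels 0..10 for tiles 2**1..2**11
N_CHANNELS = N_POWERS + 1      # plus one "is this square empty?" mask channel
ACTIONS = ("up", "left", "down", "right")  # listed in counterclockwise order


def normalize(board):
    """Rotate so the highest-value corner tile is in the upper-left corner.

    Returns the rotated board and k, the number of 90-degree counterclockwise
    turns applied (np.rot90 by k moves corner k of this list to (0, 0)).
    """
    board = np.asarray(board)
    corners = [board[0, 0], board[0, -1], board[-1, -1], board[-1, 0]]
    k = int(np.argmax(corners))
    return np.rot90(board, k), k


def to_original_action(action, k):
    """Undo the rotation for a move chosen on the normalized board."""
    return ACTIONS[(ACTIONS.index(action) - k) % 4]


def encode(board):
    """One-hot tile powers plus an emptiness mask, shape (12, 4, 4)."""
    board = np.asarray(board)
    x = np.zeros((N_CHANNELS,) + board.shape, dtype=np.float32)
    rows, cols = np.nonzero(board)
    powers = np.round(np.log2(board[rows, cols])).astype(int)  # 2 -> 1, 4 -> 2
    x[np.minimum(powers, N_POWERS) - 1, rows, cols] = 1.0  # clip tiles > 2048
    x[N_POWERS] = board == 0
    return x


class SmartCNNSketch:
    """Forward pass only, with random weights: no training loop."""

    def __init__(self, n_features=4, hidden_size=256, seed=0):
        rng = np.random.default_rng(seed)
        # Two taps per kernel: (1x2) pairs horizontal neighbours, (2x1) vertical.
        self.w_h = rng.normal(0.0, 0.1, (n_features, N_CHANNELS, 2))
        self.w_v = rng.normal(0.0, 0.1, (n_features, N_CHANNELS, 2))
        self.b_h = np.zeros(n_features)
        self.b_v = np.zeros(n_features)
        n_in = 2 * 12 * n_features  # two (12 x n_features) maps, concatenated
        self.w1 = rng.normal(0.0, 0.1, (hidden_size, n_in))
        self.b1 = np.zeros(hidden_size)
        self.w2 = rng.normal(0.0, 0.1, (len(ACTIONS), hidden_size))
        self.b2 = np.zeros(len(ACTIONS))

    @staticmethod
    def _pair_conv(w, b, first, second):
        # Unpadded 2-tap convolution: each output mixes two adjacent squares.
        out = (np.einsum("oi,irc->orc", w[:, :, 0], first)
               + np.einsum("oi,irc->orc", w[:, :, 1], second)
               + b[:, None, None])
        return np.maximum(out, 0.0).reshape(len(b), -1).T  # (12, n_features)

    def features(self, x):
        h = self._pair_conv(self.w_h, self.b_h, x[:, :, :-1], x[:, :, 1:])  # 4x3
        v = self._pair_conv(self.w_v, self.b_v, x[:, :-1, :], x[:, 1:, :])  # 3x4
        return np.concatenate([h, v]).ravel()  # no pooling: positions are kept

    def move_scores(self, x):
        hidden = np.maximum(self.w1 @ self.features(x) + self.b1, 0.0)
        return self.w2 @ hidden + self.b2


if __name__ == "__main__":
    board = [[0, 0, 2, 16],
             [0, 2, 4, 8],
             [0, 0, 2, 4],
             [0, 0, 0, 2]]
    rotated, k = normalize(board)
    scores = SmartCNNSketch().move_scores(encode(rotated))
    choice = ACTIONS[int(np.argmax(scores))]
    print(f"rotated {k} quarter-turns; network picks {choice}, "
          f"played as {to_original_action(choice, k)}")
```

Flattening the two (12x4) maps instead of pooling them keeps each feature tied to a specific pair of squares, so the linear layers can tell *where* on the board a relationship occurs.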

One of the key advantages of CNN architectures is invariance to translation.
Here, though, we want to preserve locational relationships so that the model can
make more circumspect decisions, so I omit pooling steps and other aggregations.

## Some Results

With these hyperparameters, I see a steady increase in average game scores for
my model:

- `epsilon = 0.95`: probability of choosing a random action
- `lr = 1e-6`: gradient descent step size
- `batch_size = 256`
- `gamma = 0.99`: discount factor
- `hidden_size = 256`: size of the model's hidden layer
- `memory_size = int(1e5)`: number of moves in the training corpus each epoch
- `training_iterations = int(2e3)`: number of batches to train on
- `epochs = 60`

This is with just one hidden layer.

## Resources

- https://towardsdatascience.com/the-bellman-equation-59258a0d3fa7
- https://www.toptal.com/deep-learning/pytorch-reinforcement-learning-tutorial
- https://www.mit.edu/~adedieu/pdf/2048.pdf
- https://neuro.cs.ut.ee/demystifying-deep-reinforcement-learning/
- https://cs.uwaterloo.ca/~mli/zalevine-dqn-2048.pdf