{"id":17437352,"url":"https://github.com/blixt/gsm8k-math-solver","last_synced_at":"2026-01-22T14:59:54.811Z","repository":{"id":237312876,"uuid":"794267999","full_name":"blixt/gsm8k-math-solver","owner":"blixt","description":"An experiment to solve GSM8K with local models.","archived":false,"fork":false,"pushed_at":"2024-05-23T12:11:09.000Z","size":82,"stargazers_count":2,"open_issues_count":0,"forks_count":0,"subscribers_count":1,"default_branch":"main","last_synced_at":"2025-10-11T15:58:39.909Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":null,"language":"TypeScript","has_issues":false,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/blixt.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-04-30T19:29:34.000Z","updated_at":"2024-11-09T03:09:45.000Z","dependencies_parsed_at":"2024-10-19T11:22:33.165Z","dependency_job_id":"329d35bc-80ef-4ca8-8f8c-46ffcc19fda2","html_url":"https://github.com/blixt/gsm8k-math-solver","commit_stats":null,"previous_names":["blixt/gsm8k-math-solver"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/blixt/gsm8k-math-solver","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blixt%2Fgsm8k-math-solver","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blixt%2Fgsm8k-math-solver/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blixt%2Fgsm8k-math-solver/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blixt%2Fgsm8k-math-solver/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/blixt","download_url":"https://codeload.github.com/blixt/gsm8k-math-solver/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/blixt%2Fgsm8k-math-solver/sbom","scorecard":null,"host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":286080680,"owners_count":28664827,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2026-01-22T14:01:31.714Z","status":"ssl_error","status_checked_at":"2026-01-22T13:59:23.143Z","response_time":144,"last_error":"SSL_connect returned=1 errno=0 peeraddr=140.82.121.6:443 state=error: unexpected eof while reading","robots_txt_status":"success","robots_txt_updated_at":"2025-07-24T06:49:26.215Z","robots_txt_url":"https://github.com/robots.txt","online":false,"can_crawl_api":true,"host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-10-17T11:48:22.272Z","updated_at":"2026-01-22T14:59:54.792Z","avatar_url":"https://github.com/blixt.png","language":"TypeScript","funding_links":[],"categories":[],"sub_categories":[],"readme":"# gsm8k-math-solver\n\nTo run, first make sure you have [Yarn](https://yarnpkg.com/) and\n[Ollama](https://ollama.com/) installed.\n\nThen, clone the repository with submodules:\n\n```sh\ngit clone --recurse-submodules https://github.com/blixt/gsm8k-math-solver.git\n```\n\nInstall dependencies:\n\n```sh\nyarn install\n```\n\nFinally, open the folder in VS Code, go to \"Run and Debug\" and then \"Run\" and\nyou should see output in the Debug Console.\n\nIf you don't use VS Code, just run with `yarn run start`.\n\nYou can also pass a question as an argument to the script, for example:\n\n```sh\nyarn run start \"How many words does this prompt contain?\"\n```\n\nWhich should result in this:\n\n\u003cimg width=\"816\" alt=\"image\" src=\"https://github.com/blixt/gsm8k-math-solver/assets/158591/928af94f-0420-4cdf-83ec-153ebd4bc1d4\"\u003e\n\n## The theory\n\nBy making the LLM write executable JavaScript and also show how that code\nevaluates step by step, we can solve more complex problems because code\nnaturally splits them up into logical units, and the LLM has seen a lot of code\nto be able to predict its behavior.\n\nFurthermore, while the comments contain the answers by the LLM, also being able\nto evaluate the code allows a kind of cross-referencing to verify that the LLM\nis consistent. If there is a mismatch, one could conceivably get the LLM to\nself-correct.\n\n## The reality\n\nIt seems the most recent LLMs (as of writing this, Llama 3, GPT-4 Turbo, Claude\n3 Opus) have learned to navigate multi-step reasoning well enough to\nautomatically answer most of the GSM8K problems as correctly as any framework\nwould. The code that the LLM is encouraged to write with the framework presented\nhere actually does sometimes end up being more correct than what the LLM can\nreason its way to, but the main reason LLMs fail equally with or without this\nframework is interpreting nuances in the problem (such as misinterpreting when a\nvalue should be compounding versus not).\n\nEven less capable LLMs are right some percentage of the time, so it seems\nreasonable to assume that some form of sampling would work, but most likely not\na majority sampling, so probably most frameworks will only be able to raise the\nmodel's capability by a relatively small percentage, while model capabilities\nfrom version to version seem to raise it by a lot more.\n\nFinally, the GSM8K data set is by now mostly learned by LLMs as far as I can\ntell, so it's not even a good benchmark anymore. For example, just asking Llama\n3 70B one of the problems in the test data set of GSM8K, you will not only get a\nreasonable looking answer, you will even get the characteristic `#### 123` line\nat the end which is exactly how solutions in GSM8K are formatted.\n\n## Further investigation\n\nIt would be interesting to see this framework used for other types of problems,\nsuch as using the evaluation comments as a way to get the LLM to verify the code\nthat it writes, say for a unit test which should be easy to run and verify.\n\nThe GSM8K data set contains a lot of algebraic problems, which is not a good\nmatch for JavaScript. Maybe a better idea would have been to use Python with a\nlibrary like SymPy available to the LLM so that it doesn't have to use loops.\n\n## Examples\n\n### Question 1\n\n(680) The elevator in Jack's building supports a maximum load of 700 kg. An\nadult weighs an average of 80 kg. If Jack rides the elevator with 8 other\nadults, by how much will they have exceeded the maximum load of the elevator?\n\n### Solution 1 by Llama 3 70B 🟢\n\n```javascript\nconst maxLoad = 700 // [1] 700\nconst adultWeight = 80 // [2] 80\nconst totalAdults = 9 // [3] 9 (Jack + 8 others)\nconst totalWeight = totalAdults * adultWeight // [4] 9 * 80 === 720\nconst excessLoad = totalWeight - maxLoad // [5] 720 - 700 === 20\nconsole.log(`They will have exceeded the maximum load by ${excessLoad} kg.`) // [6] \"They will have exceeded the maximum load by 20 kg.\"\n```\n\nNote that both the code and the comments have been generated by the LLM. In\ntheory this code can also be run to verify that the LLM \"evaluated\" the code\ncorrectly.\n\n### Question 2\n\n(1103) A jellyfish tank has numerous jellyfish in it. A fifth of the jellyfish\nare large, and a third of the large jellyfish change color from green to blue\nunder UV light. The other jellyfish are small and always stay blue. When a UV\nlight turned on, 6 jellyfish changed color. How many jellyfish are in the tank?\n\n### Solution 2 by Llama 3 70B 🟡\n\nIn this solution, the LLM chose to use a loop to search for the answer, which\nmade the output very verbose. Most likely this could've been done much more\nelegantly. Note that the `(...)` is also generated by the LLM.\n\nNote that it doesn't always solve this question.\n\n```javascript\nlet totalJellyfish = 0; // [1] 0\nwhile (true) { // [2] (true)\n    const largeJellyfish = totalJellyfish / 5; // [3] totalJellyfish / 5 === (totalJellyfish is unknown)\n    const blueLargeJellyfish = largeJellyfish / 3; // [4] largeJellyfish / 3 === (largeJellyfish is unknown)\n    if (blueLargeJellyfish === 6) { // [5] (blueLargeJellyfish === 6) === false\n        break; // [6] n/a\n    }\n    totalJellyfish++; // [7] totalJellyfish + 1 === (totalJellyfish is unknown)\n}\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 15 / 5 === 3\n// [4] blueLargeJellyfish = 3 / 3 === 1\n// [5] if (1 === 6) === false\n// [6] n/a\n// [7] totalJellyfish = 15 + 1 === 16\n// (...)\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 30 / 5 === 6\n// [4] blueLargeJellyfish = 6 / 3 === 2\n// [5] if (2 === 6) === false\n// [6] n/a\n// [7] totalJellyfish = 30 + 1 === 31\n// (...)\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 45 / 5 === 9\n// [4] blueLargeJellyfish = 9 / 3 === 3\n// [5] if (3 === 6) === false\n// [6] n/a\n// [7] totalJellyfish = 45 + 1 === 46\n// (...)\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 60 / 5 === 12\n// [4] blueLargeJellyfish = 12 / 3 === 4\n// [5] if (4 === 6) === false\n// [6] n/a\n// [7] totalJellyfish = 60 + 1 === 61\n// (...)\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 75 / 5 === 15\n// [4] blueLargeJellyfish = 15 / 3 === 5\n// [5] if (5 === 6) === false\n// [6] n/a\n// [7] totalJellyfish = 75 + 1 === 76\n// (...)\n// Loop continues:\n// [2] while (true) === true\n// [3] largeJellyfish = totalJellyfish / 5 === 90 / 5 === 18\n// [4] blueLargeJellyfish = 18 / 3 === 6\n// [5] if (6 === 6) === true\n// [6] break\n// Loop ended.\nconsole.log(`There are ${totalJellyfish} jellyfish.`); // [8] \"There are 90 jellyfish.\"\n```\n\n### Question 3\n\n(1176) A teacher uses a 5-inch piece of chalk to write math equations on a\nchalkboard for his students. The teacher likes to conserve chalk, so he tries to\nonly use 20% of the chalk each day. Since the teacher cannot write with a very\nsmall piece of chalk, he recycles the chalk when it is smaller than 2 inches. On\nMonday the teacher used a new piece of chalk. His students need extra help that\nday, so he ended up writing more than usual. He used up 45% of the chalk by the\nend of the day. If the teacher goes back to using only 20% of the chalk each\nday, how many days does he have before he has to recycle this piece?\n\n### Solution 3 by Llama 3 70B 🔴\n\nWhile the code below does output the correct answer (2 days), the LLM failed to\n\"evaluate\" the code correctly. Furthermore, the LLM often fails to solve this\nproblem as it tends to miss that the chalk length should decrease by 20% of its\ncurrent length, not the original length.\n\n```javascript\nconst originalChalkLength = 5 // [1] 5\nlet chalkLength = originalChalkLength // [2] 5\nchalkLength *= (1 - 0.45) // [3] 5 * (1 - 0.45) === 2.75\nlet daysBeforeRecycling = 0 // [4] 0\n\nwhile (chalkLength \u003e= 2) { // [5] while (2.75 \u003e= 2) === true\n    chalkLength *= (1 - 0.20) // [6] 2.75 * (1 - 0.20) === 2.2\n    daysBeforeRecycling++ // [7] 0 + 1 === 1\n    // Loop continues:\n    // [5] while (2.2 \u003e= 2) === true\n    // [6] 2.2 * (1 - 0.20) === 1.76\n    // [7] 1 + 1 === 2\n    // (...);\n    // Loop continues:\n    // [5] while (1.42 \u003e= 2) === false\n    // [8] break;\n    // Loop ended.\n}\nconsole.log(`The teacher has ${daysBeforeRecycling} days before he has to recycle this piece.`); // [9] \"The teacher has 3 days before he has to recycle this piece.\"\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblixt%2Fgsm8k-math-solver","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fblixt%2Fgsm8k-math-solver","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fblixt%2Fgsm8k-math-solver/lists"}