**Status**: Archive (code is provided as-is, no updates expected)

# Grade School Math

#### [[Blog Post]](https://openai.com/blog/grade-school-math/) [[Paper]](https://arxiv.org/abs/2110.14168)

State-of-the-art language models can match human performance on many tasks, but they still struggle to robustly perform multi-step mathematical reasoning. To diagnose the failures of current models and support research, we're releasing GSM8K, a dataset of 8.5K high-quality, linguistically diverse grade school math word problems. We find that even the largest transformer models fail to achieve high test performance, despite the conceptual simplicity of this problem distribution.

<p align="center">
    <img src="grade_school_math/img/example_problems.png" height="300"/>
</p>

## Dataset Details

GSM8K consists of 8.5K high-quality grade school math problems created by human problem writers. We segmented these into 7.5K training problems and 1K test problems. These problems take between 2 and 8 steps to solve, and solutions primarily involve performing a sequence of elementary calculations using basic arithmetic operations (+ - / \*) to reach the final answer.
A bright middle school student should be able to solve every problem.

The raw data files can be found in:

- `grade_school_math/data/train.jsonl`
- `grade_school_math/data/test.jsonl`

Each line of those files corresponds to a single grade school math problem, saved as a json dictionary (with a "question" key and an "answer" key). The answer is formatted such that it uses calculation annotations and so that the final numeric solution is the final line of the solution, preceded by `####`.

### Calculation Annotations

Our models frequently fail to accurately perform calculations. Although larger models make fewer arithmetic mistakes than smaller models, this remains a common source of errors. To mitigate this issue, we train our models to use a calculator by injecting calculation annotations into the training set. At training time, we simply finetune on this language data as is. At test time, a calculator will override sampling when the model chooses to use these annotations. An example implementation of the calculator sampling can be found in `calculator.py`.

If you would like to remove the calculator annotations, simply remove any string that starts with `<<` and ends with `>>`.

### Solution Extracting

To extract the final numeric solution for a particular question, simply parse the completion to extract the numeric value immediately following the `####` token. Some example python code to do so is shown in `dataset.py:is_correct`.

### Socratic Dataset

During our research, we also investigated a modified solution format that injects automatically generated "Socratic subquestions" before each step. Although we ultimately did not use this format for any experiments in the paper, we make this data available to anyone who is interested.

We show an example below, with the socratic subquestions in bold:

<pre>
A carnival snack booth made $50 selling popcorn each day.
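The three operations described above (loading the JSONL files, stripping `<<...>>` calculator annotations, and extracting the value after `####`) can be sketched as follows. This is a minimal illustration with our own helper names; the repository's reference implementation for answer checking lives in `dataset.py:is_correct`.

```python
import json
import re


def load_jsonl(path):
    """Read one JSON object per line; each has a "question" and an "answer" key."""
    with open(path) as f:
        return [json.loads(line) for line in f]


def strip_calculator_annotations(answer):
    """Remove any string that starts with << and ends with >>."""
    return re.sub(r"<<[^>]*>>", "", answer)


def extract_final_answer(answer):
    """Return the numeric value immediately following the #### token, or None."""
    match = re.search(r"####\s*(-?[\d,.]+)", answer)
    return match.group(1).replace(",", "") if match else None


# A tiny made-up record in the same shape as a line of train.jsonl:
example = {
    "question": "Tom has 3 boxes of 4 apples. How many apples does he have?",
    "answer": "Tom has 3 * 4 = <<3*4=12>>12 apples.\n#### 12",
}
print(strip_calculator_annotations(example["answer"]))
print(extract_final_answer(example["answer"]))  # -> 12
```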
It made three times as much selling cotton candy. For a 5-day activity, the booth has to pay $30 rent and $75 for the cost of the ingredients. How much did the booth earn for 5 days after paying the rent and the cost of ingredients?
<b>How much did the booth make selling cotton candy each day? **</b> The booth made $50 x 3 = $<<50*3=150>>150 selling cotton candy each day.
<b>How much did the booth make in a day? **</b> In a day, the booth made a total of $150 + $50 = $<<150+50=200>>200.
<b>How much did the booth make in 5 days? **</b> In 5 days, they made a total of $200 x 5 = $<<200*5=1000>>1000.
<b>How much did the booth have to pay? **</b> The booth has to pay a total of $30 + $75 = $<<30+75=105>>105.
<b>How much did the booth earn after paying the rent and the cost of ingredients? **</b> Thus, the booth earned $1000 - $105 = $<<1000-105=895>>895.
</pre>

We generated each Socratic subquestion by conditioning on each ground truth (contractor-provided) step in a solution, using a model specifically finetuned for this task (on around 800 examples). To construct the full Socratic dataset, each step in the solution was prefixed by the model-generated Socratic subquestion. Steps were otherwise left untouched.

These data files can be found in:

- `grade_school_math/data/train_socratic.jsonl`
- `grade_school_math/data/test_socratic.jsonl`

## View Model Solutions

For each test question, we provide solutions generated from 6B finetuning, 6B verification, 175B finetuning and 175B verification.
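A Socratic-format solution can be split back into (subquestion, step) pairs. The sketch below assumes each step line uses ` ** ` to separate the subquestion from the step, as shown in the example above; the helper name is our own.

```python
def split_socratic_steps(answer):
    """Split a Socratic-format solution into (subquestion, step) pairs.

    Assumes each step line has the form "<subquestion> ** <step>";
    the final "#### <answer>" line is skipped.
    """
    pairs = []
    for line in answer.splitlines():
        if line.startswith("####") or " ** " not in line:
            continue
        subquestion, step = line.split(" ** ", 1)
        pairs.append((subquestion.strip(), step.strip()))
    return pairs


# A made-up single-step example in the Socratic format:
answer = (
    "How many apples per box? ** There are 12 / 3 = <<12/3=4>>4 apples per box.\n"
    "#### 4"
)
print(split_socratic_steps(answer))
```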
This data can be found in:

- `grade_school_math/data/example_model_solutions.jsonl`

To view these results problem-by-problem, run:

```bash
python view_model_solutions.py
```

Note: These model-generated samples used a slightly older version of the calculator. Previous implementation bugs led to calculator failures in roughly 1% of model samples. Those issues have been fixed in the codebase, but since the samples have not been regenerated, occasional calculation errors are present.

## Citation

Please use the below BibTeX entry to cite this dataset:

```
@article{cobbe2021gsm8k,
  title={Training Verifiers to Solve Math Word Problems},
  author={Cobbe, Karl and Kosaraju, Vineet and Bavarian, Mohammad and Chen, Mark and Jun, Heewoo and Kaiser, Lukasz and Plappert, Matthias and Tworek, Jerry and Hilton, Jacob and Nakano, Reiichiro and Hesse, Christopher and Schulman, John},
  journal={arXiv preprint arXiv:2110.14168},
  year={2021}
}
```

# Usage

We present a basic example of training a GPT2-sized model and using the calculator in the sampling process. We include this code for illustrative purposes only. This pipeline was not used for any experiments in the paper.

**Training a Model**

```bash
python train.py
```

**Sampling from the Model**

```bash
python sample.py
```

The core calculator sampling logic can be found in `calculator.py:sample`. Note that this code is inefficient as implemented. Specifically, the function does not support batches, and does not cache activations from previous tokens.
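The calculator-override idea can be sketched schematically as below. This is not the repository's `calculator.py:sample`; it is a toy illustration in which a hypothetical `model_step` callable stands in for a real language model's token sampler, and a pending `<<expr=` annotation is completed by computing `expr` instead of sampling.

```python
import re


def calculator_override_sample(model_step, prompt, max_tokens=50):
    """Sample text, overriding the model whenever it opens a calculation.

    model_step(text) -> next token string, or None to stop (a hypothetical
    stand-in for a real language model).  When the text so far ends with
    "<<expr=", the expression is evaluated and "result>>" is spliced in,
    overriding whatever the model would have emitted.
    """
    text = prompt
    for _ in range(max_tokens):
        match = re.search(r"<<([^<>=]+)=$", text)
        if match:
            # Override sampling: compute the pending expression ourselves.
            result = eval(match.group(1))  # toy only; fine for trusted arithmetic
            text += f"{result}>>"
        else:
            token = model_step(text)
            if token is None:
                break
            text += token
    return text


# A scripted fake "model" that opens a calculation and then continues:
tokens = iter(["<<2+3=", " done"])
print(calculator_override_sample(lambda text: next(tokens, None), ""))
# -> <<2+3=5>> done
```

The real implementation works at the level of model logits and cached activations; this sketch only shows the control flow of letting a calculator preempt sampling.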