{"id":21223472,"url":"https://github.com/hkust-nlp/dart-math","last_synced_at":"2025-07-10T14:30:42.901Z","repository":{"id":242755891,"uuid":"807526338","full_name":"hkust-nlp/dart-math","owner":"hkust-nlp","description":"[NeurIPS'24] Official code for *🎯DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving*","archived":false,"fork":false,"pushed_at":"2024-12-10T04:55:00.000Z","size":4383,"stargazers_count":100,"open_issues_count":3,"forks_count":4,"subscribers_count":0,"default_branch":"main","last_synced_at":"2025-04-05T14:22:12.515Z","etag":null,"topics":["deep-learning","llm","llm-evaluation","llm-inference","llm-training","mathematics","nlp"],"latest_commit_sha":null,"homepage":"https://hkust-nlp.github.io/dart-math/","language":"Jupyter Notebook","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":"mit","status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/hkust-nlp.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":"LICENSE","code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-05-29T09:13:03.000Z","updated_at":"2025-04-01T08:48:26.000Z","dependencies_parsed_at":"2024-07-24T02:47:33.888Z","dependency_job_id":"f028d803-739b-4ae8-b718-c873e1c33981","html_url":"https://github.com/hkust-nlp/dart-math","commit_stats":null,"previous_names":["hkust-nlp/dart-math"],"tags_count":0,"template":false,"template_full_name":null,"purl":"pkg:github/hkust-nlp/dart-math","repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Fdart-math","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Fdart-math/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/h
kust-nlp%2Fdart-math/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Fdart-math/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/hkust-nlp","download_url":"https://codeload.github.com/hkust-nlp/dart-math/tar.gz/refs/heads/main","sbom_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/hkust-nlp%2Fdart-math/sbom","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":264590732,"owners_count":23633625,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":["deep-learning","llm","llm-evaluation","llm-inference","llm-training","mathematics","nlp"],"created_at":"2024-11-20T22:52:02.476Z","updated_at":"2025-07-10T14:30:42.894Z","avatar_url":"https://github.com/hkust-nlp.png","language":"Jupyter Notebook","funding_links":[],"categories":["SFT Statistics"],"sub_categories":["Code \u0026 Math"],"readme":"# 🎯DART-Math\n\n\n\u003c!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! 
--\u003e\n\n\u003e Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving\n\u003e \\[NeurIPS 2024\\]\n\u003e\n\u003e [Yuxuan Tong](https://tongyx361.github.io), Xiwen Zhang, Rui Wang,\n\u003e Ruidong Wu, [Junxian He](https://jxhe.github.io)\n\n📝 [Paper@arXiv](https://arxiv.org/abs/2407.13690) \\| 🤗\n[Datasets\u0026Models@HF](https://huggingface.co/collections/hkust-nlp/dart-math-665704599b35de59f8fdf6c1)\n\\| 🐱 [Code@GitHub](https://github.com/hkust-nlp/dart-math) \\| 💡\n[Slides](https://docs.google.com/presentation/d/1ZBPsM5Ww3XbQo3zAE6y-lpfsWLsK6cAbCsF6lbNHDY4/edit?usp=sharing)\n\\| 🏆 [Published@NeurIPS\n2024](https://nips.cc/virtual/2024/poster/92959)\n\n🐦\n[Thread@X(Twitter)](https://x.com/tongyx361/status/1811413243350454455)\n\\| 🐶 [中文博客@知乎](https://zhuanlan.zhihu.com/p/708371895) \\| 📊\n[Leaderboard@PapersWithCode](https://paperswithcode.com/paper/dart-math-difficulty-aware-rejection-tuning#results)\n\\| 📑\n[BibTeX](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#%EF%B8%8F-citation)\n\n\u003e \\[!IMPORTANT\\]\n\u003e\n\u003e 🔥 **News!!!**\n\u003e\n\u003e - \\[2024/09/25\\] 🎉 *DART-Math* is accepted to [*NeurIPS\n\u003e   2024*](https://nips.cc/virtual/2024/poster/92959)!\n\u003e - \\[2024/07/21\\] Excited to find **our [`DART-Math-DSMath-7B`\n\u003e   (Prop2Diff)](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-prop2diff)\n\u003e   [comparable](https://github.com/project-numina/aimo-progress-prize/blob/main/report/numina_dataset.pdf)\n\u003e   to the AIMO winner\n\u003e   [NuminaMath-7B](https://huggingface.co/AI-MO/NuminaMath-7B-CoT)** on\n\u003e   CoT, but based solely on\n\u003e   [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info)\n\u003e   \u0026\n\u003e   [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info)\n\u003e   prompt set, leaving much room to improve! 
Besides, our [`DART`\n\u003e   method](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#dars--difficulty-aware-rejection-sampling)\n\u003e   is also fully [compatible with tool-integrated\n\u003e   reasoning](https://github.com/hkust-nlp/dart-math?tab=readme-ov-file#tool-integrated-reasoning-reasoning-in-natural-language-interleaved-with-python-code).\n\u003e   Join the discussion under this [X\n\u003e   thread](https://x.com/tongyx361/status/1815112376649134172)!\n\n![](https://neurips.cc/media/PosterPDFs/NeurIPS%202024/92959.png)\n\n\u003cdiv align=\"center\"\u003e\n\n\u003cimg src=\"https://tongyx361.github.io/assets/dart-math/main-results.png\" alt=\"Main results averaged on 2 in-domain and 4 challenging out-of-domain mathematical reasoning benchmarks.\" height=300px\u003e\n\u003cimg src=\"https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png\" alt=\"Number of responses v.s. query descending in difficulty in DART-Math datasets and similar-sized VRT baseline\" height=300px\u003e\n\n\u003c/div\u003e\n\n\u003cdiv align=\"left\"\u003e\n\n\u003csup\u003e Figure 1: \u003cstrong\u003eLeft:\u003c/strong\u003e Average accuracy on 6\nmathematical benchmarks. We compare with models fine-tuned on the best,\npublic instruction tuning datasets for mathematical problem-solving:\nMetaMath \u003ca href=\"https://openreview.net/forum?id=N8N0hgNDRt\"\u003e(Yu et\nal., 2024)\u003c/a\u003e with 395K examples, MMIQC\n\u003ca href=\"https://arxiv.org/abs/2401.09003\"\u003e(Liu et al., 2024a)\u003c/a\u003e with\n2.3 million examples, as well as vanilla rejection tuning (VRT) with\n590K examples. Both \u003cem\u003eDART-Math (Uniform)\u003c/em\u003e and \u003cem\u003eDART-Math\n(Prop2Diff)\u003c/em\u003e use 590K training examples. \u003cstrong\u003eRight:\u003c/strong\u003e\nNumber of responses for each query descending by difficulty across 3\nsynthesis strategies. 
Queries are from the MATH training split\n\u003ca href=\"https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/be83ab3ecd0db773eb2dc1b0a17836a1-Abstract-round2.html\"\u003e(Hendrycks\net al., 2021)\u003c/a\u003e. VRT is the baseline biased towards easy queries,\nwhile \u003cem\u003eUniform\u003c/em\u003e and \u003cem\u003eProp2Diff\u003c/em\u003e are proposed in this work\nto balance and bias towards difficult queries respectively. Points are\nslightly shifted and downsampled for clarity. \u003c/sup\u003e\n\n\u003c/div\u003e\n\n| Dataset | Setting | \\# of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/dsets/mwpbench/college-math-test.jsonl) | Download |\n|:---|:---|---:|---:|---:|---:|:--:|\n| `DART-Math-Uniform` | Uniform | 591k | 52.9 | **88.2** | 40.1 | 🤗 [HuggingFace](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |\n| `DART-Math-Hard` | Prop2Diff | 585k | **53.6** | 86.8 | **40.7** | 🤗 [HuggingFace](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |\n| `DART-Math-Pool-MATH` | – | 1615k | – | – | – | 🤗 [HuggingFace](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math) |\n| `DART-Math-Pool-GSM8K` | – | 2739k | – | – | – | 🤗 [HuggingFace](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k) |\n\n\u003csup\u003eMATH and GSM8K are **in-domain**, while College(Math) is\n**out-of-domain**. 
Results here are for `DART-Math` models fine-tuned\nfrom\n[DeepSeekMath-7B](https://huggingface.co/deepseek-ai/deepseek-math-7b-base).\n**Bold** means the best score on the respective base model here.\u003c/sup\u003e\n\n| Model | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [CollegeMath](https://github.com/hkust-nlp/dart-math/tree/main/data/dsets/mwpbench/college-math-test.jsonl) | Download |\n|:---|---:|---:|---:|:--:|\n| `DART-Math-Llama3-70B` (Uniform) | 54.9 | **90.4** | **38.5** | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-llama3-70b-uniform) |\n| `DART-Math-Llama3-70B` (Prop2Diff) | **56.1** | 89.6 | 37.9 | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-llama3-70b-prop2diff) |\n| `DART-Math-DSMath-7B` (Uniform) | 52.9 | **88.2** | 40.1 | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-uniform) |\n| `DART-Math-DSMath-7B` (Prop2Diff) | **53.6** | 86.8 | **40.7** | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-prop2diff) |\n| `DART-Math-Mistral-7B` (Uniform) | 43.5 | **82.6** | 26.9 | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-mistral-7b-uniform) |\n| `DART-Math-Mistral-7B` (Prop2Diff) | **45.5** | 81.1 | **29.4** | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-mistral-7b-prop2diff) |\n| `DART-Math-Llama3-8B` (Uniform) | 45.3 | **82.5** | 27.1 | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-llama3-8b-uniform) |\n| `DART-Math-Llama3-8B` (Prop2Diff) | **46.6** | 81.1 | **28.8** | 🤗 [HuggingFace](https://huggingface.co/hkust-nlp/dart-math-llama3-8b-prop2diff) |\n\n\u003csup\u003eMATH and GSM8K are \u003cb\u003ein-domain\u003c/b\u003e, while CollegeMath is\n\u003cb\u003eout-of-domain\u003c/b\u003e. 
**Bold** means the best score on the respective\nbase model here.\u003c/sup\u003e\n\n## `DART-Math` Models: SOTA on Various In-Domain and Out-of-Domain Benchmarks\n\n`DART-Math` models achieve performance **superior or competitive to\nprevious SOTAs** on 2 in-domain and 4 challenging out-of-domain\nmathematical reasoning benchmarks, despite using **much smaller\ndatasets** and **no proprietary model like GPT-4**.\n\n| Model | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | [DM](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/deepmind-mathematics.json) | [Olympiad](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/olympiadbench/OE_TO_maths_en_COMP.json) | [Theorem](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/theoremqa.json) | AVG |\n|:---|---:|---:|---:|---:|---:|---:|---:|\n| GPT-4 (0314) | [52.6](https://arxiv.org/abs/2403.04706) | [94.7](https://arxiv.org/abs/2403.04706) | [24.4](https://arxiv.org/abs/2403.02884) | – | – | – | – |\n| Llama3-70B-MetaMath | 44.9 | 88.0 | 31.9 | 53.2 | 11.6 | 21.9 | 41.9 |\n| [`DART-Math-Llama3-70B`](https://huggingface.co/hkust-nlp/dart-math-llama3-70b-prop2diff) | **56.1** | **89.6** | **37.9** | **64.1** | **20.0** | **28.2** | **49.3** |\n| DeepSeekMath-7B-MetaMath | 43.7 | 81.8 | 33.7 | 53.0 | 13.6 | 23.2 | 41.5 |\n| [DeepSeekMath-7B-RL](https://huggingface.co/deepseek-ai/deepseek-math-7b-rl) | 53.1 | 88.4 | 41.3 | 58.3 | 18.7 | 35.9 | 49.3 |\n| [`DART-Math-DSMath-7B`](https://huggingface.co/hkust-nlp/dart-math-dsmath-7b-prop2diff) | **53.6** | **86.8** | **40.7** | **61.6** | **21.7** | **32.2** | **49.4** |\n| Mistral-7B-MetaMath | 29.8 | 76.5 | 19.3 | 28.0 | 5.9 | 14.0 | 28.9 |\n| [`DART-Math-Mistral-7B`](https://huggingface.co/hkust-nlp/dart-math-mistral-7b-prop2diff) | **45.5** | **81.1** | 
**29.4** | **45.1** | **14.7** | **17.0** | **38.8** |\n| Llama3-8B-MetaMath | 32.5 | 77.3 | 20.6 | 35.0 | 5.5 | 13.8 | 30.8 |\n| [`DART-Math-Llama3-8B`](https://huggingface.co/hkust-nlp/dart-math-llama3-8b-prop2diff) | **46.6** | **81.1** | **28.8** | **48.0** | **14.5** | **19.4** | **39.7** |\n\n\u003csup\u003e**Abbreviations**: College (CollegeMath), DM (DeepMind\nMathematics), Olympiad (OlympiadBench-Math), Theorem (TheoremQA).\n**Bold** means the best score by SFT on the respective base model here.\n`DART-Math` models here are fine-tuned on the [`DART-Math-Hard`\ndataset](https://huggingface.co/datasets/hkust-nlp/dart-math-hard).\u003c/sup\u003e\n\n## `DART-Math` Datasets: SOTA \u0026 Data-Efficient \u0026 Open-Source\n\nThe `DART-Math` datasets are **state-of-the-art**, **data-efficient**\nand **open-source** instruction tuning datasets for mathematical reasoning.\n\nMost previous datasets are **constructed with ChatGPT**, and many of\nthem are **not open-source**, especially the best-performing\nones.\n\n| Math SFT Dataset | \\# of Samples | [MATH](https://huggingface.co/datasets/hendrycks/competition_math) | [GSM8K](https://huggingface.co/datasets/gsm8k) | [College](https://github.com/hkust-nlp/dart-math/tree/main/data/eval-dsets/mwpbench/college-math-test.jsonl) | Synthesis Agent(s) | Open-Source |\n|:---|---:|---:|---:|---:|:---|:--:|\n| [WizardMath](https://arxiv.org/abs/2308.09583) | 96k | 32.3 | 80.4 | 23.1 | GPT-4 | ✗ |\n| [MetaMathQA](https://arxiv.org/abs/2309.12284) | 395k | 29.8 | 76.5 | 19.3 | GPT-3.5 | [✓](https://huggingface.co/datasets/meta-math/MetaMathQA) |\n| [MMIQC](https://arxiv.org/abs/2401.09003) | **2294k** | 37.4 | 75.4 | *28.5* | **GPT-4+GPT-3.5+Human** | [**✓**](https://huggingface.co/datasets/Vivacem/MMIQC) |\n| [Orca-Math](https://arxiv.org/abs/2402.14830) | 200k | – | – | – | GPT-4 | [✓](https://huggingface.co/datasets/microsoft/orca-math-word-problems-200k) |\n| [Xwin-Math-V1.1](https://arxiv.org/abs/2403.04706) | **1440k** | 
*45.5* | **84.9** | 27.6 | **GPT-4** | **✗** |\n| [KPMath-Plus](https://arxiv.org/abs/2403.02333) | **1576k** | **46.8** | 82.1 | – | **GPT-4** | **✗** |\n| [MathScaleQA](https://arxiv.org/abs/2403.02884) | 2021k | 35.2 | 74.8 | 21.8 | GPT-3.5+Human | ✗ |\n| [`DART-Math-Uniform`](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) | **591k** | 43.5 | *82.6* | 26.9 | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-uniform) |\n| [`DART-Math-Hard`](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) | **585k** | *45.5* | 81.1 | **29.4** | **DeepSeekMath-7B-RL** | [**✓**](https://huggingface.co/datasets/hkust-nlp/dart-math-hard) |\n\n\u003csup\u003eMATH and GSM8K are **in-domain**, while College(Math) is\n**out-of-domain**. Results here are for models fine-tuned from\n[Mistral-7B](https://huggingface.co/mistralai/Mistral-7B-v0.1), except\nfor Xwin-Math-V1.1, which is based on\n[Llama2-7B](https://huggingface.co/meta-llama/Llama-2-7b-hf).\n**Bold**/*Italic* means the best/second best score here.\u003c/sup\u003e\n\n## `DARS` – Difficulty-Aware Rejection Sampling\n\nOur analysis of previous datasets reveals **severe biases towards easy\nqueries**, with **frequent failures to generate any correct response for\nthe most challenging queries**.\n\nThis primarily arises from their construction method, **vanilla rejection\nsampling**, where **the same number** of responses is sampled for each\nquery, yet the likelihood of obtaining correct responses for difficult\nqueries is significantly lower, sometimes even zero.\n\nMotivated by the observation above and the intuition that difficult\nsamples are critical for learning complex reasoning, we propose\n**Difficulty-Aware Rejection Sampling** (`DARS`) to eliminate the bias\ntowards easy queries. Specifically, we introduce two strategies to\nincrease the number of correct responses for difficult queries:\n\n1.  
**Uniform**, which involves sampling responses for each query until\n    **each query accumulates $k_u$ correct responses**, where $k_u$ is a\n    preset hyperparameter determined by the desired size of the\n    synthetic dataset;\n2.  **Prop2Diff**, where we continue sampling responses until the number\n    of correct responses for each query is **proportional to its\n    difficulty score**. The most challenging queries will receive $k_p$\n    responses, where $k_p$ is a hyperparameter. This method introduces a\n    deliberate bias in the opposite direction to vanilla rejection\n    sampling, towards more difficult queries, inspired by previous works\n    that demonstrate **difficult samples can be more effective at\n    enhancing model capabilities** ([Sorscher et al.,\n    2022](https://proceedings.neurips.cc/paper_files/paper/2022/hash/7b75da9b61eda40fa35453ee5d077df6-Abstract-Conference.html);\n    [Liu et al., 2024b](https://openreview.net/forum?id=BTKAeLqLMw)).\n\nSee [Figure 1\n(Right)](https://tongyx361.github.io/assets/dart-math/main-nresp-vs-query.png)\nfor examples of `DART-Math-Uniform` by `DARS-Uniform` and\n`DART-Math-Hard` by `DARS-Prop2Diff`.\n\n## 🚀 Quick Start / Reproduction\n\n### ⚙️ Setup\n\nWe recommend using [Conda](https://docs.conda.io/projects/miniconda) and\n[pip](https://pip.pypa.io/en/stable/#) to manage your environment. 
Run\nthe following commands to set up your environment:\n\n``` shell\ngit clone https://github.com/hkust-nlp/dart-math.git \u0026\u0026 cd dart-math\nconda create --name dart-math --yes python=3.11\nconda activate dart-math\npip install -r requirements.txt\npip install flash-attn --no-build-isolation\n```\n\nFor typical users/developers, please just run the following command to\ninstall the `dart-math` package:\n\n``` shell\npip install -e \".\"\n```\n\nFor intended contributors, we recommend installing the package with the\n`dev` extras:\n\n``` shell\npip install -e \".[dev]\"\npre-commit install\nconda install quarto -c conda-forge # for building the documentation\n```\n\n### 🔨 Training\n\nWe implement an efficient training pipeline utilizing various\ntechniques. Notably, [**sequence\npacking**](https://hkust-nlp.github.io/dart-math/train.html#sequence-packing)\naccelerates training by 6-8x in our setting and possibly more in other\nsettings. (See [how to integrate sequence packing in 4 lines of\ncode](https://hkust-nlp.github.io/dart-math/train.html#accelerating-several-times-with-sequence-packing-in-4-lines-of-code).)\n\nPlease refer to\n\n- the [training Python\n  script](https://github.com/hkust-nlp/dart-math/blob/main/pipeline/train.py)\n  for code of training based on the [HuggingFace\n  `Trainer`](https://huggingface.co/docs/transformers/en/main_classes/trainer)\n  and utilizing [sequence\n  packing](https://hkust-nlp.github.io/dart-math/train.html#sequence-packing).\n- the\n  [single-node](https://github.com/hkust-nlp/dart-math/blob/main/scripts/train-single-node.sh)/[multi-node](https://github.com/hkust-nlp/dart-math/blob/main/scripts/train-multi-node.sh)\n  training `bash` script for code of training based on [HuggingFace\n  `accelerate`](https://huggingface.co/docs/accelerate/index) and\n  [`deepspeed`](https://www.deepspeed.ai)\n\nHere, we provide some example commands as well as reproduction\ninstructions for our work:\n\n#### Single-Node 
Training\n\nFor example, to reproduce training `DART-Math-Llama3-8B-Prop2Diff` on a\nnode of 8 A100 GPUs, please run the following command:\n\n``` shell\nbash scripts/train-single-node.sh \\\n    --data_path \"hkust-nlp/dart-math-hard\" \\\n    --model_path \"meta-llama/Meta-Llama-3-8B\" \\\n    --lr \"5e-5\" --bs 64 --n_grad_acc_steps 1 --n_epochs 1 \\\n    --gpu_ids \"0,1,2,3,4,5,6,7\" \\\n    --output_dir \"models/dart-math-llama3-8b-prop2diff\"\n```\n\nTo reproduce other training settings, just refer to the paper and modify\nthe `--data_path`, `--model_path`, `--lr`, `--n_grad_acc_steps`,\n`--n_epochs` and `--output_dir` arguments accordingly.\n\n#### Multi-Node Training\n\nTo reproduce training `DART-Math-Llama3-70B-Prop2Diff` on 4 nodes of 8\nA100 GPUs, please first edit the `cfgs/deepspeed/hostfile` according to\nyour environment and then run the following command:\n\n``` shell\nbash scripts/train-multi-node.sh \\\n    --data_path \"hkust-nlp/dart-math-hard\" \\\n    --model_path \"meta-llama/Meta-Llama-3-70B\" \\\n    --lr \"2e-5\" --bs 64 --n_grad_acc_steps 1 --n_epochs 1 \\\n    --n_nodes 4 \\\n    --output_dir \"models/dart-math-llama3-70b-prop2diff\"\n```\n\nTo reproduce training `DART-Math-Llama3-70B-Uniform` on 4 nodes of 8\nA100 GPUs, just change `--data_path` to `\"hkust-nlp/dart-math-uniform\"`.\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\n\nThe off-the-shelf command to train `DART-Math-Llama3-70B-Uniform`\n\u003c/summary\u003e\n\n``` shell\nbash scripts/train-multi-node.sh \\\n    --data_path \"hkust-nlp/dart-math-uniform\" \\\n    --model_path \"meta-llama/Meta-Llama-3-70B\" \\\n    --lr \"2e-5\" --bs 64 --n_grad_acc_steps 1 --n_epochs 1 \\\n    --n_nodes 4 \\\n    --output_dir \"models/dart-math-llama3-70b-uniform\"\n```\n\n\u003c/details\u003e\n\n### ⚖️ Evaluation\n\nWe utilize [vLLM](https://docs.vllm.ai/en/latest/index.html) to\naccelerate inference and an elaborate answer extraction and correctness\njudgement pipeline based on regular 
expressions and\n[SymPy](https://www.sympy.org) symbolic calculation, which is able to\ncorrectly process\n\n- most **mathematical objects** such as matrices (vectors), intervals,\n  symbols besides numbers,\n- as well as some **special texts** like bool expressions, dates and\n  times.\n\nFor example, to reproduce one pass of greedy decoding with\n`DART-Math-Mistral-7B-Prop2Diff` on the 6 benchmarks in Table 2 on GPU\n0, please run the following command:\n\n``` shell\nCUDA_VISIBLE_DEVICES=\"0\" python pipeline/gen.py \\\n    --gen_save_path \"data/res/dart-math-mistral-7b-prop2diff.jsonl\" \\\n    --model_name_or_path \"hkust-nlp/dart-math-mistral-7b-prop2diff\" \\\n    --datasets \"math/test\" \"gsm8k/test\" \"mwpbench/college-math/test\" \"deepmind-mathematics\" \\\n        \"olympiadbench/OE_TO_maths_en_COMP\" \"theoremqa\" \\\n    --max_new_toks 2048 --temperature 0 \\\n    --prompt_template \"cot\" --n_shots -1 \\\n    --inf_seed -1 \\\n    --max_n_trials 1\n```\n\nTo reproduce other inference settings, just refer to the paper and\nmodify the `--model_name_or_path` and `--gen_save_path` arguments\naccordingly.\n\n- We observed that Llama-3-8B(-Base) tends to decode EoS immediately\n  sometimes. 
Try using `--ignore_eos` as a workaround.\n\nFor other general inference settings, please modify the command or\ndirectly modify the\n[script](https://github.com/hkust-nlp/dart-math/blob/main/pipeline/gen.py).\n\n- To test **base** models, please add the corresponding **ID** to\n  `BASE_MODEL_IDS` from\n  [dart_math.utils](https://github.com/hkust-nlp/dart-math/blob/main/dart_math/utils.py).\n- To test **instruct** models, please add the corresponding **prompt\n  template** to `PROMPT_TEMPLATE_ID2DICT` from\n  [dart_math.utils](https://github.com/hkust-nlp/dart-math/blob/main/dart_math/utils.py)\n  and specify with `--prompt_template`.\n\nYou can also add the `--gen_only` option to only generate responses\nwithout evaluation and use the\n[`EvaluatorMathBatch`](https://hkust-nlp.github.io/dart-math/eval.html#evaluatormathbatch)\nto grade the generations yourself. Please check the [grading\nscript](pipeline/grade.py) for an example.\n\n### 🗂 Data Synthesis\n\nOur data synthesis pipeline is compatible with the evaluation pipeline;\nplease **modify the `--min_n_corrects` and `--max_n_trials` arguments**\nto meet your needs.\n\nFor example, to reproduce the **synthesis of `DART-Math-Uniform`**,\ndistributing the workload across multiple GPUs, please run the following\ncommand:\n\n``` shell\ngpu_ids_list=(\"0\" \"1\" \"2\" \"3\" \"4\" \"5\" \"6\" \"7\")\nmin_n_corrects=40\nmin_n_corrects_per_gpu=$((min_n_corrects / ${#gpu_ids_list[@]})) # 5 here\n\nmkdir -p logs\nfor gpu_ids in \"${gpu_ids_list[@]}\"; do\n    exp_name=\"dart-math-uniform-gpu${gpu_ids}\"\n    CUDA_VISIBLE_DEVICES=\"${gpu_ids}\" python pipeline/gen.py \\\n        --gen_save_path \"data/res/${exp_name}.jsonl\" \\\n        --model_name_or_path \"deepseek-ai/deepseek-math-7b-rl\" \\\n        --datasets \"math/train\" \"gsm8k-fix/train\" \\\n        --max_new_toks 2048 --temperature 1.6 --top_p 0.95 \\\n        --prompt_template \"deepseekmath\" --n_shots 0 \\\n        --inf_seed -1 \\\n        --min_n_corrects 
\"${min_n_corrects_per_gpu}\" --max_n_trials 0 \\\n        \u003e\"logs/${exp_name}.log\" 2\u003e\u00261 \u0026\n    # NOTE: `--max_n_trials 0` means possibly infinite trials; kill the job manually when needed\ndone\n```\n\n\u003csup\u003eNOTE: Some **erroneous labels** exist in the GSM8K dataset, so we\ntried to fix them and produced\n[`gsm8k-fix`](https://huggingface.co/datasets/hkust-nlp/gsm8k-fix).\n\u003c/sup\u003e\n\nTo reproduce the data synthesis of the **Vanilla Rejection Tuning (VRT)\nbaseline** in the paper, just set\n`--max_n_trials 52 --min_n_corrects 0`.\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\n\nThe off-the-shelf command to reproduce the synthesis of the Vanilla\nRejection Tuning (VRT) baseline in the paper\n\u003c/summary\u003e\n\n``` shell\nCUDA_VISIBLE_DEVICES=\"0\" python pipeline/gen.py \\\n    --gen_save_path \"data/res/dart-math-uniform.jsonl\" \\\n    --model_name_or_path \"deepseek-ai/deepseek-math-7b-rl\" \\\n    --datasets \"math/train\" \"gsm8k-fix/train\" \\\n    --max_new_toks 2048 --temperature 1.6 --top_p 0.95 \\\n    --prompt_template \"cot\" --n_shots 0 \\\n    --inf_seed -1 \\\n    --max_n_trials 52 --min_n_corrects 0 # no requirement for correct responses\n```\n\n\u003c/details\u003e\n\n\u003cdetails\u003e\n\n\u003csummary\u003e\n\nSorry, reproducing the data\nsynthesis of `DART-Math-Prop2Diff` still requires some manual effort. For now, please follow the\ninstructions in the paper\n\u003c/summary\u003e\n\n1.  Calculate “fail rate” (`1-pass_rate`) for each query in MATH and\n    GSM8K training sets (see the `pass_rate` field of query information\n    in\n    [MATH](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-math-query-info)\n    and\n    [GSM8K](https://huggingface.co/datasets/hkust-nlp/dart-math-pool-gsm8k-query-info)).\n2.  Calculate the target number of correct responses for each query in\n    the final training set. 
Note that we try to ensure at least one\n    correct response for each query in the `DART-Math` datasets, which\n    you could implement by rounding **up** when calculating the response\n    number for each query.\n3.  Sample responses for each query until the target number of correct\n    ones is met (thus proportional to its “fail rate”).\n\n\u003c/details\u003e\n\nAfter the synthesis, you can use the [curation\nscript](pipeline/curate.py) to curate the final dataset.\n\n## [`dart-math` Package](https://hkust-nlp.github.io/dart-math): Efficient and Flexible Training \u0026 Inference \u0026 Evaluation Pipelines\n\nWe package our code of efficient and flexible training \u0026 inference \u0026\nevaluation pipelines into `dart-math` and document it at [this\nwebsite](https://hkust-nlp.github.io/dart-math/quick-start.html).\n\nThe `dart-math` package provides the following useful features besides\nthose mentioned above:\n\n### **Tool-integrated reasoning**: reasoning in natural language interleaved with Python code\n\nExample command to evaluate DeepSeekMath-7B-RL with tool-integrated\nreasoning (following the DeepSeekMath official setting):\n\n``` shell\nCUDA_VISIBLE_DEVICES=\"0\" python pipeline/gen.py \\\n    --gen_save_path \"data/res/dsmath-7b-rl-tool-math-test.jsonl\" \\\n    --model_name_or_path \"deepseek-ai/deepseek-math-7b-rl\" \\\n    --datasets \"math-test\" \\\n    --max_new_toks 2048 --temperature 0 \\\n    --prompt_template \"deepseekmath-tool\" --n_shots 0 \\\n    --max_n_calls 1 --trunc_len 50 50 \\\n    --inf_seed -1 \\\n    --max_n_trials 1\n# Reproduced performance (with our evaluator): 56.08%\n# (58.8% reported originally with DeepSeekMath evaluator)\n```\n\nFor other general inference settings, please modify the options related\nto the [`Generator.code_exec_cfg`\nattribute](https://hkust-nlp.github.io/dart-math/gen.html#:~:text=means%20no%20evaluation.-,code_exec_cfg,-dart_math.exec.CodeExecCfg)\nin the command or 
the\n[script](https://github.com/hkust-nlp/dart-math/blob/main/pipeline/gen.py).\n\n## 🍀 Contribution\n\n### File Structure\n\n``` tree\ndart-math\n├── data\n├── cfgs # Configurations\n├── utils # Repository utilities\n├── dart_math # Package code for common utilities\n├── nbs # Notebooks and other files to run tests and generate documentation with https://nbdev.fast.ai\n├── pipeline # Reusable (Python / Shell) scripts or notebooks\n└── scripts # Setting-specific scripts\n```\n\n### Checklist Before Commit\n\n#### [`prepare-commit.sh`](utils/prepare-commit.sh)\n\nRun the [`prepare-commit.sh`](utils/prepare-commit.sh) to clean the\nnotebooks and export scripts for pipeline notebooks, generate\ndocumentation, run tests, render README if needed:\n\n``` shell\nbash utils/prepare-commit.sh\n```\n\nPlease refer to the comments in the script for how it works.\n\n#### Manual Modification List\n\n- Add `if __name__ == \"__main__\":` to scripts that might use vLLM tensor\n  parallelism\n  - [`gen.py`](pipeline/gen.py)\n\n## 🌟 Star History\n\n\u003ca href=\"https://star-history.com/#hkust-nlp/dart-math\u0026Date\"\u003e \u003cpicture\u003e\n\u003csource media=\"(prefers-color-scheme: dark)\" srcset=\"https://api.star-history.com/svg?repos=hkust-nlp/dart-math\u0026type=Date\u0026theme=dark\" /\u003e\n\u003csource media=\"(prefers-color-scheme: light)\" srcset=\"https://api.star-history.com/svg?repos=hkust-nlp/dart-math\u0026type=Date\" /\u003e\n\u003cimg alt=\"Star History Chart\" src=\"https://api.star-history.com/svg?repos=hkust-nlp/dart-math\u0026type=Date\" /\u003e\n\u003c/picture\u003e \u003c/a\u003e\n\n## 🙏 Acknowledgements\n\nThanks to:\n\n- [`nbdev`](https://nbdev.fast.ai/) for generating the [wonderful\n  documentation website](https://hkust-nlp.github.io/dart-math),\n- [`stanford_alpaca`](https://github.com/tatsu-lab/stanford_alpaca) for\n  reference code about training,\n- [`functionary`](https://github.com/MeetKai/functionary/tree/main/functionary/train/packing)\n  
for reference code about [sequence\n  packing](https://hkust-nlp.github.io/dart-math/train.html#sequence-packing).\n- @HYZ17 for extensive tests and helpful suggestions.\n\n## ☕️ Citation\n\nIf you find our data, model or code useful for your work, please kindly\ncite [our paper](https://arxiv.org/abs/2407.13690):\n\n``` latex\n@article{tong2024dartmath,\n  title={DART-Math: Difficulty-Aware Rejection Tuning for Mathematical Problem-Solving},\n  author={Yuxuan Tong and Xiwen Zhang and Rui Wang and Ruidong Wu and Junxian He},\n  year={2024},\n  eprint={2407.13690},\n  archivePrefix={arXiv},\n  primaryClass={cs.CL},\n  url={https://arxiv.org/abs/2407.13690},\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkust-nlp%2Fdart-math","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fhkust-nlp%2Fdart-math","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fhkust-nlp%2Fdart-math/lists"}