{"id":20215935,"url":"https://github.com/thudm/rest-mcts","last_synced_at":"2025-05-15T18:10:29.621Z","repository":{"id":243157360,"uuid":"811240578","full_name":"THUDM/ReST-MCTS","owner":"THUDM","description":"ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search (NeurIPS 2024)","archived":false,"fork":false,"pushed_at":"2025-01-20T06:01:03.000Z","size":3283,"stargazers_count":610,"open_issues_count":24,"forks_count":48,"subscribers_count":3,"default_branch":"main","last_synced_at":"2025-04-11T23:58:36.416Z","etag":null,"topics":[],"latest_commit_sha":null,"homepage":"","language":"Python","has_issues":true,"has_wiki":null,"has_pages":null,"mirror_url":null,"source_name":null,"license":null,"status":null,"scm":"git","pull_requests_enabled":true,"icon_url":"https://github.com/THUDM.png","metadata":{"files":{"readme":"README.md","changelog":null,"contributing":null,"funding":null,"license":null,"code_of_conduct":null,"threat_model":null,"audit":null,"citation":null,"codeowners":null,"security":null,"support":null,"governance":null,"roadmap":null,"authors":null,"dei":null,"publiccode":null,"codemeta":null}},"created_at":"2024-06-06T08:03:14.000Z","updated_at":"2025-04-11T23:05:30.000Z","dependencies_parsed_at":"2024-06-07T03:23:33.622Z","dependency_job_id":"3efed852-8c2f-4cf0-9acc-503a5af778bd","html_url":"https://github.com/THUDM/ReST-MCTS","commit_stats":null,"previous_names":["zhangdan0602/rest-mcts"],"tags_count":0,"template":false,"template_full_name":null,"repository_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FReST-MCTS","tags_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FReST-MCTS/tags","releases_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FReST-MCTS/releases","manifests_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories/THUDM%2FReST-MCTS/manifests","owner_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners/THUDM","download_url":"https://codelo
ad.github.com/THUDM/ReST-MCTS/tar.gz/refs/heads/main","host":{"name":"GitHub","url":"https://github.com","kind":"github","repositories_count":254394724,"owners_count":22063984,"icon_url":"https://github.com/github.png","version":null,"created_at":"2022-05-30T11:31:42.601Z","updated_at":"2022-07-04T15:15:14.044Z","host_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub","repositories_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repositories","repository_names_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/repository_names","owners_url":"https://repos.ecosyste.ms/api/v1/hosts/GitHub/owners"}},"keywords":[],"created_at":"2024-11-14T06:25:33.089Z","updated_at":"2025-05-15T18:10:29.541Z","avatar_url":"https://github.com/THUDM.png","language":"Python","funding_links":[],"categories":[],"sub_categories":[],"readme":"# ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search\n\n\u003cp align=\"center\"\u003e\n📃 \u003ca href=\"https://arxiv.org/abs/2406.03816\" target=\"_blank\"\u003e[ReST-MCTS*]\u003c/a\u003e \n\u003ca href=\"https://github.com/THUDM/ReST-MCTS\" target=\"_blank\"\u003e[GitHub]\u003c/a\u003e\n\u003ca href=\"https://rest-mcts.github.io/\" target=\"_blank\"\u003e[Website]\u003c/a\u003e \u003cbr\u003e\n\u003c/p\u003e\n\nWe develop a reinforced self-training approach, called **ReST-MCTS***, based on integrating process reward guidance with tree search MCTS* for collecting higher-quality reasoning traces as well as per-step value to train policy and reward models. **ReST-MCTS*** circumvents the per-step manual annotation typically used to train process rewards by tree-search-based reinforcement learning: Given oracle final correct answers, **ReST-MCTS*** is able to infer the correct process rewards by estimating the probability this step can help lead to the correct answer. 
These inferred rewards serve dual purposes: they act as value targets for further refining the process reward model and also facilitate the selection of high-quality traces for policy model self-training.\n\n![](./assets/overall.png)\n\n## **Table of Contents**\n\n- [Key Differences](#introduction)\n- [Getting Started](#started)\n- [Data \u0026 Model](#data\u0026model)\n- [Self-training](#Self-training)\n- [Leaderboard](#Leaderboard)\n- [Citation](#Citation)\n\n## **Key Differences**\nWe summarize the key differences between existing self-improvement methods and our approach. Train refers to whether a reward model is trained.\n![](./assets/comparison.png)\n\n## **Getting Started**\n\n### **Prepare Env**\nBecause Mistral (or Llama) and SciGLM depend on different versions of `transformers`, you should create separate environments through miniconda and install the corresponding required packages:\n \nFor running Mistral (or Llama):\n```bash\npip install -r requirements_mistral.txt\n```\n\nFor running SciGLM:\n```bash\npip install -r requirements_sciglm.txt\n```\nNote that for some models on Hugging Face, such as the GLM series, you may need to install specific versions of `transformers`.\n\nThe Python version for running GLM is 3.11. 
The Python version for running Mistral or Llama is 3.12.\n\n### **Model Implementation**\n#### **MCTS\\* Search**\nTo run MCTS* search, you should implement a policy as well as a process reward model (value model).\nYou can download the initial checkpoints and set these models directly by providing the model paths in the file `models/model.py`, substituting `INFERENCE_MODEL_DIR`, `VALUE_BASE_MODEL_DIR` and `VALUE_MODEL_STATE_DICT`.\n\n##### **Policy Model**\n`INFERENCE_MODEL_DIR` is the local path to the policy model, which can be \u003ca href=\"https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/\" target=\"_blank\"\u003e[Llama3-8B-Instruct]\u003c/a\u003e, \u003ca href=\"https://huggingface.co/meta-math/MetaMath-Mistral-7B\" target=\"_blank\"\u003e[Mistral-7B: MetaMATH]\u003c/a\u003e, or \u003ca href=\"https://huggingface.co/zd21/SciGLM-6B\" target=\"_blank\"\u003e[SciGLM-6B]\u003c/a\u003e.\n\n##### **Process Reward Model**\n`VALUE_BASE_MODEL_DIR` is the local path to the value model. Because of the different dependency versions of `transformers`, Mistral-7B is adopted as the backbone of the value model when the policy model is \u003ca href=\"https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct/\" target=\"_blank\"\u003e[Llama3-8B-Instruct]\u003c/a\u003e or \u003ca href=\"https://huggingface.co/meta-math/MetaMath-Mistral-7B\" target=\"_blank\"\u003e[Mistral-7B: MetaMATH]\u003c/a\u003e. When the policy model is \u003ca href=\"https://huggingface.co/zd21/SciGLM-6B\" target=\"_blank\"\u003e[SciGLM-6B]\u003c/a\u003e, we use \u003ca href=\"https://huggingface.co/THUDM/chatglm3-6b\" target=\"_blank\"\u003e[ChatGLM3-6B]\u003c/a\u003e as the backbone of the value model.\n\nTo gather value training data for science, we integrate questions of a lean science dataset $D_{sci}$ within \u003ca href=\"https://rest-mcts.github.io/\" target=\"_blank\"\u003e[SciInstruct]\u003c/a\u003e to construct $D_{V_0}$. 
This dataset consists of 11,554 questions, where each question is paired with a correct step-by-step solution. (See **Fine-grained dataset for science and math.** in Section 4.1 of \u003ca href=\"https://arxiv.org/pdf/2406.03816\" target=\"_blank\"\u003e[the paper]\u003c/a\u003e for more details.)\n\nYou can download [[$D_{V_0}$](https://huggingface.co/datasets/zd21/ReST-MCTS-PRM-0th)] and put it in `PRM/data` to train Mistral-7B as the initial process reward model and obtain `VALUE_MODEL_STATE_DICT`.\nWe also provide `PRM/train_VM_chatglm.py` and `PRM/train_VM_mistral.py`.\n\nThe experimental settings are as follows:\n\nFor ChatGLM3-6B, the learning rate (lr) is 2e-5, the number of epochs is 2 or 3, and the batch size is 3.\n\nFor Mistral, the learning rate (lr) is 3e-6, the number of epochs is 2 or 3, and the batch size is 3.\n\n##### **Model Setting**\nWe currently provide implementations of only `llama`, `glm` and `mistral` as policy models, and of `glm` and `mistral` as value models, in `models/model.py`.\nIf you want to try other models, you can refer to our implementation and modify the relevant code to implement the corresponding models.\nOnce you've implemented the policy and value models, you should set `LOCAL_INFERENCE_IDX` and `LOCAL_VALUE_IDX` in `models/model.py` to the corresponding model index.\n\n### **Data Preparation**\nBefore running search for evaluation or generation, you have to make sure your target question dataset is in the correct format. \nThe data file should be a json file with items in the following format:\n```json\n{\n  \"content\": \"Calculate the sum of the first 10 prime numbers.\",\n  \"answer\": \"129\"\n}\n```\nThe `content` entry is required and serves as the question. The `answer` entry is optional and is used for evaluation.\n\n### **Run MCTS\\* Search**\nThe implementation of MCTS* search can be found in `MCTS`. We provide a search interface in `MCTS/task.py`. 
To run MCTS* search for a single question, you can refer to the following script:\n\n```python\nfrom MCTS.task import *\nquestion = \"Calculate the sum of the first 10 prime numbers.\"\ntask = MCTS_Task(question, 'llama', 'local', lang='en')\noutput = task.run()\nprint(output['solution'])\n```\n\nFor evaluation of MCTS* on benchmarks, you can refer to `evaluate.py`, setting the parameter `--mode` to \"mcts\". You should specify the benchmark name and the exact file (subset) you want to evaluate. A simple demonstration is provided below:\n```bash\npython evaluate.py \\\n  --task_name \"scibench\" \\\n  --file \"thermo\" \\\n  --propose_method \"gpt\" \\\n  --value_method \"local\" \\\n  --mode \"mcts\" \\\n  --evaluate \"scibench\" \\\n  --iteration_limit 50 \\\n  --use_reflection \"simple\" \\\n  --branch 3\n```\nYou can also refer to `MCTS/args.md` for more details on the search parameters.\n\n## **Data \u0026 Model (take Llama3-8B-Instruct as an example)**\nGiven a question set $D_G$, we use three backbones (Llama3-8B-Instruct, Mistral-7B: MetaMATH, and SciGLM-6B) guided by MCTS* to generate synthetic data for the policy model and value model. (See **Algorithm 1** of \u003ca href=\"https://arxiv.org/pdf/2406.03816\" target=\"_blank\"\u003e[the paper]\u003c/a\u003e for more details.)\n\n### Policy Data\n\nDownload policy data for training and comparing the 1st policy model. Note that CoT and MCTS include only positive samples. 
DPO includes both positive and negative samples.\n\n| Backbone             | Iteration | Self-Training        | Full Name                                                     | Link                                                                                                                 |\n|----------------------|-----------|----------------------|---------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|\n| Llama3-8b-Instruct   | 1st       | ReST-EM (CoT)        | ReST-MCTS_Llama3-8b-Instruct_ReST-EM-CoT_1st                  | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_ReST-EM-CoT_1st)]                  |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_Llama3-8b-Instruct_Self-Rewarding-DPO_1st           | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_Self-Rewarding-DPO_1st)]           |\n|                      |           | ReST-MCTS            | ReST-MCTS_Llama3-8b-Instruct_ReST-MCTS_Policy_1st             | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_ReST-MCTS_Policy_1st)]             |\n|                      |           |                      |                                                               |                                                                                                                      |\n| Mistral: MetaMATH-7b | 1st       | ReST-EM (CoT)        | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-EM-CoT_1st        | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-EM-CoT_1st)]        |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_Self-Rewarding-DPO_1st | [[Hugging 
Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_Self-Rewarding-DPO_1st)] |\n|                      |           | ReST-MCTS            | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-MCTS_1st          | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-MCTS_1st)]          |\n|                      |           |                      |                                                               |                                                                                                                      |\n| SciGLM-6B            | 1st       | ReST-EM (CoT)        | ReST-MCTS_SciGLM-6B_ReST-EM-CoT_1st                           | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_ReST-EM-CoT_1st)]                           |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_SciGLM-6B_Self-Rewarding-DPO_1st                    | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_Self-Rewarding-DPO_1st)]                    |\n|                      |           | ReST-MCTS            | ReST-MCTS_SciGLM-6B_ReST-MCTS_Policy_1st                      | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_ReST-MCTS_Policy_1st)]                      |\n|                      |           |                      |                                                               |                                                                                                                      |\n|                      |           |                      |                                                               |                                                                                                                      |\n\n\nDownload policy data for training and comparing 2nd policy model. 
\n\n| Backbone             | Iteration | Self-Training        | Full Name                                                     | Link                                                                                                                 |\n|----------------------|-----------|----------------------|---------------------------------------------------------------|----------------------------------------------------------------------------------------------------------------------|\n| Llama3-8b-Instruct   |  2nd       | ReST-EM (CoT)        | ReST-MCTS_Llama3-8b-Instruct_ReST-EM-CoT_2nd                  | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_ReST-EM-CoT_2nd)]                  |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_Llama3-8b-Instruct_Self-Rewarding-DPO_2nd           | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_Self-Rewarding-DPO_2nd)]           |\n|                      |           | ReST-MCTS            | ReST-MCTS_Llama3-8b-Instruct_ReST-MCTS_Policy_2nd             | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Llama3-8b-Instruct_ReST-MCTS_Policy_2nd)]             |\n|                      |           |                      |                                                               |                                                                                                                      |\n| Mistral: MetaMATH-7b |  2nd       | ReST-EM (CoT)        | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-EM-CoT_2nd        | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-EM-CoT_2nd)]        |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_Self-Rewarding-DPO_2nd | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_Self-Rewarding-DPO_2nd)] |\n|                      |           | 
ReST-MCTS            | ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-MCTS_2nd          | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_Mistral-MetaMATH-7b-Instruct_ReST-MCTS_2nd)]          |\n|                      |           |                      |                                                               |                                                                                                                      |\n| SciGLM-6B            |  2nd       | ReST-EM (CoT)        | ReST-MCTS_SciGLM-6B_ReST-EM-CoT_2nd                           | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_ReST-EM-CoT_2nd)]                           |\n|                      |           | Self-Rewarding (DPO) | ReST-MCTS_SciGLM-6B_Self-Rewarding-DPO_2nd                    | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_Self-Rewarding-DPO_2nd)]                    |\n|                      |           | ReST-MCTS            | ReST-MCTS_SciGLM-6B_ReST-MCTS_Policy_2nd                      | [[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS_SciGLM-6B_ReST-MCTS_Policy_2nd)]                      |\n|                      |           |                      |                                                               |                                                                                                                      |\n|                      |           |                      |                                                               |                                                                                                                      |\n\n\n### PRM Data\nDownload PRM data (positive and negative samples) for training 1st reward model (Llama3-8b-Instruct):\n[[Hugging Face](https://huggingface.co/datasets/zd21/ReST-MCTS-Llama3-8b-Instruct-PRM-1st)]\n\n\u003c!-- ### Policy Model\nDownload the trained policy model:\n[[Hugging 
Face](https://huggingface.co/zd21/ReST-MCTS-Llama3-8b-Instruct-Policy-1st)] --\u003e\n\n## **Self-training**\nFor our methods:\n\nFor Llama3-8B-Instruct and Mistral-7B: MetaMATH, we use the default repo of \u003ca href=\"https://github.com/TIGER-AI-Lab/MAmmoTH\" target=\"_blank\"\u003e[MAmmoTH]\u003c/a\u003e to train and evaluate the policy model.\n\nFor SciGLM-6B, we use the default repo of \u003ca href=\"https://github.com/THUDM/SciGLM\" target=\"_blank\"\u003e[SciGLM]\u003c/a\u003e to train and evaluate the policy model.\n\nWe also implement self-rewarding as our baseline in `./self_train/self_train_dpo.py`.\n\n## **Leaderboard**\n\nSelf-training Results:\n\n![](./assets/results.png)\n\nAccuracy of Different Verifiers:\n\n![](./assets/vm_results.png)\n\nAccuracy of Different Searches (we also provide the plot code in `figures/plot_math_self_training.py`):\n\n![](./assets/searches.png)\n\n## **Citation**\n\nIf you find our work helpful, please kindly cite our paper:\n\n```\n@article{zhang2024rest,\n  title={ReST-MCTS*: LLM Self-Training via Process Reward Guided Tree Search},\n  author={Zhang, Dan and Zhoubian, Sining and Hu, Ziniu and Yue, Yisong and Dong, Yuxiao and Tang, Jie},\n  journal={arXiv preprint arXiv:2406.03816},\n  year={2024}\n}\n```\n","project_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Frest-mcts","html_url":"https://awesome.ecosyste.ms/projects/github.com%2Fthudm%2Frest-mcts","lists_url":"https://awesome.ecosyste.ms/api/v1/projects/github.com%2Fthudm%2Frest-mcts/lists"}