https://github.com/om-ai-lab/open-agent-leaderboard

Reproducible Language Agent Research
Host: GitHub
URL: https://github.com/om-ai-lab/open-agent-leaderboard
Owner: om-ai-lab
Created: 2025-01-03T08:01:24.000Z (over 1 year ago)
Default Branch: main
Last Pushed: 2025-06-25T09:24:39.000Z (about 1 year ago)
Last Synced: 2025-06-25T09:37:20.449Z (about 1 year ago)
Topics: agent, agent-leaderboard, chain-of-thought, chatgpt, doubao, graph-of-thoughts, gsm8k, language-agent, llm, program-of-thoughts, react, reasoning, tree-of-thoughts
Language: Python
Homepage:
Size: 34.2 MB
Stars: 27
Watchers: 5
Forks: 2
Open Issues: 0
Metadata Files:
- Readme: README.md

          


# 🏅 Open Agent Leaderboard





  [🤗 HF Leaderboard](https://huggingface.co/spaces/omlab/open-agent-leaderboard)

  [📄 Paper]



## 🎉 Updates

- 2025/5/23: The paper "Unifying Language Agent Algorithms with Graph-based Orchestration Engine for Reproducible Agent Research" was accepted to the ACL 2025 System Demonstrations Track. 🎉

- 2025/2/11: Added deepseek-r1:1.5b, the MATH-500 dataset, and the ToT algorithm to the leaderboard.

- 2025/1/23: Added gpt-4o, Qwen2.5-72B-Instruct, Qwen2.5-7B-Instruct, Qwen2-1.5B-Instruct, Qwen2-0.5B-Instruct, Llama-3.3-70B-Instruct, Llama-3.1-8B-Instruct, and Internllm2_5-7B to the leaderboard.

- 2025/1/07: The Open Agent Leaderboard was released.

## 📖 Introduction

This project aims to provide a fair comparison of various agents by evaluating their performance on different datasets and LLMs. Built on top of the [OmAgent](https://github.com/om-ai-lab/OmAgent) framework, it allows for simple, quick, and accurate assessments of agents.

Supported benchmark datasets:

- [gsm8k](https://huggingface.co/datasets/openai/gsm8k)

- [AQuA](https://github.com/google-deepmind/AQuA)

- [MATH-500](https://huggingface.co/datasets/HuggingFaceH4/MATH-500)

Supported algorithms:

- IO: Input-Output Direct Prompting (Baseline)

- [CoT: Chain-of-thought prompting elicits reasoning in large language models](https://arxiv.org/abs/2201.11903), [Large Language Models are Zero-Shot Reasoners](https://arxiv.org/pdf/2205.11916)

- [SC-CoT: Self-Consistency Improves Chain of Thought Reasoning in Language Models](https://arxiv.org/abs/2203.11171)

- [PoT: Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks](https://arxiv.org/abs/2211.12588)

- [ReAct: ReAct: Synergizing Reasoning and Acting in Language Models](https://arxiv.org/abs/2210.03629)

- [ToT: Tree of Thoughts: Deliberate Problem Solving with Large Language Models](https://arxiv.org/abs/2305.10601)

Supported LLMs:

- gpt-3.5-turbo

- gpt-4o

- Doubao-lite-32k

- Qwen2.5-72B-Instruct

- Qwen2.5-7B-Instruct

- Qwen2-1.5B-Instruct

- Qwen2-0.5B-Instruct

- Llama-3.3-70B-Instruct

- Llama-3.1-8B-Instruct

- Internllm2_5-7B

- deepseek-r1:1.5b

## 🌟 Graph-based Workflow Orchestration Engine

At the core of AGORA is a graph-based orchestration engine designed for modularity and scalability. As shown in the figure below, the system uses a Directed Acyclic Graph (DAG) where each node represents a task. Tasks are either simple tasks—developer-defined custom logic—or logical tasks—built-in control flows such as branching and looping.

Built on the Conductor library, this engine provides visual representations of workflows, making agent behavior intuitive to trace and debug. It also supports asynchronous, distributed execution, which is ideal for managing long-running, complex agent workflows. 

![image](figs/OPS5.png)
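To make the DAG idea concrete, here is a minimal sketch in plain Python. It is not the OmAgent or Conductor API: the task names, the `TASKS` registry, and `run_workflow` are all hypothetical, and the standard-library `graphlib` stands in for the real engine. Three "simple tasks" carry custom logic, one "logical task" makes a branching decision, and the graph is executed in dependency order.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical task type: each task reads and updates a shared state dict.
Task = Callable[[dict], None]


def load_question(state: dict) -> None:
    # Simple task: developer-defined custom logic.
    state["question"] = "What is 2 + 3?"


def reason(state: dict) -> None:
    # Simple task: stands in for an LLM call.
    state["answer"] = 2 + 3


def branch(state: dict) -> None:
    # Logical task: a control-flow decision based on the current state.
    state["route"] = "report" if "answer" in state else "retry"


def report(state: dict) -> None:
    # Simple task: formats the final result.
    state["report"] = f"{state['question']} -> {state['answer']}"


TASKS: Dict[str, Task] = {
    "load_question": load_question,
    "reason": reason,
    "branch": branch,
    "report": report,
}

# Each node maps to the set of nodes that must finish before it runs.
DEPENDENCIES: Dict[str, Set[str]] = {
    "reason": {"load_question"},
    "branch": {"reason"},
    "report": {"branch"},
}


def run_workflow(
    tasks: Dict[str, Task], dependencies: Dict[str, Set[str]]
) -> Tuple[List[str], dict]:
    """Run every task once, in an order that respects the DAG edges."""
    state: dict = {}
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name](state)
    return order, state


order, state = run_workflow(TASKS, DEPENDENCIES)
print(order)
print(state["report"])
```

A real engine would additionally persist state, retry failed nodes, and distribute work across workers; the sketch only shows the ordering guarantee a DAG provides.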

## 🏅 Leaderboards

**Math tasks**

| **Rank** | **Algorithm** | **LLM** | **Eval Date** | **Avg Score** | **gsm8k-Score** | **gsm8k-Cost($)** | **AQuA-Score** | **AQuA-Cost($)** | **MATH-500-Score** | **MATH-500-Cost($)** |
|----------|---------------|---------|---------------|---------------|-----------------|-------------------|----------------|------------------|--------------------|----------------------|
| **1** | SC-CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 86.67 | 94.77 | 4.045 | 85.43 | 0.4186 | 79.8 | 1.8504 |
| **2** | CoT | Qwen2.5-72B-Instruct | 2025/1/22 | 86.43 | 92.87 | 0.7195 | 86.22 | 0.0808 | 80.2 | 0.349 |
| **3** | SC-CoT | gpt-4o | 2025/1/22 | 85.07 | 94.77 | 18.2044 | 85.83 | 5.2456 | 74.6 | 12.3611 |
| **4** | SC-CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 84.09 | 95.22 | 3.7895 | 84.65 | 0.4438 | 72.4 | 1.7845 |
| **5** | CoT | Llama-3.3-70B-Instruct | 2025/1/22 | 82.86 | 93.93 | 0.687 | 83.46 | 0.0927 | 71.2 | 0.3463 |
| **6** | CoT | gpt-4o | 2025/1/22 | 81.59 | 94.09 | 4.5367 | 82.68 | 1.0417 | 68 | 3.0569 |
| **7** | IO | Llama-3.3-70B-Instruct | 2025/1/22 | 81.45 | 92.27 | 0.4709 | 82.68 | 0.0798 | 69.4 | 0.2386 |
| **8** | SC-CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 80.57 | 90.98 | 0 | 79.53 | 0 | 71.2 | 0 |
| **9** | IO | Qwen2.5-72B-Instruct | 2025/1/22 | 80.34 | 86.58 | 0.4899 | 84.25 | 0.0742 | 70.2 | 0.2506 |
| **10** | CoT | Qwen2.5-7B-Instruct | 2025/1/22 | 78.73 | 85.67 | 0 | 80.71 | 0 | 69.8 | 0 |
| **11** | SC-CoT | Doubao-lite-32k | 2025/1/7 | 77.92 | 91.58 | 0.1118 | 76.37 | 0.0279 | 65.8 | 0.0734 |
| **12** | ReAct-Pro* | Llama-3.3-70B-Instruct | 2025/1/22 | 77.12 | 87.64 | 10.1124 | 79.13 | 0.768 | 64.6 | 3.1806 |
| **13** | CoT | Doubao-lite-32k | 2025/1/7 | 77 | 89.31 | 0.0558 | 82.68 | 0.0066 | 59 | 0.0255 |
| **14** | ReAct-Pro* | Qwen2.5-72B-Instruct | 2025/1/22 | 74.43 | 87.26 | 10.5479 | 73.23 | 0.3177 | 62.8 | 3.4541 |
| **15** | PoT | Qwen2.5-72B-Instruct | 2025/1/22 | 71.58 | 92.34 | 0.7054 | 75.2 | 0.1645 | 47.2 | 0.233 |
| **16** | PoT | gpt-4o | 2025/1/22 | 71.5 | 93.1 | 4.2166 | 75.2 | 1.6087 | 46.2 | 1.5994 |
| **17** | ReAct-Pro* | Doubao-lite-32k | 2025/1/7 | 70.12 | 85.6 | 0.2512 | 77.56 | 0.0445 | 47.2 | 0.186 |
| **18** | ReAct-Pro* | Qwen2.5-7B-Instruct | 2025/1/22 | 68.69 | 82.87 | 0 | 74.41 | 0 | 48.8 | 0 |
| **19** | IO | gpt-4o | 2025/1/22 | 68.6 | 88.4 | 3.3463 | 75.59 | 1.1453 | 41.8 | 2.7907 |
| **20** | IO | Qwen2.5-7B-Instruct | 2025/1/22 | 65.13 | 57.24 | 0 | 78.74 | 0 | 59.4 | 0 |
| **21** | PoT | Llama-3.3-70B-Instruct | 2025/1/22 | 65.07 | 73.09 | 0.9736 | 79.53 | 0.1746 | 42.6 | 0.2839 |
| **22** | CoT | deepseek-r1:1.5b | 2025/1/23 | 63.9 | 70.66 | 0 | 71.65 | 0 | 49.4 | 0 |
| **23** | IO | Doubao-lite-32k | 2025/1/7 | 62.85 | 72.02 | 0.0354 | 79.13 | 0.0058 | 37.4 | 0.0187 |
| **24** | PoT | Doubao-lite-32k | 2025/1/7 | 61.29 | 79.61 | 0.0576 | 71.65 | 0.0147 | 32.6 | 0.0144 |
| **25** | ToT | Qwen2.5-72B-Instruct | 2025/1/22 | 60.26 | 88.88 | 23.5911 | 81.1 | 3.7389 | 10.8 | 9.0421 |
| **26** | CoT | gpt-3.5-turbo | 2025/1/7 | 59.84 | 78.7 | 0.6788 | 61.02 | 0.0957 | 39.8 | 0.3189 |
| **27** | CoT | Internllm2_5-7B | 2025/1/22 | 59.02 | 77.71 | 0 | 52.76 | 0 | 46.6 | 0 |
| **28** | IO | deepseek-r1:1.5b | 2025/1/22 | 58.95 | 64.14 | 0 | 68.9 | 0 | 43.8 | 0 |
| **29** | ToT | Llama-3.3-70B-Instruct | 2025/1/22 | 58.79 | 91.89 | 20.8753 | 83.07 | 2.9404 | 1.4 | 8.2699 |
| **30** | ToT | gpt-4o | 2025/1/22 | 58.61 | 91.13 | 86.8581 | 81.5 | 8.5295 | 3.2 | 40.8094 |
| **31** | ReAct-Pro* | gpt-4o | 2025/1/22 | 58.26 | 63.31 | 39.0751 | 57.48 | 2.304 | 54 | 17.7735 |
| **32** | SC-CoT | deepseek-r1:1.5b | 2025/2/10 | 57.91 | 69.07 | 0 | 57.87 | 0 | 46.8 | 0 |
| **33** | SC-CoT | gpt-3.5-turbo | 2025/1/7 | 56.25 | 69.29 | 2.5203 | 58.66 | 0.3277 | 40.8 | 1.2308 |
| **34** | PoT | Qwen2.5-7B-Instruct | 2025/1/22 | 55.51 | 58.83 | 0 | 68.11 | 0 | 39.6 | 0 |
| **35** | PoT | gpt-3.5-turbo | 2025/1/7 | 55.04 | 76.88 | 0.6902 | 59.45 | 0.1748 | 28.8 | 0.168 |
| **36** | ReAct-Pro* | gpt-3.5-turbo | 2025/1/7 | 54.43 | 74.91 | 3.4633 | 64.57 | 0.4928 | 23.8 | 2.0406 |
| **37** | CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 53.96 | 75.44 | 0 | 60.63 | 0 | 25.8 | 0 |
| **38** | ReAct-Pro* | Llama-3.1-8B-Instruct | 2025/1/22 | 50.7 | 67.78 | 0 | 55.51 | 0 | 28.8 | 0 |
| **39** | IO | Llama-3.1-8B-Instruct | 2025/1/22 | 48.98 | 57.16 | 0 | 51.18 | 0 | 38.6 | 0 |
| **40** | ToT | gpt-3.5-turbo | 2025/1/7 | 44.94 | 67.93 | 9.1707 | 57.09 | 1.1513 | 9.8 | 5.2914 |
| **41** | SC-CoT | Llama-3.1-8B-Instruct | 2025/1/22 | 44.54 | 54.36 | 0 | 59.45 | 0 | 19.8 | 0 |
| **42** | ToT | Qwen2.5-7B-Instruct | 2025/1/22 | 42.52 | 72.21 | 0 | 53.94 | 0 | 1.4 | 0 |
| **43** | ToT | Llama-3.1-8B-Instruct | 2025/1/22 | 41.97 | 65.05 | 0 | 59.06 | 0 | 1.8 | 0 |
| **44** | ReAct-Pro* | deepseek-r1:1.5b | 2025/2/10 | 38.22 | 35.94 | 0 | 54.33 | 0 | 24.4 | 0 |
| **45** | CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 37.08 | 55.5 | 0 | 40.55 | 0 | 15.2 | 0 |
| **46** | PoT | Llama-3.1-8B-Instruct | 2025/1/22 | 33.56 | 38.67 | 0 | 36.61 | 0 | 25.4 | 0 |
| **47** | IO | gpt-3.5-turbo | 2025/1/7 | 31.34 | 37.83 | 0.3328 | 38.98 | 0.038 | 17.2 | 0.2436 |
| **48** | SC-CoT | Internllm2_5-7B | 2025/1/22 | 30.81 | 44.66 | 0 | 38.58 | 0 | 9.2 | 0 |
| **49** | PoT | Internllm2_5-7B | 2025/1/22 | 29.94 | 38.21 | 0 | 36.61 | 0 | 15 | 0 |
| **50** | ReAct-Pro* | Internllm2_5-7B | 2025/1/22 | 29.75 | 33.51 | 0 | 40.94 | 0 | 14.8 | 0 |
| **51** | ToT | Doubao-lite-32k | 2025/1/7 | 28.1 | 37.83 | 0.8739 | 45.28 | 0.0881 | 1.2 | 0.2371 |
| **52** | IO | Internllm2_5-7B | 2025/1/22 | 27.35 | 11.6 | 0 | 47.64 | 0 | 22.8 | 0 |
| **53** | CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 25.07 | 35.94 | 0 | 33.07 | 0 | 6.2 | 0 |
| **54** | PoT | deepseek-r1:1.5b | 2025/2/10 | 22.54 | 11.9 | 0 | 54.72 | 0 | 1 | 0 |
| **55** | ReAct-Pro* | Qwen2-1.5B-Instruct | 2025/1/22 | 19.55 | 24.87 | 0 | 25.59 | 0 | 8.2 | 0 |
| **56** | ToT | Internllm2_5-7B | 2025/1/22 | 18.96 | 20.85 | 0 | 35.83 | 0 | 0.2 | 0 |
| **57** | IO | Qwen2-1.5B-Instruct | 2025/1/22 | 17.6 | 16.68 | 0 | 29.13 | 0 | 7 | 0 |
| **58** | ToT | Qwen2-1.5B-Instruct | 2025/1/22 | 17.31 | 19.64 | 0 | 31.5 | 0 | 0.8 | 0 |
| **59** | PoT | Qwen2-1.5B-Instruct | 2025/1/22 | 16.67 | 18.5 | 0 | 30.71 | 0 | 0.8 | 0 |
| **60** | ToT | deepseek-r1:1.5b | 2025/2/10 | 16.11 | 23.12 | 0 | 24.8 | 0 | 0.4 | 0 |
| **61** | IO | Qwen2-0.5B-Instruct | 2025/1/22 | 14.83 | 14.71 | 0 | 27.17 | 0 | 2.6 | 0 |
| **62** | ReAct-Pro* | Qwen2-0.5B-Instruct | 2025/1/22 | 10.76 | 7.66 | 0 | 24.02 | 0 | 0.6 | 0 |
| **63** | ToT | Qwen2-0.5B-Instruct | 2025/1/22 | 9.97 | 0 | 0 | 29.92 | 0 | 0 | 0 |
| **64** | PoT | Qwen2-0.5B-Instruct | 2025/1/22 | 8.98 | 9.63 | 0 | 17.32 | 0 | 0 | 0 |
| **65** | SC-CoT | Qwen2-0.5B-Instruct | 2025/1/22 | 7.9 | 4.17 | 0 | 17.32 | 0 | 2.2 | 0 |
| **66** | SC-CoT | Qwen2-1.5B-Instruct | 2025/1/22 | 6.94 | 8.19 | 0 | 10.63 | 0 | 2 | 0 |

Per-run evaluation details are listed in the [Evaluation Details](#evaluation-details) section and on the [Hugging Face leaderboard](https://huggingface.co/spaces/omlab/open-agent-leaderboard).

- IO (Input-Output) is the baseline method that directly prompts the model with the question and expects an answer without any intermediate reasoning steps. It represents the most basic way of using language models and serves as a reference point for evaluating the effectiveness of other algorithms.

- ReAct-Pro\*: We modified ReAct into ReAct-Pro, following the [Reflexion](https://github.com/noahshinn/reflexion) repository. A comparison with the original ReAct repository can be found in the [Compare to ReAct](#comparison-react-with-react-pro) section.

![Leaderboard Visualization](figs/average_score_vs_cost_by_algorithm_llm_2.png)
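The README does not state how the Avg Score column is computed, but the published rows are consistent with an unweighted mean of the three dataset scores rounded to two decimals. The sketch below checks that reading against two leaderboard rows and also sums the per-dataset costs; the function names are illustrative and not part of the repository.

```python
def average_score(gsm8k: float, aqua: float, math500: float) -> float:
    """Unweighted mean of the three dataset scores, rounded to 2 decimals.

    Assumption: this matches the leaderboard's Avg Score column; it is
    inferred from the published rows, not taken from the repository code.
    """
    return round((gsm8k + aqua + math500) / 3, 2)


def total_cost(gsm8k_cost: float, aqua_cost: float, math500_cost: float) -> float:
    """Sum of the three per-dataset costs in USD, rounded to 4 decimals."""
    return round(gsm8k_cost + aqua_cost + math500_cost, 4)


# Rank 1: SC-CoT with Qwen2.5-72B-Instruct
print(average_score(94.77, 85.43, 79.8))   # listed Avg Score: 86.67
print(total_cost(4.045, 0.4186, 1.8504))

# Rank 2: CoT with Qwen2.5-72B-Instruct
print(average_score(92.87, 86.22, 80.2))   # listed Avg Score: 86.43
print(total_cost(0.7195, 0.0808, 0.349))
```

On these two rows CoT trails SC-CoT by about a quarter of a point while costing several times less, which is the kind of trade-off the score-versus-cost figure is meant to show.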

## 🛠️ How to Install

1. Clone the repository:

   ```bash
   git clone https://github.com/om-ai-lab/open-agent-leaderboard.git
   cd open-agent-leaderboard
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## 🏗️ How to Evaluate Agents

### Step 1. Implement your agent in the [`omagent`](https://github.com/om-ai-lab/OmAgent) repository

Clone the OmAgent repository and move into it:

    git clone https://github.com/om-ai-lab/OmAgent.git

    cd OmAgent

Set up the environment:

    pip install -e omagent-core

Implement your agent in the [`omagent`](https://github.com/om-ai-lab/OmAgent) repository. The `examples/cot` folder provides a reference implementation.

### Step 2. Run inference in the OmAgent repository

Run the inference script (using CoT as an example):

    cd examples/cot

    python eval_demo.py --model_id your_model_id --dataset_name your_dataset_name --dataset_path your_dataset_path --output_path your_output_path --output_name your_output_name --cot_method your_cot_method

#### Output Format

The output results are saved in JSON format and include the following fields:

- `id`: The unique identifier of the sample.

- `question`: The input question provided to the model.

- `last_output`: The raw output generated by the model.

- `output_postprocess` (optional): The processed output after cleansing.

- `ground_truth` (optional): The correct answer for the sample.

- `prompt_tokens`: The number of tokens in the input prompt.

- `completion_tokens`: The number of tokens in the model's output.

Example of an output JSON file:

```json
{
  "dataset": "gsm8k",
  "model_id": "gpt-3.5-turbo",
  "alg": "COT",
  "model_result": [
    {
      "id": 1,
      "question": "Q: There are 15 trees in the grove. Grove workers will plant trees in the grove today.....",
      "last_output": "Janet's ducks lay 16 eggs per day. She eats 3 for breakfast and uses 4 to bake muffins,...",
      "output_postprocess": "Paris",
      "ground_truth": "Paris",
      "prompt_tokens": 10,
      "completion_tokens": 5
    }
  ]
}
```
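Before passing a results file to the evaluation step, it can help to check it against the field list above. The following is a minimal sketch, not the repository's own loader: `summarize`, `estimate_cost`, and the exact-match accuracy rule are assumptions made for illustration, and the per-million-token prices are placeholders you would replace with your provider's rates.

```python
import json

# Fields documented above as always present in each sample.
REQUIRED_FIELDS = ("id", "question", "last_output", "prompt_tokens", "completion_tokens")


def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_million_input: float, usd_per_million_output: float) -> float:
    """Token-based cost in USD. The prices are caller-supplied assumptions."""
    return (input_tokens * usd_per_million_input
            + output_tokens * usd_per_million_output) / 1_000_000


def summarize(results: dict) -> dict:
    """Validate samples and aggregate token counts and exact-match accuracy."""
    samples = results["model_result"]
    for sample in samples:
        missing = [field for field in REQUIRED_FIELDS if field not in sample]
        if missing:
            raise ValueError(f"sample {sample.get('id')} is missing {missing}")

    # Only samples carrying both optional fields can be scored.
    scored = [s for s in samples if "output_postprocess" in s and "ground_truth" in s]
    correct = sum(
        str(s["output_postprocess"]).strip() == str(s["ground_truth"]).strip()
        for s in scored
    )
    return {
        "samples": len(samples),
        "input_tokens": sum(s["prompt_tokens"] for s in samples),
        "output_tokens": sum(s["completion_tokens"] for s in samples),
        # Assumption: plain exact string match, reported as a percentage.
        "accuracy": 100.0 * correct / len(scored) if scored else None,
    }


EXAMPLE = json.loads("""
{
  "dataset": "gsm8k",
  "model_id": "gpt-3.5-turbo",
  "alg": "COT",
  "model_result": [
    {"id": 1, "question": "Q: ...", "last_output": "...",
     "output_postprocess": "Paris", "ground_truth": "Paris",
     "prompt_tokens": 10, "completion_tokens": 5}
  ]
}
""")

summary = summarize(EXAMPLE)
print(summary)
print(estimate_cost(summary["input_tokens"], summary["output_tokens"], 0.5, 1.5))
```

As a sanity check on the cost formula, assumed prices of $0.50 and $1.50 per million input and output tokens applied to the gpt-3.5-turbo IO row on gsm8k (546,990 input and 39,563 output tokens) give roughly $0.3328, the cost listed in the evaluation details table.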

### Step 3. Evaluate inference results

Run the main script to perform evaluations:

```bash
python main.py --dataset <dataset> --model <model> --method <method> --output_dir <output_dir>
```

#### Parameters

- `--random_seed`: Random seed, default is 1.

- `--dataset`: Dataset to use, options are `aqua`, `gsm8k`, `math500`.

- `--minibatch_size`: Minibatch size, default is 1.

- `--max_num_worker`: Maximum number of workers for the data loader, default is 4.

- `--model`: Model used for decoding, options are `gpt-4o-mini`, `gpt-4o`, `gpt-3.5-turbo`.

- `--method`: Method, options are `zero_shot`, `zero_shot_cot`, `few_shot`, `few_shot_cot`.

- `--cot_trigger_no`: Trigger sentence number for chain of thought, default is 1.

- `--max_length`: Maximum length of model output, default is 2048.

- `--max_length_direct`: Maximum length of direct model answer, default is 32.

- `--limit_dataset_size`: Whether to limit the test dataset size, default is 0 (no limit).

- `--output_dir`: Output directory, default is `./outputs/`.

- `--output_path`: Output path, default is empty.

- `--agent`: Agent used for the experiment, options are `cot`, `pot`, `sc_cot`, `react`.

- `--system_prompt`: System prompt, default is empty.

- `--openai_api_key`: OpenAI API key, default is empty.

- `--openai_url`: OpenAI API URL, default is `https://api.openai.com/v1`.
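For readers who want to wrap or script the evaluator, the documented flags can be mirrored with `argparse`. This is a hypothetical reconstruction built only from the parameter list above, not a copy of the parser in `main.py`; flags without a documented default are set to `None` here rather than guessed.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Mirror of the documented CLI flags (illustrative, not the real parser)."""
    p = argparse.ArgumentParser(description="Evaluate agent inference results")
    p.add_argument("--random_seed", type=int, default=1)
    p.add_argument("--dataset", choices=["aqua", "gsm8k", "math500"], default=None)
    p.add_argument("--minibatch_size", type=int, default=1)
    p.add_argument("--max_num_worker", type=int, default=4)
    p.add_argument("--model", choices=["gpt-4o-mini", "gpt-4o", "gpt-3.5-turbo"], default=None)
    p.add_argument("--method",
                   choices=["zero_shot", "zero_shot_cot", "few_shot", "few_shot_cot"],
                   default=None)
    p.add_argument("--cot_trigger_no", type=int, default=1)
    p.add_argument("--max_length", type=int, default=2048)
    p.add_argument("--max_length_direct", type=int, default=32)
    p.add_argument("--limit_dataset_size", type=int, default=0)  # 0 means no limit
    p.add_argument("--output_dir", default="./outputs/")
    p.add_argument("--output_path", default="")
    p.add_argument("--agent", choices=["cot", "pot", "sc_cot", "react"], default=None)
    p.add_argument("--system_prompt", default="")
    p.add_argument("--openai_api_key", default="")
    p.add_argument("--openai_url", default="https://api.openai.com/v1")
    return p


# Parse a command line with an explicit results file, dataset, and method.
args = build_parser().parse_args([
    "--output_path", "example/gsm8k_results_cot.json",
    "--dataset", "gsm8k",
    "--method", "few_shot_cot",
])
print(args.dataset, args.method, args.max_length, args.output_dir)
```

Declaring `choices` makes a mistyped dataset or method name fail at parse time instead of partway through an evaluation run.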

#### Example

```bash
python main.py --output_path example/gsm8k_results_cot.json --dataset gsm8k --method few_shot_cot
```

### Evaluation details

| **Algorithm**  | **Dataset** | **Eval Date** | **LLM**                | **Score** | **Pass rate** | **X-shot** | **Parameters**                                                                                                                | **Samples** | **Total input tokens** | **Average input tokens** | **Total output tokens** | **Average output tokens** | **All tokens** | **Cost($)** |

|----------------|-------------|---------------|------------------------|-----------|---------------|------------|-------------------------------------------------------------------------------------------------------------------------------|-------------|------------------------|--------------------------|-------------------------|---------------------------|----------------|-------------|

| **IO**         | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 37.83     | 99.92         | 8          |                                                                                                                               | 1,319       | 546,990                | 415                      | 39,563                  | 30                        | 586,553        | 0.3328      |

| **IO**         | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 72.02     | 99.92         | 8          |                                                                                                                               | 1,319       | 617,377                | 468                      | 123,106                 | 93                        | 740,483        | 0.0354      |

| **IO**         | gsm8k       | 2025/1/22     | gpt-4o                 | 88.4      | 100           | 8          |                                                                                                                               | 1,319       | 542,416                | 411                      | 199,030                 | 151                       | 741,446        | 3.3463      |

| **IO**         | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 86.58     | 100           | 8          |                                                                                                                               | 1,319       | 555,340                | 421                      | 313,720                 | 238                       | 869,060        | 0.4899      |

| **IO**         | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 92.27     | 100           | 8          |                                                                                                                               | 1,319       | 583,916                | 443                      | 251,359                 | 191                       | 835,275        | 0.4709      |

| **IO**         | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 57.24     | 100           | 8          |                                                                                                                               | 1,319       | 596,229                | 452                      | 291,684                 | 221                       | 887,913        | 0           |

| **IO**         | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 57.16     | 99.55         | 8          |                                                                                                                               | 1,319       | 550,941                | 418                      | 1,194,488               | 906                       | 1,745,429      | 0           |

| **IO**         | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 11.6      | 97.95         | 8          |                                                                                                                               | 1,319       | 679,302                | 515                      | 434,426                 | 329                       | 1,113,728      | 0           |

| **IO**         | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 16.68     | 100           | 8          |                                                                                                                               | 1,319       | 568,530                | 431                      | 168,466                 | 128                       | 736,996        | 0           |

| **IO**         | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | 14.71     | 100           | 8          |                                                                                                                               | 1,319       | 568,116                | 431                      | 266,781                 | 202                       | 834,897        | 0           |

| **IO**         | gsm8k       | 2025/1/22     | deepseek-r1:1.5b       | 64.14     | 99.62         | 8          |                                                                                                                               | 1,319       | 561,935                | 426                      | 921,116                 | 698                       | 1,483,051      | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 74.91     | 99.39         | 8          | max_steps=10                                                                                                                  | 1,319       | 6,506,164              | 4,933                    | 140,122                 | 106                       | 6,646,286      | 3.4633      |

| **ReAct-Pro*** | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 85.6      | 99.62         | 8          | max_steps=10                                                                                                                  | 1,319       | 5,862,016              | 4,444                    | 136,623                 | 104                       | 5,998,639      | 0.2512      |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | gpt-4o                 | 63.31     | 99.55         | 8          | max_steps=10                                                                                                                  | 1,319       | 14,411,173             | 10,926                   | 304,714                 | 231                       | 14,715,887     | 39.0751     |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 87.26     | 100           | 8          | max_steps=10                                                                                                                  | 1,319       | 18,160,983             | 13,769                   | 549,454                 | 417                       | 18,710,437     | 10.5479     |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 87.64     | 99.92         | 8          | max_steps=10                                                                                                                  | 1,319       | 17,038,928             | 12,918                   | 898,936                 | 682                       | 17,937,864     | 10.1124     |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 82.87     | 100           | 8          | max_steps=10                                                                                                                  | 1,319       | 14,355,752             | 10,884                   | 495,162                 | 375                       | 14,850,914     | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 67.78     | 98.56         | 8          | max_steps=10                                                                                                                  | 1,319       | 21,044,978             | 15,955                   | 1,790,789               | 1,358                     | 22,835,767     | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 33.51     | 97.95         | 8          | max_steps=10                                                                                                                  | 1,319       | 30,120,070             | 22,836                   | 5,549,919               | 4,208                     | 35,669,989     | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 24.87     | 80.21         | 8          | max_steps=10                                                                                                                  | 1,319       | 9,133,603              | 6,925                    | 694,398                 | 526                       | 9,828,001      | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | 7.66      | 95.22         | 8          | max_steps=10                                                                                                                  | 1,319       | 52,431,343             | 39,751                   | 2,961,268               | 2,245                     | 55,392,611     | 0           |

| **ReAct-Pro*** | gsm8k       | 2025/2/10     | deepseek-r1:1.5b       | 35.94     | 99.62         | 8          | max_steps=10                                                                                                                  | 1,319       | 19,299,381             | 14,632                   | 4,919,696               | 3,730                     | 24,219,077     | 0           |

| **PoT**        | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 76.88     | 99.24         | 8          |                                                                                                                               | 1,319       | 1,090,418              | 827                      | 96,662                  | 73                        | 1,187,080      | 0.6902      |

| **PoT**        | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 79.61     | 92.57         | 8          |                                                                                                                               | 1,319       | 1,170,038              | 887                      | 118,017                 | 89                        | 1,288,055      | 0.0576      |

| **PoT**        | gsm8k       | 2025/1/22     | gpt-4o                 | 93.1      | 99.77         | 8          |                                                                                                                               | 1,319       | 1,101,672              | 835                      | 146,240                 | 111                       | 1,247,912      | 4.2166      |
| **PoT**        | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 92.34     | 99.39         | 8          |                                                                                                                               | 1,319       | 1,106,682              | 839                      | 144,528                 | 110                       | 1,251,210      | 0.7054      |
| **PoT**        | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 73.09     | 79.61         | 8          |                                                                                                                               | 1,319       | 1,126,025              | 854                      | 601,019                 | 456                       | 1,727,044      | 0.9736      |
| **PoT**        | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 58.83     | 70.51         | 8          |                                                                                                                               | 1,319       | 1,145,390              | 868                      | 217,432                 | 165                       | 1,362,822      | 0           |
| **PoT**        | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 38.67     | 55.42         | 8          |                                                                                                                               | 1,319       | 1,147,538              | 870                      | 243,573                 | 185                       | 1,391,111      | 0           |
| **PoT**        | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 38.21     | 48.9          | 8          |                                                                                                                               | 1,319       | 1,136,843              | 862                      | 188,106                 | 143                       | 1,324,949      | 0           |
| **PoT**        | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 18.5      | 31.01         | 8          |                                                                                                                               | 1,319       | 1,151,528              | 873                      | 175,994                 | 133                       | 1,327,522      | 0           |
| **PoT**        | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | 9.63      | 16.91         | 8          |                                                                                                                               | 1,319       | 1,151,528              | 873                      | 237,607                 | 180                       | 1,389,135      | 0           |
| **PoT**        | gsm8k       | 2025/2/10     | deepseek-r1:1.5b       | 11.9      | 17.44         | 8          |                                                                                                                               | 1,319       | 1,138,872              | 863                      | 815,637                 | 618                       | 1,954,509      | 0           |
| **CoT**        | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 78.7      | 100           | 8          |                                                                                                                               | 1,319       | 953,242                | 723                      | 134,799                 | 102                       | 1,088,041      | 0.6788      |
| **CoT**        | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 89.31     | 100           | 8          |                                                                                                                               | 1,319       | 1,042,095              | 790                      | 159,725                 | 121                       | 1,201,820      | 0.0558      |
| **CoT**        | gsm8k       | 2025/1/22     | gpt-4o                 | 94.09     | 100           | 8          |                                                                                                                               | 1,319       | 948,668                | 719                      | 216,498                 | 164                       | 1,165,166      | 4.5367      |
| **CoT**        | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 92.87     | 100           | 8          |                                                                                                                               | 1,319       | 1,005,119              | 762                      | 271,133                 | 206                       | 1,276,252      | 0.7195      |
| **CoT**        | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 93.93     | 100           | 8          |                                                                                                                               | 1,319       | 990,168                | 751                      | 228,497                 | 173                       | 1,218,665      | 0.687       |
| **CoT**        | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 85.67     | 100           | 8          |                                                                                                                               | 1,319       | 1,046,008              | 793                      | 244,797                 | 186                       | 1,290,805      | 0           |
| **CoT**        | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 75.44     | 99.92         | 8          |                                                                                                                               | 1,319       | 990,168                | 751                      | 258,161                 | 196                       | 1,248,329      | 0           |
| **CoT**        | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 77.71     | 99.7          | 8          |                                                                                                                               | 1,319       | 968,163                | 734                      | 234,000                 | 177                       | 1,202,163      | 0           |
| **CoT**        | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 55.5      | 100           | 8          |                                                                                                                               | 1,319       | 1,032,818              | 783                      | 185,707                 | 141                       | 1,218,525      | 0           |
| **CoT**        | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | 35.94     | 99.92         | 8          |                                                                                                                               | 1,319       | 1,032,818              | 783                      | 190,641                 | 145                       | 1,223,459      | 0           |
| **CoT**        | gsm8k       | 2025/1/23     | deepseek-r1:1.5b       | 70.66     | 99.77         | 8          |                                                                                                                               | 1,319       | 1,011,714              | 767                      | 1,078,911               | 818                       | 2,090,625      | 0           |
| **SC-CoT**     | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 69.29     | 98.79         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 895,571                | 679                      | 1,381,678               | 1,048                     | 2,277,249      | 2.5203      |
| **SC-CoT**     | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 91.58     | 99.92         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 942,182                | 714                      | 893,709                 | 678                       | 1,835,891      | 0.1118      |
| **SC-CoT**     | gsm8k       | 2025/1/22     | gpt-4o                 | 94.77     | 100           | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 894,889                | 678                      | 1,596,716               | 1,211                     | 2,491,605      | 18.2044     |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 94.77     | 100           | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,370,360              | 4,072                    | 1,804,898               | 1,368                     | 7,175,258      | 4.045       |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 95.22     | 100           | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,295,585              | 4,015                    | 1,426,429               | 1,081                     | 6,722,014      | 3.7895      |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 90.98     | 100           | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,580,524              | 4,231                    | 1,679,419               | 1,273                     | 7,259,943      | 0           |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 54.36     | 99.85         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,136,762              | 3,894                    | 5,819,672               | 4,412                     | 10,956,434     | 0           |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 44.66     | 91.81         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,847,761              | 4,433                    | 2,314,738               | 1,755                     | 8,162,499      | 0           |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 8.19      | 68.76         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,439,568              | 4,124                    | 1,946,885               | 1,476                     | 7,386,453      | 0           |
| **SC-CoT**     | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | 4.17      | 94.47         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,441,962              | 4,126                    | 2,036,805               | 1,544                     | 7,478,767      | 0           |
| **SC-CoT**     | gsm8k       | 2025/2/10     | deepseek-r1:1.5b       | 69.07     | 98.79         | 8          | temperature=1, path_num=5                                                                                                     | 1,319       | 5,407,357              | 4,100                    | 4,622,327               | 3,504                     | 10,029,684     | 0           |
| **ToT**        | gsm8k       | 2025/1/7      | gpt-3.5-turbo          | 67.93     | 99.7          | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 15,920,037             | 12,070                   | 807,138                 | 612                       | 16,727,175     | 9.1707      |
| **ToT**        | gsm8k       | 2025/1/7      | Doubao-lite-32k        | 37.83     | 87.34         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 19,208,597             | 14,563                   | 1,065,752               | 808                       | 20,274,349     | 0.8739      |
| **ToT**        | gsm8k       | 2025/1/22     | gpt-4o                 | 91.13     | 100           | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 29,445,237             | 22,324                   | 1,324,498               | 1,004                     | 30,769,735     | 86.8581     |
| **ToT**        | gsm8k       | 2025/1/22     | Qwen2.5-72B-Instruct   | 88.88     | 100           | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 40,435,361             | 30,656                   | 1,411,787               | 1,070                     | 41,847,148     | 23.5911     |
| **ToT**        | gsm8k       | 2025/1/22     | Llama-3.3-70B-Instruct | 91.89     | 100           | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 35,096,810             | 26,609                   | 1,932,877               | 1,465                     | 37,029,687     | 20.8753     |
| **ToT**        | gsm8k       | 2025/1/22     | Qwen2.5-7B-Instruct    | 72.21     | 99.01         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 20,196,528             | 15,312                   | 11,460,791              | 8,689                     | 31,657,319     | 0           |
| **ToT**        | gsm8k       | 2025/1/22     | Llama-3.1-8B-Instruct  | 65.05     | 91.96         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 15,554,967             | 11,793                   | 877,135                 | 665                       | 16,432,102     | 0           |
| **ToT**        | gsm8k       | 2025/1/22     | Internllm2_5-7B        | 20.85     | 70.13         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 11,768,118             | 8,922                    | 1,410,011               | 1,069                     | 13,178,129     | 0           |
| **ToT**        | gsm8k       | 2025/1/22     | Qwen2-1.5B-Instruct    | 19.64     | 77.26         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 12,124,248             | 9,192                    | 634,439                 | 481                       | 12,758,687     | 0           |
| **ToT**        | gsm8k       | 2025/1/22     | Qwen2-0.5B-Instruct    | -         | -             | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | -                      | -                        | -                       | -                         | -              | -           |
| **ToT**        | gsm8k       | 2025/2/10     | deepseek-r1:1.5b       | 23.12     | 72.48         | 8          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 1,319       | 2,738,244              | 2,076                    | 683,242                 | 518                       | 3,421,486      | 0           |
| **IO**         | AQuA        | 2025/1/7      | gpt-3.5-turbo          | 38.98     | 100           | 0          |                                                                                                                               | 254         | 25,701                 | 101                      | 16,770                  | 66                        | 42,471         | 0.038       |
| **IO**         | AQuA        | 2025/1/7      | Doubao-lite-32k        | 79.13     | 100           | 0          |                                                                                                                               | 254         | 33,058                 | 130                      | 54,684                  | 215                       | 87,742         | 0.0058      |
| **IO**         | AQuA        | 2025/1/22     | gpt-4o                 | 75.59     | 97.24         | 0          |                                                                                                                               | 254         | 25,631                 | 101                      | 108,121                 | 426                       | 133,752        | 1.1453      |
| **IO**         | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 84.25     | 99.61         | 0          |                                                                                                                               | 254         | 25,397                 | 100                      | 106,207                 | 418                       | 131,604        | 0.0742      |
| **IO**         | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 82.68     | 99.21         | 0          |                                                                                                                               | 254         | 32,809                 | 129                      | 108,758                 | 428                       | 141,567        | 0.0798      |
| **IO**         | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 78.74     | 98.43         | 0          |                                                                                                                               | 254         | 33,271                 | 131                      | 104,500                 | 411                       | 137,771        | 0           |
| **IO**         | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 51.18     | 98.82         | 0          |                                                                                                                               | 254         | 26,459                 | 104                      | 106,647                 | 420                       | 133,106        | 0           |
| **IO**         | AQuA        | 2025/1/22     | Internllm2_5-7B        | 47.64     | 90.94         | 0          |                                                                                                                               | 254         | 50,232                 | 198                      | 134,809                 | 531                       | 185,041        | 0           |
| **IO**         | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 29.13     | 97.64         | 0          |                                                                                                                               | 254         | 27,937                 | 110                      | 43,110                  | 170                       | 71,047         | 0           |
| **IO**         | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 27.17     | 98.82         | 0          |                                                                                                                               | 254         | 27,937                 | 110                      | 82,478                  | 325                       | 110,415        | 0           |
| **IO**         | AQuA        | 2025/1/22     | deepseek-r1:1.5b       | 68.9      | 94.88         | 0          |                                                                                                                               | 254         | 26,667                 | 105                      | 325,100                 | 1,280                     | 351,767        | 0           |
| **CoT**        | AQuA        | 2025/1/7      | gpt-3.5-turbo          | 61.02     | 93.7          | 0          |                                                                                                                               | 254         | 25,447                 | 100                      | 55,346                  | 218                       | 80,793         | 0.0957      |
| **CoT**        | AQuA        | 2025/1/7      | Doubao-lite-32k        | 82.68     | 97.24         | 0          |                                                                                                                               | 254         | 27,978                 | 110                      | 66,599                  | 262                       | 94,577         | 0.0066      |
| **CoT**        | AQuA        | 2025/1/22     | gpt-4o                 | 82.68     | 98.03         | 0          |                                                                                                                               | 254         | 25,123                 | 99                       | 97,894                  | 385                       | 123,017        | 1.0417      |
| **CoT**        | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 86.22     | 99.21         | 0          |                                                                                                                               | 254         | 25,143                 | 99                       | 118,146                 | 465                       | 143,289        | 0.0808      |
| **CoT**        | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 83.46     | 98.43         | 0          |                                                                                                                               | 254         | 32,555                 | 128                      | 131,834                 | 519                       | 164,389        | 0.0927      |
| **CoT**        | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 80.71     | 99.61         | 0          |                                                                                                                               | 254         | 33,017                 | 130                      | 116,719                 | 460                       | 149,736        | 0           |
| **CoT**        | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 60.63     | 100           | 0          |                                                                                                                               | 254         | 32,555                 | 128                      | 111,880                 | 440                       | 144,435        | 0           |
| **CoT**        | AQuA        | 2025/1/22     | Internllm2_5-7B        | 52.76     | 89.37         | 0          |                                                                                                                               | 254         | 26,610                 | 105                      | 100,910                 | 397                       | 127,520        | 0           |
| **CoT**        | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 40.55     | 98.82         | 0          |                                                                                                                               | 254         | 30,477                 | 120                      | 79,563                  | 313                       | 110,040        | 0           |
| **CoT**        | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 33.07     | 98.82         | 0          |                                                                                                                               | 254         | 30,477                 | 120                      | 86,862                  | 342                       | 117,339        | 0           |
| **CoT**        | AQuA        | 2025/1/23     | deepseek-r1:1.5b       | 71.65     | 96.85         | 0          |                                                                                                                               | 254         | 26,413                 | 104                      | 306,659                 | 1,207                     | 333,072        | 0           |
| **PoT**        | AQuA        | 2025/1/7      | gpt-3.5-turbo          | 59.45     | 100           | 0          |                                                                                                                               | 254         | 225,162                | 886                      | 41,492                  | 163                       | 266,654        | 0.1748      |
| **PoT**        | AQuA        | 2025/1/7      | Doubao-lite-32k        | 71.65     | 96.85         | 0          |                                                                                                                               | 254         | 259,863                | 1,023                    | 49,573                  | 195                       | 309,436        | 0.0147      |
| **PoT**        | AQuA        | 2025/1/22     | gpt-4o                 | 75.2      | 100           | 0          |                                                                                                                               | 254         | 222,717                | 877                      | 105,191                 | 414                       | 327,908        | 1.6087      |
| **PoT**        | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 75.2      | 100           | 0          |                                                                                                                               | 254         | 249,215                | 981                      | 42,549                  | 168                       | 291,764        | 0.1645      |
| **PoT**        | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 79.53     | 99.21         | 0          |                                                                                                                               | 254         | 240,735                | 948                      | 69,064                  | 272                       | 309,799        | 0.1746      |
| **PoT**        | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 68.11     | 100           | 0          |                                                                                                                               | 254         | 264,517                | 1,041                    | 49,211                  | 194                       | 313,728        | 0           |
| **PoT**        | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 36.61     | 96.85         | 0          |                                                                                                                               | 254         | 240,613                | 947                      | 50,301                  | 198                       | 290,914        | 0           |
| **PoT**        | AQuA        | 2025/1/22     | Internllm2_5-7B        | 36.61     | 98.82         | 0          |                                                                                                                               | 254         | 233,505                | 919                      | 68,457                  | 270                       | 301,962        | 0           |
| **PoT**        | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 30.71     | 96.46         | 0          |                                                                                                                               | 254         | 246,560                | 971                      | 51,915                  | 204                       | 298,475        | 0           |
| **PoT**        | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 17.32     | 92.13         | 0          |                                                                                                                               | 254         | 258,867                | 1,019                    | 63,414                  | 250                       | 322,281        | 0           |
| **PoT**        | AQuA        | 2025/2/10     | deepseek-r1:1.5b       | 54.72     | 97.24         | 0          |                                                                                                                               | 254         | 250,690                | 987                      | 765,957                 | 3,016                     | 1,016,647      | 0           |
| **SC-CoT**     | AQuA        | 2025/1/22     | gpt-3.5-turbo          | 58.66     | 92.52         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 27,906                 | 110                      | 209,160                 | 823                       | 237,066        | 0.3277      |
| **SC-CoT**     | AQuA        | 2025/1/22     | Doubao-lite-32k        | 76.37     | 91.73         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 31,703                 | 125                      | 325,136                 | 1,280                     | 356,839        | 0.0279      |
| **SC-CoT**     | AQuA        | 2025/1/22     | gpt-4o                 | 85.83     | 99.21         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 27,829                 | 110                      | 517,602                 | 2,038                     | 545,431        | 5.2456      |
| **SC-CoT**     | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 85.43     | 96.85         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 137,990                | 543                      | 604,562                 | 2,380                     | 742,552        | 0.4186      |
| **SC-CoT**     | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 84.65     | 99.61         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 175,050                | 689                      | 612,262                 | 2,410                     | 787,312        | 0.4438      |
| **SC-CoT**     | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 79.53     | 100           | 0          | temperature=1, path_num=5                                                                                                     | 254         | 177,972                | 701                      | 567,438                 | 2,234                     | 745,410        | 0           |
| **SC-CoT**     | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 59.45     | 95.67         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 145,108                | 571                      | 544,969                 | 2,146                     | 690,077        | 0           |
| **SC-CoT**     | AQuA        | 2025/1/22     | Internllm2_5-7B        | 38.58     | 97.24         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 264,557                | 1,042                    | 615,114                 | 2,422                     | 879,671        | 0           |
| **SC-CoT**     | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 10.63     | 51.57         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 151,410                | 596                      | 550,570                 | 2,168                     | 701,980        | 0           |
| **SC-CoT**     | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 17.32     | 82.28         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 150,787                | 594                      | 603,126                 | 2,375                     | 753,913        | 0           |
| **SC-CoT**     | AQuA        | 2025/2/10     | deepseek-r1:1.5b       | 57.87     | 74.02         | 0          | temperature=1, path_num=5                                                                                                     | 254         | 144,710                | 570                      | 1,987,401               | 7,824                     | 2,132,111      | 0           |
| **ReAct-Pro*** | AQuA        | 2025/1/7      | gpt-3.5-turbo          | 64.57     | 98.03         | 0          | max_steps=10                                                                                                                  | 254         | 862,614                | 3,396                    | 40,973                  | 161                       | 903,587        | 0.4928      |
| **ReAct-Pro*** | AQuA        | 2025/1/7      | Doubao-lite-32k        | 77.56     | 96.06         | 0          | max_steps=10                                                                                                                  | 254         | 977,890                | 3,850                    | 54,951                  | 216                       | 1,032,841      | 0.0445      |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | gpt-4o                 | 57.48     | 97.24         | 0          | max_steps=10                                                                                                                  | 254         | 615,589                | 2,424                    | 76,507                  | 301                       | 692,096        | 2.304       |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 73.23     | 100           | 0          | max_steps=10                                                                                                                  | 254         | 441,765                | 1,739                    | 121,838                 | 480                       | 563,603        | 0.3177      |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 79.13     | 99.61         | 0          | max_steps=10                                                                                                                  | 254         | 1,119,143              | 4,406                    | 243,236                 | 958                       | 1,362,379      | 0.768       |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 74.41     | 99.21         | 0          | max_steps=10                                                                                                                  | 254         | 564,165                | 2,221                    | 131,679                 | 518                       | 695,844        | 0           |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 55.51     | 96.85         | 0          | max_steps=10                                                                                                                  | 254         | 3,764,723              | 14,822                   | 576,098                 | 2,268                     | 4,340,821      | 0           |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Internllm2_5-7B        | 40.94     | 96.85         | 0          | max_steps=10                                                                                                                  | 254         | 3,592,039              | 14,142                   | 836,762                 | 3,294                     | 4,428,801      | 0           |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 25.59     | 96.06         | 0          | max_steps=10                                                                                                                  | 254         | 4,555,858              | 17,936                   | 516,146                 | 2,032                     | 5,072,004      | 0           |
| **ReAct-Pro*** | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 24.02     | 96.85         | 0          | max_steps=10                                                                                                                  | 254         | 6,344,167              | 24,977                   | 825,920                 | 3,252                     | 7,170,087      | 0           |
| **ReAct-Pro*** | AQuA        | 2025/2/10     | deepseek-r1:1.5b       | 54.33     | 96.46         | 0          | max_steps=10                                                                                                                  | 254         | 10,578,715             | 41,648                   | 3,866,326               | 15,222                    | 14,445,041     | 0           |
| **ToT**        | AQuA        | 2025/1/7      | gpt-3.5-turbo          | 57.09     | 99.61         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 1,850,767              | 7,286                    | 150,629                 | 593                       | 2,001,396      | 1.1513      |
| **ToT**        | AQuA        | 2025/1/7      | Doubao-lite-32k        | 45.28     | 74.02         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 1,850,249              | 7,284                    | 150,301                 | 592                       | 2,000,550      | 0.0881      |
| **ToT**        | AQuA        | 2025/1/22     | gpt-4o                 | 81.5      | 99.21         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 2,347,538              | 9,242                    | 266,069                 | 1,048                     | 2,613,607      | 8.5295      |
| **ToT**        | AQuA        | 2025/1/22     | Qwen2.5-72B-Instruct   | 81.1      | 99.21         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 6,371,642              | 25,085                   | 260,613                 | 1,026                     | 6,632,255      | 3.7389      |
| **ToT**        | AQuA        | 2025/1/22     | Llama-3.3-70B-Instruct | 83.07     | 100           | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 4,735,188              | 18,642                   | 480,660                 | 1,892                     | 5,215,848      | 2.9404      |
| **ToT**        | AQuA        | 2025/1/22     | Qwen2.5-7B-Instruct    | 53.94     | 100           | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 8,224,468              | 32,380                   | 378,214                 | 1,489                     | 8,602,682      | 0           |
| **ToT**        | AQuA        | 2025/1/22     | Llama-3.1-8B-Instruct  | 59.06     | 100           | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 4,896,222              | 19,276                   | 843,462                 | 3,321                     | 5,739,684      | 0           |
| **ToT**        | AQuA        | 2025/1/22     | Internllm2_5-7B        | 35.83     | 99.61         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 4,263,136              | 16,784                   | 471,424                 | 1,856                     | 4,734,560      | 0           |
| **ToT**        | AQuA        | 2025/1/22     | Qwen2-1.5B-Instruct    | 31.5      | 98.82         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 6,058,022              | 23,850                   | 192,680                 | 759                       | 6,250,702      | 0           |
| **ToT**        | AQuA        | 2025/1/22     | Qwen2-0.5B-Instruct    | 29.92     | 100           | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 8,100,085              | 31,890                   | 600,196                 | 2,363                     | 8,700,281      | 0           |
| **ToT**        | AQuA        | 2025/2/10     | deepseek-r1:1.5b       | 24.8      | 55.51         | 0          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 254         | 605,028                | 2,382                    | 189,484                 | 746                       | 794,512        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | gpt-3.5-turbo          | 17.2      | 100           | 4          |                                                                                                                               | 500         | 154,881                | 310                      | 110,744                 | 221                       | 265,625        | 0.2436      |
| **IO**         | MATH-500    | 2025/1/24     | Doubao-lite-32k        | 37.4      | 100           | 4          |                                                                                                                               | 500         | 166,870                | 334                      | 144,860                 | 290                       | 311,730        | 0.0187      |
| **IO**         | MATH-500    | 2025/1/22     | gpt-4o                 | 41.8      | 100           | 4          |                                                                                                                               | 500         | 153,832                | 308                      | 240,615                 | 481                       | 394,447        | 2.7907      |
| **IO**         | MATH-500    | 2025/1/24     | Qwen2.5-72B-Instruct   | 70.2      | 100           | 4          |                                                                                                                               | 500         | 169,549                | 339                      | 275,042                 | 550                       | 444,591        | 0.2506      |
| **IO**         | MATH-500    | 2025/1/24     | Llama-3.3-70B-Instruct | 69.4      | 100           | 4          |                                                                                                                               | 500         | 155,879                | 312                      | 267,337                 | 535                       | 423,216        | 0.2386      |
| **IO**         | MATH-500    | 2025/1/24     | Qwen2.5-7B-Instruct    | 59.4      | 100           | 4          |                                                                                                                               | 500         | 169,549                | 339                      | 241,813                 | 484                       | 411,362        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | Llama-3.1-8B-Instruct  | 38.6      | 100           | 4          |                                                                                                                               | 500         | 155,563                | 311                      | 348,371                 | 697                       | 503,934        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | Internllm2_5-7B        | 22.8      | 100           | 4          |                                                                                                                               | 500         | 201,883                | 404                      | 266,005                 | 532                       | 467,888        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | Qwen2-1.5B-Instruct    | 7         | 100           | 4          |                                                                                                                               | 500         | 158,777                | 318                      | 255,101                 | 510                       | 413,878        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | Qwen2-0.5B-Instruct    | 2.6       | 100           | 4          |                                                                                                                               | 500         | 159,049                | 318                      | 270,281                 | 541                       | 429,330        | 0           |
| **IO**         | MATH-500    | 2025/1/24     | deepseek-r1:1.5b       | 43.8      | 100           | 4          |                                                                                                                               | 500         | 157,049                | 314                      | 865,499                 | 1,731                     | 1,022,548      | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | gpt-3.5-turbo          | 39.8      | 100           | 4          |                                                                                                                               | 500         | 329,381                | 659                      | 102,815                 | 206                       | 432,196        | 0.3189      |
| **CoT**        | MATH-500    | 2025/1/22     | Doubao-lite-32k        | 59        | 100           | 4          |                                                                                                                               | 500         | 336,370                | 673                      | 143,571                 | 287                       | 479,941        | 0.0255      |
| **CoT**        | MATH-500    | 2025/1/24     | gpt-4o                 | 68        | 100           | 4          |                                                                                                                               | 500         | 329,332                | 659                      | 223,356                 | 447                       | 552,688        | 3.0569      |
| **CoT**        | MATH-500    | 2025/1/22     | Qwen2.5-72B-Instruct   | 80.2      | 100           | 4          |                                                                                                                               | 500         | 338,549                | 677                      | 280,466                 | 561                       | 619,015        | 0.349       |
| **CoT**        | MATH-500    | 2025/1/24     | Llama-3.3-70B-Instruct | 71.2      | 100           | 4          |                                                                                                                               | 500         | 342,879                | 686                      | 271,342                 | 543                       | 614,221        | 0.3463      |
| **CoT**        | MATH-500    | 2025/1/24     | Qwen2.5-7B-Instruct    | 69.8      | 100           | 4          |                                                                                                                               | 500         | 354,049                | 708                      | 263,155                 | 526                       | 617,204        | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | Llama-3.1-8B-Instruct  | 25.8      | 100           | 4          |                                                                                                                               | 500         | 342,879                | 686                      | 282,689                 | 565                       | 625,568        | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | Internllm2_5-7B        | 46.6      | 100           | 4          |                                                                                                                               | 500         | 332,883                | 666                      | 213,891                 | 428                       | 546,774        | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | Qwen2-1.5B-Instruct    | 15.2      | 100           | 4          |                                                                                                                               | 500         | 349,049                | 698                      | 187,328                 | 375                       | 536,377        | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | Qwen2-0.5B-Instruct    | 6.2       | 100           | 4          |                                                                                                                               | 500         | 349,049                | 698                      | 200,139                 | 400                       | 549,188        | 0           |
| **CoT**        | MATH-500    | 2025/1/24     | deepseek-r1:1.5b       | 49.4      | 100           | 4          |                                                                                                                               | 500         | 341,549                | 683                      | 857,580                 | 1,715                     | 1,199,129      | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | gpt-3.5-turbo          | 28.8      | 83.8          | 4          |                                                                                                                               | 500         | 239,902                | 480                      | 32,014                  | 64                        | 271,916        | 0.168       |
| **PoT**        | MATH-500    | 2025/2/10     | Doubao-lite-32k        | 32.6      | 68            | 4          |                                                                                                                               | 500         | 254,377                | 509                      | 48,771                  | 98                        | 303,148        | 0.0144      |
| **PoT**        | MATH-500    | 2025/2/10     | gpt-4o                 | 46.2      | 86.4          | 4          |                                                                                                                               | 500         | 241,357                | 483                      | 99,603                  | 199                       | 340,960        | 1.5994      |
| **PoT**        | MATH-500    | 2025/2/10     | Qwen2.5-72B-Instruct   | 47.2      | 82.2          | 4          |                                                                                                                               | 500         | 242,549                | 485                      | 170,823                 | 342                       | 413,372        | 0.233       |
| **PoT**        | MATH-500    | 2025/2/10     | Llama-3.3-70B-Instruct | 42.6      | 80.2          | 4          |                                                                                                                               | 500         | 253,879                | 508                      | 249,717                 | 499                       | 503,596        | 0.2839      |
| **PoT**        | MATH-500    | 2025/2/10     | Qwen2.5-7B-Instruct    | 39.6      | 74.4          | 4          |                                                                                                                               | 500         | 258,549                | 517                      | 150,263                 | 301                       | 408,812        | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | Llama-3.1-8B-Instruct  | 25.4      | 68.4          | 4          |                                                                                                                               | 500         | 253,879                | 508                      | 208,392                 | 417                       | 462,271        | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | Internllm2_5-7B        | 15        | 32.4          | 4          |                                                                                                                               | 500         | 247,883                | 496                      | 120,826                 | 242                       | 368,709        | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | Qwen2-1.5B-Instruct    | 0.8       | 2.2           | 4          |                                                                                                                               | 500         | 248,509                | 497                      | 538,361                 | 1,077                     | 786,870        | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | Qwen2-0.5B-Instruct    | 0         | 0             | 4          |                                                                                                                               | 500         | 253,549                | 507                      | 183,653                 | 367                       | 437,202        | 0           |
| **PoT**        | MATH-500    | 2025/2/10     | deepseek-r1:1.5b       | 1         | 1.6           | 4          |                                                                                                                               | 500         | 245,549                | 491                      | 785,518                 | 1,571                     | 1,031,067      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | gpt-3.5-turbo          | 40.8      | 100           | 4          | temperature=1, path_num=5                                                                                                     | 500         | 345,411                | 691                      | 705,408                 | 1,411                     | 1,050,819      | 1.2308      |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Doubao-lite-32k        | 65.8      | 99.8          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 362,390                | 725                      | 715,613                 | 1,431                     | 1,078,003      | 0.0734      |
| **SC-CoT**     | MATH-500    | 2025/2/10     | gpt-4o                 | 74.6      | 100           | 4          | temperature=1, path_num=5                                                                                                     | 500         | 345,347                | 691                      | 1,149,778               | 2,300                     | 1,495,125      | 12.3611     |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Qwen2.5-72B-Instruct   | 79.8      | 100           | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,775,395              | 3,551                    | 1,506,954               | 3,014                     | 3,282,349      | 1.8504      |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Llama-3.3-70B-Instruct | 72.4      | 100           | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,797,045              | 3,594                    | 1,368,466               | 2,737                     | 3,165,511      | 1.7845      |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Qwen2.5-7B-Instruct    | 71.2      | 100           | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,855,922              | 3,712                    | 1,299,553               | 2,599                     | 3,155,475      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Llama-3.1-8B-Instruct  | 19.8      | 99.8          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,734,545              | 3,469                    | 1,756,289               | 3,513                     | 3,490,834      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Internllm2_5-7B        | 9.2       | 97.4          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,994,983              | 3,990                    | 1,254,893               | 2,510                     | 3,249,876      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Qwen2-1.5B-Instruct    | 2         | 89.4          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,805,170              | 3,610                    | 1,333,854               | 2,668                     | 3,139,024      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | Qwen2-0.5B-Instruct    | 2.2       | 98.8          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,808,691              | 3,617                    | 988,991                 | 1,978                     | 2,797,682      | 0           |
| **SC-CoT**     | MATH-500    | 2025/2/10     | deepseek-r1:1.5b       | 46.8      | 99.2          | 4          | temperature=1, path_num=5                                                                                                     | 500         | 1,858,874              | 3,718                    | 12,109,294              | 24,219                    | 13,968,168     | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | gpt-3.5-turbo          | 23.8      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 3,708,461              | 7,417                    | 124,253                 | 249                       | 3,832,714      | 2.0406      |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Doubao-lite-32k        | 47.2      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 4,234,620              | 8,469                    | 154,046                 | 308                       | 4,388,666      | 0.186       |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | gpt-4o                 | 54        | 100           | 4          | max_steps=10                                                                                                                  | 500         | 5,834,537              | 11,669                   | 318,718                 | 637                       | 6,153,255      | 17.7735     |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Qwen2.5-72B-Instruct   | 62.8      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 5,747,268              | 11,495                   | 379,849                 | 760                       | 6,127,117      | 3.4541      |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Llama-3.3-70B-Instruct | 64.6      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 5,223,611              | 10,447                   | 418,268                 | 837                       | 5,641,879      | 3.1806      |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Qwen2.5-7B-Instruct    | 48.8      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 4,646,708              | 9,293                    | 343,532                 | 687                       | 4,990,240      | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Llama-3.1-8B-Instruct  | 28.8      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 7,486,706              | 14,973                   | 1,276,923               | 2,554                     | 8,763,629      | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Internllm2_5-7B        | 14.8      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 11,831,496             | 23,663                   | 2,354,609               | 4,709                     | 14,186,105     | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Qwen2-1.5B-Instruct    | 8.2       | 100           | 4          | max_steps=10                                                                                                                  | 500         | 8,430,774              | 16,862                   | 556,287                 | 1,113                     | 8,987,061      | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | Qwen2-0.5B-Instruct    | 0.6       | 100           | 4          | max_steps=10                                                                                                                  | 500         | 18,137,392             | 36,275                   | 1,305,048               | 2,610                     | 19,442,440     | 0           |
| **ReAct-Pro*** | MATH-500    | 2025/2/10     | deepseek-r1:1.5b       | 24.4      | 100           | 4          | max_steps=10                                                                                                                  | 500         | 20,729,970             | 41,460                   | 9,447,378               | 18,895                    | 30,177,348     | 0           |
| **ToT**        | MATH-500    | 2025/2/10     | gpt-3.5-turbo          | 9.8       | 100           | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 9,711,244              | 19,422                   | 290,523                 | 581                       | 10,001,767     | 5.2914      |
| **ToT**        | MATH-500    | 2025/2/10     | Doubao-lite-32k        | 1.2       | 94.2          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 5,338,500              | 10,677                   | 226,000                 | 452                       | 5,564,500      | 0.2371      |

| **ToT**        | MATH-500    | 2025/2/10     | gpt-4o                 | 3.2       | 100           | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 14,881,985             | 29,764                   | 360,447                 | 721                       | 15,242,432     | 40.8094     |

| **ToT**        | MATH-500    | 2025/2/10     | Qwen2.5-72B-Instruct   | 10.8      | 100           | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 15,657,730             | 31,315                   | 381,631                 | 763                       | 16,039,361     | 9.0421      |

| **ToT**        | MATH-500    | 2025/2/10     | Llama-3.3-70B-Instruct | 1.4       | 69.8          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 14,099,500             | 28,199                   | 570,000                 | 1,140                     | 14,669,500     | 8.2699      |

| **ToT**        | MATH-500    | 2025/2/10     | Qwen2.5-7B-Instruct    | 1.4       | 91.6          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 9,749,000              | 19,498                   | 418,500                 | 837                       | 10,167,500     | 0           |

| **ToT**        | MATH-500    | 2025/2/10     | Llama-3.1-8B-Instruct  | 1.8       | 90.8          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 7,729,000              | 15,458                   | 1,306,000               | 2,612                     | 9,035,000      | 0           |

| **ToT**        | MATH-500    | 2025/2/10     | Internllm2_5-7B        | 0.2       | 99            | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 7,515,000              | 15,030                   | 835,500                 | 1,671                     | 8,350,500      | 0           |

| **ToT**        | MATH-500    | 2025/2/10     | Qwen2-1.5B-Instruct    | 0.8       | 97.2          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 4,408,000              | 8,816                    | 127,000                 | 254                       | 4,535,000      | 0           |

| **ToT**        | MATH-500    | 2025/2/10     | Qwen2-0.5B-Instruct    | 0         | 96.2          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 5,590,500              | 11,181                   | 406,000                 | 812                       | 5,996,500      | 0           |

| **ToT**        | MATH-500    | 2025/2/10     | deepseek-r1:1.5b       | 0.4       | 71.6          | 4          | search_type=bfs, b=1, max_depth=6, max_steps=6, generation_n=1, evaluation_n=3, evaluation_type=vote, use_llm_completion=true | 500         | 1,831,000              | 3,662                    | 110,50