# Awesome Test-time-Scaling in LLMs






Our repository, **Awesome Test-time-Scaling in LLMs**, gathers the papers on test-time scaling that are available to our current knowledge. Unlike other repositories that merely categorize papers, we decompose each paper's contributions according to the taxonomy provided by ["What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models"](https://arxiv.org/abs/2503.24235), facilitating easier understanding and comparison for readers.



Figure 1: A Visual Map and Comparison: From What to Scale to How to Scale.


## 📒 News and Updates
- **[13/Apr/2025]** 📌 The second version is released:
  1. We corrected some typos;
  2. We added "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
  3. We revised the figures and tables, e.g., the color scheme of Table 1.
- **[9/Apr/2025]** 📌 Our repository is created.
- **[31/Mar/2025]** 📌 Our initial survey is on [**arXiv**](https://arxiv.org/abs/2503.24235)!

## 📘 Introduction
As enthusiasm for scaling computation (data and parameters) in the pre-training era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: **what to scale**, **how to scale**, **where to scale**, and **how well to scale**. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.



Figure 2: Comparison of Scaling Paradigms in Pre-training and Test-time Phases.


## 🧬 Taxonomy

### 1. **What to Scale**
"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference.
- **Parallel Scaling** improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer (see the minimal sketch after this list).
- **Sequential Scaling** involves explicitly directing later computations based on intermediate steps.
- **Hybrid Scaling** exploits the complementary benefits of parallel and sequential scaling.
- **Internal Scaling** trains a model to autonomously determine, within its own parameters, how much computation to allocate for reasoning at test time, rather than relying on external human-guided strategies.
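
To make the distinction concrete, below is a minimal, hypothetical sketch of the two basic paradigms: parallel scaling as independent sampling plus majority-vote aggregation, and sequential scaling as an iterative revise loop. The `generate`, `extract_answer`, and `critique` callables are placeholders for whatever LLM backend is used; this illustrates the taxonomy, not any specific paper's method.

```python
from collections import Counter
from typing import Callable

def parallel_scale(generate: Callable[[str], str],
                   extract_answer: Callable[[str], str],
                   prompt: str,
                   n_samples: int = 16) -> str:
    """Parallel scaling: sample n candidates independently, then aggregate
    them into a single final answer by majority vote (self-consistency)."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def sequential_scale(generate: Callable[[str], str],
                     critique: Callable[[str, str], str],
                     prompt: str,
                     n_rounds: int = 3) -> str:
    """Sequential scaling: each round explicitly conditions on feedback
    about the previous draft, so later computation depends on earlier steps."""
    draft = generate(prompt)
    for _ in range(n_rounds):
        feedback = critique(prompt, draft)
        draft = generate(f"{prompt}\n\nPrevious attempt:\n{draft}\n"
                         f"Feedback:\n{feedback}\nRevised solution:")
    return draft
```

Hybrid scaling composes the two (e.g., revising each of several parallel candidates), while internal scaling moves the decision of how long to "think" into the model itself.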

### 2. **How to Scale**
- **Tuning**
- Supervised Fine-Tuning (_SFT_): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
- Reinforcement Learning (_RL_): RL can guide a model’s policy to generate longer or more accurate solutions.
- **Inference**
  - Stimulation (_STI_): Stimulation prompts or conditions the LLM to generate more and longer samples, rather than producing a single direct answer.
  - Verification (_VER_): Verification plays an important role in TTS; it can i) directly select the final output among multiple candidates under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; and iv) determine which samples to aggregate and how to aggregate them (e.g., with what weights). A hypothetical sketch combining verification with search and aggregation follows this list.
- Search (_SEA_): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve complex reasoning tasks.
- Aggregation (_AGG_): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time.
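
As a hypothetical sketch of how these inference-time components compose, the snippet below implements a step-level beam search in which a process reward model plays the verification role (criterion iii above) and a weighted vote over finished solutions plays the aggregation role. `propose_step`, `score_partial`, `is_complete`, and the beam parameters are assumed placeholders, not APIs from any surveyed paper.

```python
from typing import Callable, Dict, List, Tuple

Chain = List[str]

def prm_beam_search(propose_step: Callable[[str, Chain], List[str]],
                    score_partial: Callable[[str, Chain], float],  # process reward model (VER)
                    is_complete: Callable[[Chain], bool],
                    problem: str,
                    beam_width: int = 4,
                    max_steps: int = 8) -> List[Tuple[Chain, float]]:
    """Search (SEA): grow partial reasoning chains step by step, keeping only
    the beam_width chains the verifier scores highest at each depth."""
    beams: List[Tuple[Chain, float]] = [([], 0.0)]
    finished: List[Tuple[Chain, float]] = []
    for _ in range(max_steps):
        candidates: List[Tuple[Chain, float]] = []
        for chain, _ in beams:
            for step in propose_step(problem, chain):          # stimulation: sample next steps
                new_chain = chain + [step]
                candidates.append((new_chain, score_partial(problem, new_chain)))
        candidates.sort(key=lambda c: c[1], reverse=True)      # verification as search criterion
        finished.extend(c for c in candidates if is_complete(c[0]))
        beams = [c for c in candidates if not is_complete(c[0])][:beam_width]
        if not beams:
            break
    return finished or beams

def weighted_best_of_n(scored_answers: List[Tuple[str, float]]) -> str:
    """Aggregation (AGG): sum verifier scores per distinct answer and return the argmax."""
    totals: Dict[str, float] = {}
    for answer, score in scored_answers:
        totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=totals.get)
```

In a purely parallel setup, the same verifier can instead score whole candidate solutions and `weighted_best_of_n` replaces simple majority voting.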

### 3. **Where to Scale**
- **Reasoning**: Math, Code, Science, Game & Strategy, Medical and so on.
- **General-Purpose**: Basics, Agents, Knowledge, Open-Ended, Multi-Modal and so on.

### 4. **How Well to Scale**
- **Performance**: This dimension measures the correctness and robustness of outputs, commonly reported as Pass@1 or Pass@k in the table below (see the sketch after this list).
- **Efficiency**: This dimension captures the cost-benefit trade-offs of TTS methods.
- **Controllability**: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
- **Scalability**: Scalability quantifies how well models improve with more test-time compute (e.g., tokens or steps).
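
Many "How Well" entries in the paper table below report Pass@1 or Pass@k. For reference, a common way to compute the unbiased pass@k estimate from n sampled solutions, of which c are correct, is the following (a standard estimator, not specific to this survey):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct ones."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), evaluated as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: with 16 samples and 5 correct, pass@1 = 5/16 and pass@4 ≈ 0.82.
print(pass_at_k(16, 5, 1), pass_at_k(16, 5, 4))
```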

## 🔍 Paper Tables

The SFT, RL, STI, SEA, VER, and AGG columns break down the **How to Scale** dimension.

| Method (Paper Title) | What | SFT | RL | STI | SEA | VER | AGG | Where | How Well |
|--------|------|-----|----|-----|-----|-----|-----|-------|----------|
| Scaling LLM test-time compute optimally can be more effective than scaling model parameters | Parallel, Sequential | ✗ | ✗ | ✗ | Beam, LookAhead | Verifier | (Weighted) Best-of-N, Stepwise Aggregation | Math | Pass@1, FLOPs-Matched Evaluation |
| Multi-agent verification: Scaling test-time compute with goal verifiers | Parallel | ✗ | ✗ | Self-Repetition | ✗ | Multiple-Agent Verifiers | Best-of-N | Math, Code, General | BoN-MAV (Cons@k), Pass@1 |
| Evolving Deeper LLM Thinking | Sequential | ✗ | ✗ | Self-Refine | ✗ | Functional | ✗ | Open-Ended | Success Rate, Token Cost |
| Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models | Sequential | ✗ | ✗ | CoT + Self-Repetition | ✗ | Bandit | ✗ | Game, Sci, Math | Accuracy, Token Cost |
| START: Self-taught reasoner with tools | Parallel, Sequential | Rejection Sampling | ✗ | Hint-infer | ✗ | Tool | ✗ | Math, Code | Pass@1 |
| "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding | Sequential | ✗ | ✗ | Adaptive Injection Decoding | ✗ | ✗ | ✗ | Math, Logical, Commonsense | Accuracy |
| Chain of draft: Thinking faster by writing less | Sequential | ✗ | ✗ | Chain-of-Draft | ✗ | ✗ | ✗ | Math, Symbolic, Commonsense | Accuracy, Latency, Token Cost |
| rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Hybrid | imitation | ✗ | ✗ | MCTS | PRM | ✗ | Math | Pass@1 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Parallel, Hybrid | ✗ | ✗ | ✗ | DVTS, Beam Search | PRM | Best-of-N | Math | Pass@1, Pass@k, Majority, FLOPs |
| Tree of thoughts: Deliberate problem solving with large language models | Hybrid | ✗ | ✗ | Propose Prompt, Self-Repetition | Tree Search | Self-Evaluate | ✗ | Game, Open-Ended | Success Rate, LLM-as-a-Judge |
| Mindstar: Enhancing math reasoning in pre-trained LLMs at inference time | Hybrid | ✗ | ✗ | ✗ | LevinTS | PRM | ✗ | Math | Accuracy, Token Cost |
| Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving | Hybrid | ✗ | ✗ | ✗ | Reward Balanced Search | RM | ✗ | Math | Test Error Rate, FLOPs |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Hybrid | ✗ | ✗ | Self-Refine | Control Flow Graph | Self-Evaluate | Prompt Synthesis | Math, Code | Pass@1 |
| PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving | Parallel, Hybrid | ✗ | ✗ | MoA | ✗ | Verification Agent | Selection Agent | Math, General, Finance | Accuracy, F1 Score |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Hybrid | ✗ | ✗ | ✗ | Particle-based Monte Carlo | PRM + SSM | Particle Filtering | Math | Pass@1, Budget vs. Accuracy |
| Archon: An Architecture Search Framework for Inference-Time Techniques | Hybrid | ✗ | ✗ | MoA, Self-Repetition | ✗ | Verification Agent, Unit Testing (Ensemble) | Fusion | Math, Code, Open-Ended | Pass@1, Win Rate |
| Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search | Hybrid | ✗ | ✗ | Mixture-of-Model | AB-MCTS-(M,A) | ✗ | ✗ | Code | Pass@1, RMSLE, ROC-AUC |
| Thinking LLMs: General instruction following with thought generation | Internal, Parallel | ✗ | DPO | Think | ✗ | Judge Models | ✗ | Open-Ended | Win Rate |
| Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models | Internal, Hybrid | ✗ | DPO | Diversity Generation | MCTS | Self-Reflect | ✗ | Math | Pass@1 |
| MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving | Internal, Sequential | imitation | ✗ | MoA | ✗ | Tool | ✗ | Math | Pass@k |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Internal, Sequential | ✗ | OREO | ✗ | Beam Search | Value Function | ✗ | Math, Agent | Pass@1, Success Rate |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Internal | warmup | GRPO, Rule-Based | ✗ | ✗ | ✗ | ✗ | Math, Code, Sci | Pass@1, Cons@64, Percentile, Elo Rating, Win Rate |
| s1: Simple test-time scaling | Internal | distillation | ✗ | Budget Forcing | ✗ | ✗ | ✗ | Math, Sci | Pass@1, Control, Scaling |
| O1 Replication Journey: A Strategic Progress Report -- Part 1 | Internal | imitation | ✗ | ✗ | Journey Learning | PRM, Critique | Multi-Agents | Math | Accuracy |
| From drafts to answers: Unlocking LLM potential via aggregation fine-tuning | Internal, Parallel | imitation | ✗ | ✗ | ✗ | Fusion | ✗ | Math, Open-Ended | Win Rate |
| Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought | Internal, Hybrid | imitation | meta-RL | Think | MCTS, A* | PRM | ✗ | Math, Open-Ended | Win Rate |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Internal, Sequential | ✗ | PPO, Trajectory | Thought Template Retrieve | ✗ | ✗ | ✗ | Math | Pass@1 |
| L1: Controlling how long a reasoning model thinks with reinforcement learning | Internal | ✗ | GRPO, Length-Penalty | ✗ | ✗ | ✗ | ✗ | Math | Pass@1, Length Error |
| Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions | Internal, Hybrid | distillation, imitation | ✗ | Reflection Prompt | MCTS | Self-Critic | ✗ | Math | Pass@1, Pass@k |