https://github.com/testtimescaling/testtimescaling.github.io
"what, how, where, and how well? a survey on test-time scaling in large language models" repository
- Host: GitHub
- URL: https://github.com/testtimescaling/testtimescaling.github.io
- Owner: testtimescaling
- License: mit
- Created: 2025-04-08T02:50:44.000Z (6 months ago)
- Default Branch: main
- Last Pushed: 2025-08-04T01:11:27.000Z (2 months ago)
- Last Synced: 2025-08-04T03:38:50.676Z (2 months ago)
- Topics: awesome-lists, large-language-model, survey, test-time-scaling
- Language: HTML
- Homepage: https://arxiv.org/abs/2503.24235
- Size: 10.1 MB
- Stars: 56
- Watchers: 1
- Forks: 1
- Open Issues: 3
Metadata Files:
- Readme: README.md
- License: LICENSE
README
# Awesome Test-time-Scaling in LLMs
Our repository, **Awesome Test-time-Scaling in LLMs**, gathers the papers on test-time scaling that are available to our current knowledge. Unlike other repositories that merely categorize papers, we decompose each paper's contributions according to the taxonomy provided by ["What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models"](https://arxiv.org/abs/2503.24235), facilitating easier understanding and comparison for readers.
Figure 1: A Visual Map and Comparison: From What to Scale to How to Scale.
## News and Updates
- **[13/Apr/2025]** The second version is released:
  1. We correct some typos;
  2. We include "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
  3. We revise the figures and tables, e.g., the color of Table 1.
- **[9/Apr/2025]** Our repository is created.
- **[31/Mar/2025]** Our initial survey is on [**arXiv**](https://arxiv.org/abs/2503.24235)!

## Introduction
As enthusiasm for scaling computation (data and parameters) in the pre-training era has gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: **what to scale**, **how to scale**, **where to scale**, and **how well to scale**. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.
Figure 2: Comparison of Scaling Paradigms in Pre-training and Test-time Phases.
## Taxonomy
### 1. **What to Scale**
"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM's performance during inference.
- **Parallel Scaling** improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer (a minimal sketch follows this list).
- **Sequential Scaling** involves explicitly directing later computations based on intermediate steps.
- **Hybrid Scaling** exploits the complementary benefits of parallel and sequential scaling.
- **Internal Scaling** lets a model autonomously determine, within its own parameters, how much computation to allocate to reasoning at test time, rather than relying on external human-guided strategies.
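To make the first two forms concrete, here is a minimal, illustrative sketch of parallel scaling (sample several candidates independently, then aggregate by majority vote, as in self-consistency) versus sequential scaling (each attempt conditions on the previous one). The `sample` callable is a hypothetical stand-in for a single stochastic LLM call; it is not an API from the surveyed papers.

```python
from collections import Counter
from typing import Callable

# Placeholder type for one stochastic LLM call (temperature > 0).
LLMCall = Callable[[str], str]

def parallel_scaling(sample: LLMCall, prompt: str, n: int = 16) -> str:
    """Parallel scaling: draw n independent answers, then aggregate them.

    Majority voting (self-consistency) is used here; weighted Best-of-N
    would instead score each candidate with a verifier before aggregating.
    """
    answers = [sample(prompt) for _ in range(n)]
    answer, _count = Counter(answers).most_common(1)[0]
    return answer

def sequential_scaling(sample: LLMCall, prompt: str, rounds: int = 3) -> str:
    """Sequential scaling: later computation explicitly conditions on
    earlier intermediate results (a simple self-refinement loop)."""
    answer = sample(prompt)
    for _ in range(rounds - 1):
        answer = sample(
            f"{prompt}\n\nPrevious attempt:\n{answer}\n\nRevise and improve it."
        )
    return answer
```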
### 2. **How to Scale**

- **Tuning**
  - Supervised Fine-Tuning (_SFT_): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
  - Reinforcement Learning (_RL_): RL can guide a model's policy to generate longer or more accurate solutions.
- **Inference**
  - Stimulation (_STI_): Stimulation prompts the LLM to produce more, and longer, samples rather than generating a single answer directly.
  - Verification (_VER_): Verification plays an important role in TTS and can be adapted to: i) directly select the final output from multiple candidates, under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop, under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; and iv) determine which samples to aggregate and how to aggregate them (e.g., weights).
  - Search (_SEA_): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve performance on complex reasoning tasks.
  - Aggregation (_AGG_): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time (a weighted Best-of-N sketch follows this list).
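As a concrete illustration of how Stimulation, Verification, and Aggregation compose at inference time, below is a minimal sketch of weighted Best-of-N under the Parallel Scaling paradigm. The `generate` and `score` callables are hypothetical placeholders for an LLM sampler and a verifier (e.g., an outcome reward model); they are not APIs from any particular library.

```python
from collections import defaultdict
from typing import Callable

def weighted_best_of_n(
    generate: Callable[[str], str],      # stimulation: one stochastic LLM sample
    score: Callable[[str, str], float],  # verification: verifier / reward-model score
    prompt: str,
    n: int = 16,
) -> str:
    """Weighted Best-of-N: group identical final answers, sum their verifier
    scores, and return the highest-weighted answer (aggregation).

    Plain Best-of-N would simply return the single highest-scoring sample.
    """
    weights: dict[str, float] = defaultdict(float)
    for _ in range(n):
        answer = generate(prompt)
        weights[answer] += score(prompt, answer)
    return max(weights, key=weights.get)
```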
### 3. **Where to Scale**

- **Reasoning**: Math, Code, Science, Game & Strategy, Medical, and so on.
- **General-Purpose**: Basics, Agents, Knowledge, Open-Ended, Multi-Modal, and so on.
### 4. **How Well to Scale**

- **Performance**: This dimension measures the correctness and robustness of outputs, commonly via metrics such as Pass@1 and Pass@k (see the sketch after this list).
- **Efficiency**: This dimension captures the cost-benefit trade-offs of TTS methods.
- **Controllability**: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
- **Scalability**: This dimension quantifies how well models improve with more test-time compute (e.g., tokens or steps).
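Many of the entries below report Pass@1 or Pass@k as the performance metric. For reference, a common way to compute the unbiased Pass@k estimator (popularized by code-generation evaluations and widely reused in TTS studies) is sketched here; this is general background, not code from the surveyed repositories.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k: probability that at least one of k samples is correct,
    given n generated samples of which c are correct (k <= n)."""
    if n - c < k:  # every size-k subset contains at least one correct sample
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 16 samples per problem, 5 of them correct.
print(pass_at_k(n=16, c=5, k=1))  # 0.3125
print(pass_at_k(n=16, c=5, k=8))  # ~0.9872
```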
## Paper Tables

| Method (Paper Title) | What | How | | | | | | Where | How Well |
|----------------------|------|-----|----|-----|-----|-----|-----|-------|----------|
| | | SFT | RL | STI | SEA | VER | AGG | | |
| Scaling LLM test-time compute optimally can be more effective than scaling model parameters | Parallel, Sequential | ❌ | ❌ | ❌ | Beam, LookAhead | Verifier | (Weighted) Best-of-N, Stepwise Aggregation | Math | Pass@1, FLOPs-Matched Evaluation |
| Multi-agent verification: Scaling test-time compute with goal verifiers | Parallel | ❌ | ❌ | Self-Repetition | ❌ | Multiple-Agent Verifiers | Best-of-N | Math, Code, General | BoN-MAV (Cons@k), Pass@1 |
| Evolving Deeper LLM Thinking | Sequential | ❌ | ❌ | Self-Refine | ❌ | Functional | ❌ | Open-Ended | Success Rate, Token Cost |
| Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models | Sequential | ❌ | ❌ | CoT + Self-Repetition | ❌ | Bandit | ❌ | Game, Sci, Math | Accuracy, Token Cost |
| START: Self-taught reasoner with tools | Parallel, Sequential | Rejection Sampling | ❌ | Hint-infer | ❌ | Tool | ❌ | Math, Code | Pass@1 |
| "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding | Sequential | ❌ | ❌ | Adaptive Injection Decoding | ❌ | ❌ | ❌ | Math, Logical, Commonsense | Accuracy |
| Chain of draft: Thinking faster by writing less | Sequential | ❌ | ❌ | Chain-of-Draft | ❌ | ❌ | ❌ | Math, Symbolic, Commonsense | Accuracy, Latency, Token Cost |
| rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Hybrid | imitation | ❌ | ❌ | MCTS | PRM | ❌ | Math | Pass@1 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Parallel, Hybrid | ❌ | ❌ | ❌ | DVTS, Beam Search | PRM | Best-of-N | Math | Pass@1, Pass@k, Majority, FLOPs |
| Tree of thoughts: Deliberate problem solving with large language models | Hybrid | ❌ | ❌ | Propose Prompt, Self-Repetition | Tree Search | Self-Evaluate | ❌ | Game, Open-Ended | Success Rate, LLM-as-a-Judge |
| Mindstar: Enhancing math reasoning in pre-trained LLMs at inference time | Hybrid | ❌ | ❌ | ❌ | LevinTS | PRM | ❌ | Math | Accuracy, Token Cost |
| Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving | Hybrid | ❌ | ❌ | ❌ | Reward Balanced Search | RM | ❌ | Math | Test Error Rate, FLOPs |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Hybrid | ❌ | ❌ | Self-Refine | Control Flow Graph | Self-Evaluate | Prompt Synthesis | Math, Code | Pass@1 |
| PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving | Parallel, Hybrid | ❌ | ❌ | MoA | ❌ | Verification Agent | Selection Agent | Math, General, Finance | Accuracy, F1 Score |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Hybrid | ❌ | ❌ | ❌ | Particle-based Monte Carlo | PRM + SSM | Particle Filtering | Math | Pass@1, Budget vs. Accuracy |
| Archon: An Architecture Search Framework for Inference-Time Techniques | Hybrid | ❌ | ❌ | MoA, Self-Repetition | ❌ | Verification Agent, Unit Testing (Ensemble) | Fusion | Math, Code, Open-Ended | Pass@1, Win Rate |
| Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search | Hybrid | ❌ | ❌ | Mixture-of-Model | AB-MCTS-(M,A) | ❌ | ❌ | Code | Pass@1, RMSLE, ROC-AUC |
| Thinking LLMs: General instruction following with thought generation | Internal, Parallel | ❌ | DPO | Think | ❌ | Judge Models | ❌ | Open-Ended | Win Rate |
| Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models | Internal, Hybrid | ❌ | DPO | Diversity Generation | MCTS | Self-Reflect | ❌ | Math | Pass@1 |
| MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving | Internal, Sequential | imitation | ❌ | MoA | ❌ | Tool | ❌ | Math | Pass@k |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Internal, Sequential | ❌ | OREO | ❌ | Beam Search | Value Function | ❌ | Math, Agent | Pass@1, Success Rate |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Internal | warmup | GRPO, Rule-Based | ❌ | ❌ | ❌ | ❌ | Math, Code, Sci | Pass@1, cons@64, Percentile, Elo Rating, Win Rate |
| s1: Simple test-time scaling | Internal | distillation | ❌ | Budget Forcing | ❌ | ❌ | ❌ | Math, Sci | Pass@1, Control, Scaling |
| O1 Replication Journey: A Strategic Progress Report -- Part 1 | Internal | imitation | ❌ | ❌ | Journey Learning | PRM, Critique | Multi-Agents | Math | Accuracy |
| From drafts to answers: Unlocking LLM potential via aggregation fine-tuning | Internal, Parallel | imitation | ❌ | ❌ | ❌ | Fusion | ❌ | Math, Open-Ended | Win Rate |
| Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought | Internal, Hybrid | imitation | meta-RL | Think | MCTS, A* | PRM | ❌ | Math, Open-Ended | Win Rate |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Internal, Sequential | ❌ | PPO, Trajectory | Thought Template Retrieve | ❌ | ❌ | ❌ | Math | Pass@1 |
| L1: Controlling how long a reasoning model thinks with reinforcement learning | Internal | ❌ | GRPO, Length-Penalty | ❌ | ❌ | ❌ | ❌ | Math | Pass@1, Length Error |
| Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions | Internal, Hybrid | distillation, imitation | ❌ | Reflection Prompt | MCTS | Self-Critic | ❌ | Math | Pass@1, Pass@k |