# Awesome Test-time-Scaling in LLMs






Our repository, **Awesome Test-time-Scaling in LLMs**, gathers the papers on test-time scaling that are available to our current knowledge. Unlike other repositories that merely categorize papers, we decompose each paper's contributions according to the taxonomy provided by ["What, How, Where, and How Well? A Survey on Test-Time Scaling in Large Language Models"](https://arxiv.org/abs/2503.24235), facilitating easier understanding and comparison for readers.



Figure 1: A Visual Map and Comparison: From What to Scale to How to Scale.


## 📒 News and Updates
- **[13/Apr/2025]** 📌 The second version is released:
  1. We corrected some typos;
  2. We added "Evaluation" and "Agentic" tasks, which are enhanced by TTS;
  3. We revised the figures and tables, e.g., the color scheme of Table 1.
- **[9/Apr/2025]** 📌 Our repository is created.
- **[31/Mar/2025]** 📌 Our initial survey is on [**arXiv**](https://arxiv.org/abs/2503.24235)!

## 📘 Introduction
As enthusiasm for scaling computation (data and parameters) in the pre-training era gradually diminished, test-time scaling (TTS), also referred to as "test-time computing", has emerged as a prominent research focus. Recent studies demonstrate that TTS can further elicit the problem-solving capabilities of large language models (LLMs), enabling significant breakthroughs not only in reasoning-intensive tasks, such as mathematics and coding, but also in general tasks like open-ended Q&A. However, despite the explosion of recent efforts in this area, there remains an urgent need for a comprehensive survey offering a systematic understanding. To fill this gap, we propose a unified, hierarchical framework structured along four orthogonal dimensions of TTS research: **what to scale**, **how to scale**, **where to scale**, and **how well to scale**. Building upon this taxonomy, we conduct a holistic review of methods, application scenarios, and assessment aspects, and present an organized decomposition that highlights the unique contributions of individual methods within the broader TTS landscape.



Figure 2: Comparison of Scaling Paradigms in Pre-training and Test-time Phases.


## 🧬 Taxonomy

### 1. **What to Scale**
"What to scale" refers to the specific form of TTS that is expanded or adjusted to enhance an LLM’s performance during inference.
- **Parallel Scaling** improves test-time performance by generating multiple outputs in parallel and then aggregating them into a final answer (see the minimal sketch after this list).
- **Sequential Scaling** involves explicitly directing later computations based on intermediate steps.
- **Hybrid Scaling** exploits the complementary benefits of parallel and sequential scaling.
- **Internal Scaling** trains a model to autonomously determine, within its own parameters, how much computation to allocate for reasoning at test time, rather than relying on external human-guided strategies.
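
To make the distinction concrete, below is a minimal, hypothetical sketch of the two basic paradigms: parallel scaling as independent sampling plus majority-vote aggregation, and sequential scaling as an iterative revise loop. The `generate`, `extract_answer`, and `critique` callables are placeholders for whatever LLM backend is used; this illustrates the taxonomy, not any specific paper's method.

```python
from collections import Counter
from typing import Callable

def parallel_scale(generate: Callable[[str], str],
                   extract_answer: Callable[[str], str],
                   prompt: str,
                   n_samples: int = 16) -> str:
    """Parallel scaling: sample n candidates independently, then aggregate
    them into a single final answer by majority vote (self-consistency)."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    answers = [extract_answer(c) for c in candidates]
    return Counter(answers).most_common(1)[0][0]

def sequential_scale(generate: Callable[[str], str],
                     critique: Callable[[str, str], str],
                     prompt: str,
                     n_rounds: int = 3) -> str:
    """Sequential scaling: each round explicitly conditions on feedback
    about the previous draft, so later computation depends on earlier steps."""
    draft = generate(prompt)
    for _ in range(n_rounds):
        feedback = critique(prompt, draft)
        draft = generate(f"{prompt}\n\nPrevious attempt:\n{draft}\n"
                         f"Feedback:\n{feedback}\nRevised solution:")
    return draft
```

Hybrid scaling composes the two (e.g., revising each of several parallel candidates), while internal scaling moves the decision of how long to "think" into the model itself.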

### 2. **How to Scale**
- **Tuning**
- Supervised Fine-Tuning (_SFT_): by training on synthetic or distilled long CoT examples, SFT allows a model to imitate extended reasoning patterns.
- Reinforcement Learning (_RL_): RL can guide a model’s policy to generate longer or more accurate solutions.
- **Inference**
  - Stimulation (_STI_): Stimulation prompts or conditions the LLM to generate more and longer samples, rather than producing a single direct answer.
  - Verification (_VER_): Verification plays an important role in TTS; it can i) directly select the final output among multiple candidates under the Parallel Scaling paradigm; ii) guide the stimulation process and determine when to stop under the Sequential Scaling paradigm; iii) serve as the criterion in the search process; and iv) determine which samples to aggregate and how to aggregate them (e.g., with what weights). A hypothetical sketch combining verification with search and aggregation follows this list.
- Search (_SEA_): Search is a time-tested technique for retrieving relevant information from large databases, and it can also systematically explore the potential outputs of LLMs to improve complex reasoning tasks.
- Aggregation (_AGG_): Aggregation techniques consolidate multiple solutions into a final decision to enhance the reliability and robustness of model predictions at test time.
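
As a hypothetical sketch of how these inference-time components compose, the snippet below implements a step-level beam search in which a process reward model plays the verification role (criterion iii above) and a weighted vote over finished solutions plays the aggregation role. `propose_step`, `score_partial`, `is_complete`, and the beam parameters are assumed placeholders, not APIs from any surveyed paper.

```python
from typing import Callable, Dict, List, Tuple

Chain = List[str]

def prm_beam_search(propose_step: Callable[[str, Chain], List[str]],
                    score_partial: Callable[[str, Chain], float],  # process reward model (VER)
                    is_complete: Callable[[Chain], bool],
                    problem: str,
                    beam_width: int = 4,
                    max_steps: int = 8) -> List[Tuple[Chain, float]]:
    """Search (SEA): grow partial reasoning chains step by step, keeping only
    the beam_width chains the verifier scores highest at each depth."""
    beams: List[Tuple[Chain, float]] = [([], 0.0)]
    finished: List[Tuple[Chain, float]] = []
    for _ in range(max_steps):
        candidates: List[Tuple[Chain, float]] = []
        for chain, _ in beams:
            for step in propose_step(problem, chain):          # stimulation: sample next steps
                new_chain = chain + [step]
                candidates.append((new_chain, score_partial(problem, new_chain)))
        candidates.sort(key=lambda c: c[1], reverse=True)      # verification as search criterion
        finished.extend(c for c in candidates if is_complete(c[0]))
        beams = [c for c in candidates if not is_complete(c[0])][:beam_width]
        if not beams:
            break
    return finished or beams

def weighted_best_of_n(scored_answers: List[Tuple[str, float]]) -> str:
    """Aggregation (AGG): sum verifier scores per distinct answer and return the argmax."""
    totals: Dict[str, float] = {}
    for answer, score in scored_answers:
        totals[answer] = totals.get(answer, 0.0) + score
    return max(totals, key=totals.get)
```

In a purely parallel setup, the same verifier can instead score whole candidate solutions and `weighted_best_of_n` replaces simple majority voting.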

### 3. **Where to Scale**
- **Reasoning**: Math, Code, Science, Game & Strategy, Medical and so on.
- **General-Purpose**: Basics, Agents, Knowledge, Open-Ended, Multi-Modal and so on.

### 4. **How Well to Scale**
- **Performance**: This dimension measures the correctness and robustness of outputs, commonly reported as Pass@1 or Pass@k in the table below (see the sketch after this list).
- **Efficiency**: This dimension captures the cost-benefit trade-offs of TTS methods.
- **Controllability**: This dimension assesses whether TTS methods adhere to resource or output constraints, such as compute budgets or output lengths.
- **Scalability**: Scalability quantifies how well models improve with more test-time compute (e.g., tokens or steps).
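
Many "How Well" entries in the paper table below report Pass@1 or Pass@k. For reference, a common way to compute the unbiased pass@k estimate from n sampled solutions, of which c are correct, is the following (a standard estimator, not specific to this survey):

```python
import numpy as np

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: probability that at least one of k samples drawn
    without replacement from n generations is correct, given c correct ones."""
    if n - c < k:
        return 1.0
    # 1 - C(n - c, k) / C(n, k), evaluated as a numerically stable product
    return float(1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1)))

# Example: with 16 samples and 5 correct, pass@1 = 5/16 and pass@4 ≈ 0.82.
print(pass_at_k(16, 5, 1), pass_at_k(16, 5, 4))
```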

## 🔍 Paper Tables

The SFT, RL, STI, SEA, VER, and AGG columns break down the **How to Scale** dimension.

| Method (Paper Title) | What | SFT | RL | STI | SEA | VER | AGG | Where | How Well |
|--------|------|-----|----|-----|-----|-----|-----|-------|----------|
| Scaling LLM test-time compute optimally can be more effective than scaling model parameters | Parallel, Sequential | ✗ | ✗ | ✗ | Beam, LookAhead | Verifier | (Weighted) Best-of-N, Stepwise Aggregation | Math | Pass@1, FLOPs-Matched Evaluation |
| Multi-agent verification: Scaling test-time compute with goal verifiers | Parallel | ✗ | ✗ | Self-Repetition | ✗ | Multiple-Agent Verifiers | Best-of-N | Math, Code, General | BoN-MAV (Cons@k), Pass@1 |
| Evolving Deeper LLM Thinking | Sequential | ✗ | ✗ | Self-Refine | ✗ | Functional | ✗ | Open-Ended | Success Rate, Token Cost |
| Meta-reasoner: Dynamic guidance for optimized inference-time reasoning in large language models | Sequential | ✗ | ✗ | CoT + Self-Repetition | ✗ | Bandit | ✗ | Game, Sci, Math | Accuracy, Token Cost |
| START: Self-taught reasoner with tools | Parallel, Sequential | Rejection Sampling | ✗ | Hint-infer | ✗ | Tool | ✗ | Math, Code | Pass@1 |
| "Well, Keep Thinking": Enhancing LLM Reasoning with Adaptive Injection Decoding | Sequential | ✗ | ✗ | Adaptive Injection Decoding | ✗ | ✗ | ✗ | Math, Logical, Commonsense | Accuracy |
| Chain of draft: Thinking faster by writing less | Sequential | ✗ | ✗ | Chain-of-Draft | ✗ | ✗ | ✗ | Math, Symbolic, Commonsense | Accuracy, Latency, Token Cost |
| rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking | Hybrid | imitation | ✗ | ✗ | MCTS | PRM | ✗ | Math | Pass@1 |
| Can 1B LLM Surpass 405B LLM? Rethinking Compute-Optimal Test-Time Scaling | Parallel, Hybrid | ✗ | ✗ | ✗ | DVTS, Beam Search | PRM | Best-of-N | Math | Pass@1, Pass@k, Majority, FLOPs |
| Tree of thoughts: Deliberate problem solving with large language models | Hybrid | ✗ | ✗ | Propose Prompt, Self-Repetition | Tree Search | Self-Evaluate | ✗ | Game, Open-Ended | Success Rate, LLM-as-a-Judge |
| Mindstar: Enhancing math reasoning in pre-trained LLMs at inference time | Hybrid | ✗ | ✗ | ✗ | LevinTS | PRM | ✗ | Math | Accuracy, Token Cost |
| Inference Scaling Laws: An Empirical Analysis of Compute-Optimal Inference for LLM Problem-Solving | Hybrid | ✗ | ✗ | ✗ | Reward Balanced Search | RM | ✗ | Math | Test Error Rate, FLOPs |
| Reasoning-as-Logic-Units: Scaling Test-Time Reasoning in Large Language Models Through Logic Unit Alignment | Hybrid | ✗ | ✗ | Self-Refine | Control Flow Graph | Self-Evaluate | Prompt Synthesis | Math, Code | Pass@1 |
| PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving | Parallel, Hybrid | ✗ | ✗ | MoA | ✗ | Verification Agent | Selection Agent | Math, General, Finance | Accuracy, F1 Score |
| A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods | Hybrid | ✗ | ✗ | ✗ | Particle-based Monte Carlo | PRM + SSM | Particle Filtering | Math | Pass@1, Budget vs. Accuracy |
| Archon: An Architecture Search Framework for Inference-Time Techniques | Hybrid | ✗ | ✗ | MoA, Self-Repetition | ✗ | Verification Agent, Unit Testing (Ensemble) | Fusion | Math, Code, Open-Ended | Pass@1, Win Rate |
| Wider or deeper? Scaling LLM inference-time compute with adaptive branching tree search | Hybrid | ✗ | ✗ | Mixture-of-Model | AB-MCTS-(M,A) | ✗ | ✗ | Code | Pass@1, RMSLE, ROC-AUC |
| Thinking LLMs: General instruction following with thought generation | Internal, Parallel | ✗ | DPO | Think | ✗ | Judge Models | ✗ | Open-Ended | Win Rate |
| Self-Evolved Preference Optimization for Enhancing Mathematical Reasoning in Small Language Models | Internal, Hybrid | ✗ | DPO | Diversity Generation | MCTS | Self-Reflect | ✗ | Math | Pass@1 |
| MA-LoT: Multi-Agent Lean-based Long Chain-of-Thought Reasoning enhances Formal Theorem Proving | Internal, Sequential | imitation | ✗ | MoA | ✗ | Tool | ✗ | Math | Pass@k |
| Offline Reinforcement Learning for LLM Multi-Step Reasoning | Internal, Sequential | ✗ | OREO | ✗ | Beam Search | Value Function | ✗ | Math, Agent | Pass@1, Success Rate |
| DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning | Internal | warmup | GRPO, Rule-Based | ✗ | ✗ | ✗ | ✗ | Math, Code, Sci | Pass@1, Cons@64, Percentile, Elo Rating, Win Rate |
| s1: Simple test-time scaling | Internal | distillation | ✗ | Budget Forcing | ✗ | ✗ | ✗ | Math, Sci | Pass@1, Control, Scaling |
| O1 Replication Journey: A Strategic Progress Report -- Part 1 | Internal | imitation | ✗ | ✗ | Journey Learning | PRM, Critique | Multi-Agents | Math | Accuracy |
| From drafts to answers: Unlocking LLM potential via aggregation fine-tuning | Internal, Parallel | imitation | ✗ | ✗ | ✗ | Fusion | ✗ | Math, Open-Ended | Win Rate |
| Towards System 2 Reasoning in LLMs: Learning How to Think With Meta Chain-of-Thought | Internal, Hybrid | imitation | meta-RL | Think | MCTS, A* | PRM | ✗ | Math, Open-Ended | Win Rate |
| ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates | Internal, Sequential | ✗ | PPO, Trajectory | Thought Template Retrieve | ✗ | ✗ | ✗ | Math | Pass@1 |
| L1: Controlling how long a reasoning model thinks with reinforcement learning | Internal | ✗ | GRPO, Length-Penalty | ✗ | ✗ | ✗ | ✗ | Math | Pass@1, Length Error |
| Marco-o1: Towards Open Reasoning Models for Open-Ended Solutions | Internal, Hybrid | distillation, imitation | ✗ | Reflection Prompt | MCTS | Self-Critic | ✗ | Math | Pass@1, Pass@k |