Papers
Topics
Authors
Recent
Search
2000 character limit reached

What If We Allocate Test-Time Compute Adaptively?

Published 1 Feb 2026 in cs.CL | (2602.01070v1)

Abstract: Test-time compute scaling allocates inference computation uniformly, uses fixed sampling strategies, and applies verification only for reranking. In contrast, we propose a verifier-guided adaptive framework treating reasoning as iterative trajectory generation and selection. For each problem, the agent runs multiple inference iterations. In each iteration, it optionally produces a high-level plan, selects a set of reasoning tools and a compute strategy together with an exploration parameter, and then generates a candidate reasoning trajectory. A process reward model (PRM) serves as a unified control signal: within each iteration, step-level PRM scores are aggregated to guide pruning and expansion during generation, and across iterations, aggregated trajectory rewards are used to select the final response. Across datasets, our dynamic, PRM-guided approach consistently outperforms direct test-time scaling, yielding large gains on MATH-500 and several-fold improvements on harder benchmarks such as AIME24 and AMO-Bench. We characterize efficiency using theoretical FLOPs and a compute intensity metric penalizing wasted generation and tool overhead, demonstrating that verification-guided allocation concentrates computation on high-utility reasoning paths.

Summary

  • The paper presents a verifier-guided adaptive inference framework that dynamically allocates compute using a Process Reward Model (PRM) to score candidate reasoning trajectories.
  • It achieves significant empirical gains with accuracy improvements of 10% to over 20% on benchmarks like MATH-500, AIME24, and AMO-Bench while optimizing compute utilization.
  • The approach redefines LLM reasoning as an iterative decision-making process, shifting from static uniform computation to dynamic candidate-based strategies.

Adaptive Test-Time Compute Allocation

Introduction

The paper "What If We Allocate Test-Time Compute Adaptively?" (2602.01070) proposes a novel approach to inference in LLMs by considering adaptive test-time compute allocation. Traditional inference methods often apply computation uniformly, which can be inefficient, particularly for tasks requiring variable difficulty and diverse reasoning strategies. This work addresses these limitations by implementing a verifier-guided, adaptive framework where reasoning is treated as an iterative decision-making process.

Methodology

The core innovation lies in rethinking how inference time computation is allocated by using a dynamic, verifier-guided approach. The proposed system comprises multiple rounds of inference, where each iteration involves planning, tool selection, compute strategy definition, and candidate trajectory generation. In this context, a Process Reward Model (PRM) is central to guiding both intra-iteration and inter-iteration decisions. PRM scores are utilized to aggregate step-level rewards, which are crucial in deciding the final response from multiple candidate reasoning trajectories. This methodology ensures that computation is concentrated on high-utility reasoning paths, leading to performance improvements. Figure 1

Figure 1: Universal reasoning agent architecture. The system generates KK candidate trajectories {I1,,IK}\{I_1,\ldots,I_K\} scored by a PRM, with each iteration following a structured workflow.

Empirical Results

The application of this adaptive framework shows significant empirical gains over traditional static methods. Key results include improved accuracy in datasets such as MATH-500, AIME24, and AMO-Bench, with gains ranging from 10% to over 20%. The paper demonstrates that adaptive inference not only increases accuracy but also enhances compute efficiency. Two metrics, theoretical FLOPs and a compute intensity score, are introduced to evaluate efficiency, revealing that the proposed method optimizes compute utilization by focusing resources on promising trajectories. Figure 2

Figure 2

Figure 2: Llama-3.1-8B-Instruct model performance comparison.

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3

Figure 3: MATH-500: Tool usage shows model-dependent behavior with dynamic adaptation.

Discussion

The implications of this work are multifaceted. Practically, the framework can be employed in any reasoning-intensive domain where intermediate verification signals can be extracted. Theoretically, it provides a compelling case for treating reasoning within LLMs as a trajectory-level decision-making process rather than a linear generation task. The adaptive approach shifts the paradigm in LLM inference from static, uniform compute strategies to dynamic, verifier-assisted learning, promising more reliable and efficient model performance.

Future Work

Several areas are ripe for exploration following this study. Enhancing the quality and robustness of PRMs, especially for complex problems, remains a priority, as their efficacy is critical to the proposed framework. Exploring domain-specific adaptations, expanding beyond mathematical reasoning into areas such as code generation and complex decision-making, would further validate the versatility of the approach. Additionally, integrating learning mechanisms for adaptive controllers and policies could further optimize compute allocation.

Conclusion

This paper establishes a new direction in LLM inference by introducing adaptive test-time compute allocation guided by a PRM. The framework delivers substantial improvements in accuracy and compute efficiency, demonstrating the potential of adaptive strategies to enhance reasoning performance. Such innovations underscore the importance of rethinking inference methodologies to harness the full potential of LLMs in solving complex reasoning problems.

Paper to Video (Beta)

Whiteboard

Explain it Like I'm 14

Overview

This paper asks a simple question: instead of giving every problem the same amount of computer “thinking time,” what if we give easy problems less time and hard problems more time, and use a smart checker to guide that effort? The authors build a system that solves math problems by trying different reasoning paths, checking each step as it goes, and focusing its effort on the most promising paths.

Key Questions

The paper explores three main questions in easy-to-understand terms:

  • How can we use extra compute (computer effort) at test time more wisely, instead of just using the same approach for every problem?
  • Can a step-by-step checker help the model avoid bad reasoning early and steer it toward better solutions?
  • Does this adaptive approach improve accuracy and use compute more efficiently compared to standard methods?

How Did They Do It?

Think of solving a math problem like exploring different routes to a destination. The system doesn’t just pick one route and hope it works. It tries multiple routes (called “iterations”), checks the steps along the way, and chooses the best one.

What is a PRM (Process Reward Model)?

A PRM is like a careful coach or referee that looks at each reasoning step and gives it a score from 0 to 1. High scores mean the step looks correct and consistent. The system uses these scores to:

  • Prune bad branches (stop wasting time on weak paths).
  • Prefer strong branches (spend more effort where steps look solid).
  • Pick the best final answer among several tries.

The agent repeats the following steps for each problem:

  1. Planning: Make a simple high-level plan before diving into details.
  2. Tool selection: Choose helpful “tools” like chain-of-thought (step-by-step reasoning), self-reflection (critique and fix), and numeric checks (verify calculations).
  3. Compute strategy selection: Decide how to explore solutions (for example, try many options or search deeper).
  4. Answer extraction: Present the final answer clearly from the chosen reasoning path.

Importantly, this adaptivity happens without retraining the model. The system uses structured prompts (instructions) to play different roles (planner, tool selector, strategy selector, checker), all powered by the same base LLM plus the PRM.

Search strategies (with everyday analogies)

To make exploration smarter, the system uses different strategies:

  • Best-of-N: Like taking N independent shots at the problem and picking the best one.
  • Beam Search: Like following several promising routes at the same time and dropping weaker ones as you go.
  • Lookahead: Like peeking a few steps ahead down each possible path to see which next step looks most promising, then committing to that single step and repeating.

Throughout generation, the PRM scores each step. Within one try (an iteration), these scores help keep only the best partial solutions. Across multiple tries, the system picks the overall best reasoning path.

Measuring compute in simple terms

  • Theoretical FLOPs: Think of this as the total amount of tiny math operations the computer did—basically the total effort spent across all models (the problem solver, the controllers, and the checker).
  • Compute Intensity: This score tries to capture “how much useful reasoning we did” after penalizing wasted tokens and extra checker/controller overhead. Lower is better: it means the system focused effort where it mattered.

Main Findings

Here’s what the authors found when solving math problems across three datasets (MATH-500, AIME24, and AMO-Bench) using two models (Llama-3.1-8B-Instruct and Qwen-2.5-7B-Instruct):

  • Adaptive beats static: Dynamically choosing tools and strategies, guided by PRM scores, outperforms using a fixed approach for every problem.
  • Big accuracy gains:
    • On MATH-500, accuracy jumped from 43.8% to 65.4% for Llama-3.1-8B, and from 71.2% to 81.4% for Qwen-2.5-7B.
    • On AIME24, it improved from 3.3% to 10.0% (Llama) and from 6.67% to 13.3% (Qwen).
    • On AMO-Bench, it doubled from 2.0% to 4.0% for both models.
  • Difficulty matters: The biggest improvements were on easier-to-medium problems (Levels 1–4 on MATH-500). On the very hardest problems (Level 5), the checker’s signals are less reliable, so gains are smaller.
  • Better compute use: The system spends more effort on high-utility paths and less on obviously weak ones. Even when total compute goes up, it is used more effectively to deliver higher accuracy compared to simply “trying more” without guidance.

Implications and Impact

  • Smarter effort allocation: Instead of treating all problems the same, the system adapts—quickly solving easy ones and investing more in hard ones. This saves time and improves reliability.
  • Training-free and flexible: The approach doesn’t require retraining big models. It plugs in a step-checker (PRM) and uses role-based prompts, so it’s practical to deploy.
  • Beyond math: The core idea—use intermediate verification to guide reasoning and compute—can be applied to other areas like coding (unit tests as verifiers), long-answer questions (consistency checks), or safety-critical decisions (domain-specific checks).
  • Next steps: Stronger, difficulty-aware checkers could further improve results on very hard problems. Learned policies might also optimize when to stop, which tools to use, and how much to explore given strict compute or time budgets.

In short, the paper shows that “think smarter, not just more” is a winning strategy: using a step-by-step checker to adapt how much and where the model thinks leads to more accurate answers and better use of compute.

Collections

Sign up for free to add this paper to one or more collections.