LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

Published 8 May 2026 in cs.CL | (2605.08083v1)

Abstract: Test-time scaling (TTS) has become an effective approach for improving LLM performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.

Authors (13)

Summary

The paper introduces AutoTTS, a novel framework that recasts test-time scaling as an algorithmic search, enabling LLMs to autonomously discover improved inference controllers.
The study presents the Confidence Momentum Controller, which uses exponential moving average momentum and coupled width-depth control to balance compute allocation with accuracy.
Experiments show that the discovered controller improves the accuracy-cost tradeoff over manually designed baselines; at beta=0.5 it uses about 69.5% fewer tokens than SC@64 while maintaining comparable accuracy.

LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling (AutoTTS)

Introduction

Test-time scaling (TTS) offers a compelling mechanism for improving LLM performance by adaptively allocating computational resources during inference. Traditionally, TTS strategies, which strongly influence inference effectiveness and cost, have relied on manually crafted heuristics for branching, deepening, pruning, probing, and aggregation. However, this hand-design paradigm is inflexible, does not adapt to novel settings, and leaves a substantial portion of the computation-allocation space unexplored. The paper "LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling" (2605.08083) proposes a paradigm shift: recasting TTS as an algorithmic search problem over a structured environment, enabling automatic, agent-driven discovery of optimized TTS controllers.

Environment-Driven TTS Strategy Discovery

The core contribution is the AutoTTS framework, which shifts the human designer's role from directly encoding TTS heuristics to constructing offline, replayable environments where LLM agents can discover TTS strategies autonomously.

Environment Construction: The discovery environment abstracts TTS as controller synthesis over pre-collected reasoning trajectories (branches) and probe signals. The structured action space consists of BRANCH, CONTINUE, PROBE, PRUNE, and ANSWER, allowing agents to allocate inference budget across width (number of branches) and depth (branch extension). Feedback comes as both coarse signals (accuracy, cost) and fine-grained signals (execution traces). The environment replays candidate controllers deterministically over problem sets, eliminating repeated LLM calls and enabling affordable, reproducible evaluation.
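The replay idea can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the paper's implementation: the class name `ReplayEnv`, the `probes[b][t]` layout, free probing, and majority-vote answering are all hypothetical. Only the five action names and the 500-token chunk size come from the text.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ReplayEnv:
    """Replays pre-collected branches for one problem; no LLM calls are made."""
    # Hypothetical layout: probes[b][t] is branch b's intermediate answer after t+1 chunks.
    probes: list
    chunk_tokens: int = 500                      # tokens charged per CONTINUE step
    depth: dict = field(default_factory=dict)    # opened branch id -> chunks consumed
    pruned: set = field(default_factory=set)
    tokens_used: int = 0
    trace: list = field(default_factory=list)    # fine-grained execution trace

    def branch(self):
        """BRANCH: open the next unused pre-collected branch, or return None."""
        b = len(self.depth)
        if b >= len(self.probes):
            return None
        self.depth[b] = 0
        self.trace.append(("BRANCH", b))
        return b

    def cont(self, b):
        """CONTINUE: consume one more recorded chunk of branch b and charge its tokens."""
        if b in self.pruned or self.depth[b] >= len(self.probes[b]):
            return False
        self.depth[b] += 1
        self.tokens_used += self.chunk_tokens
        self.trace.append(("CONTINUE", b))
        return True

    def probe(self, b):
        """PROBE: read branch b's current intermediate answer (assumed free here)."""
        d = self.depth[b]
        ans = self.probes[b][d - 1] if d > 0 else None
        self.trace.append(("PROBE", b, ans))
        return ans

    def prune(self, b):
        """PRUNE: stop spending tokens on branch b and exclude it from the vote."""
        self.pruned.add(b)
        self.trace.append(("PRUNE", b))

    def answer(self):
        """ANSWER: majority vote over the current answers of surviving branches."""
        votes = Counter(
            self.probes[b][d - 1]
            for b, d in self.depth.items()
            if d > 0 and b not in self.pruned
        )
        final = votes.most_common(1)[0][0] if votes else None
        self.trace.append(("ANSWER", final))
        return final


# Toy replay data: three recorded branches with two chunks each.
env = ReplayEnv(probes=[["7", "42"], ["42", "42"], ["13", "13"]])
for _ in range(3):
    env.branch()
for b in range(3):
    env.cont(b)
    env.cont(b)
env.prune(2)
final = env.answer()
print(final, env.tokens_used)
```

Because every action only reads recorded data and appends to `trace`, rerunning the same controller on the same replay data yields the same accuracy, cost, and trace.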

Search Space Parameterization: Agents tend to introduce over-parameterized strategies, so to keep the search tractable AutoTTS enforces a strict beta parameterization: all controller hyperparameters are deterministic, monotonic functions of a single beta parameter. Beta controls the accuracy-efficiency tradeoff along a continuous axis from frugal to liberal compute allocation, which substantially reduces overfitting and search-set-specific collapse.
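A minimal sketch of what such a parameterization might look like is given below. The hyperparameter names, the numeric ranges, and the assumption that beta lies in [0, 1] are hypothetical; the only property taken from the text is that every hyperparameter is a deterministic, monotonic function of the single knob.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ControllerParams:
    max_branches: int        # width budget
    stop_confidence: float   # confidence required before answering
    prune_patience: int      # rounds a branch may disagree before it is dropped


def params_from_beta(beta):
    """Map one knob to all thresholds; each is non-decreasing in beta (illustrative values)."""
    if not 0.0 <= beta <= 1.0:
        raise ValueError("beta is assumed to lie in [0, 1] in this sketch")
    return ControllerParams(
        max_branches=4 + round(60 * beta),
        stop_confidence=0.6 + 0.35 * beta,
        prune_patience=2 + round(4 * beta),
    )


frugal = params_from_beta(0.0)
liberal = params_from_beta(1.0)
print(frugal)
print(liberal)
```

With this shape, the search agent edits the controller logic and the schedule functions, while sweeping beta traces out a single accuracy-cost curve per controller.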

Agentic Discovery Loop: The discoverer LLM (Claude Code) operates over multi-round loops: it proposes controller code, receives full feedback (execution traces plus accuracy/cost), diagnoses failure modes, and edits the candidate controller. Traces are pivotal—aggregated statistics alone are insufficient for targeted algorithmic improvement.
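The propose, evaluate, and revise cycle can be sketched as follows. This is a schematic under stated assumptions: `discovery_loop`, `toy_propose`, and `toy_evaluate` are hypothetical names, the candidate is a single integer instead of controller code, and the proposer is a fixed rule standing in for the coding agent. Selecting by accuracy with fewer tokens as a tie-breaker is also an assumption of this sketch.

```python
def discovery_loop(propose, evaluate, rounds):
    """Run propose -> evaluate -> record for a fixed number of rounds."""
    history = []  # list of (candidate, feedback) pairs visible to the proposer
    for _ in range(rounds):
        candidate = propose(history)       # in the paper: an LLM edits controller code
        feedback = evaluate(candidate)     # in the paper: offline replay, no LLM calls
        history.append((candidate, feedback))
    # Assumed selection rule: highest accuracy, ties broken by fewer tokens.
    best = max(history, key=lambda item: (item[1]["accuracy"], -item[1]["tokens"]))
    return best, history


def toy_evaluate(threshold):
    """Stand-in for replay evaluation: accuracy peaks at threshold 3, cost grows with it."""
    return {
        "accuracy": 1.0 - 0.1 * abs(threshold - 3),
        "tokens": 500 * threshold,
        "traces": [f"stopped after {threshold} chunks"],
    }


def toy_propose(history):
    """Stand-in for the coding agent: start at 1, then try the next threshold."""
    return 1 if not history else history[-1][0] + 1


best, history = discovery_loop(toy_propose, toy_evaluate, rounds=5)
print(best[0], best[1]["accuracy"])
```

The `traces` field is where fine-grained feedback would be passed back to the proposer; the toy proposer ignores it, whereas the agent described above reads it to diagnose failures.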

The Confidence Momentum Controller (CMC)

The strongest discovered controller, termed the Confidence Momentum Controller (CMC), introduces four key mechanisms—each markedly distinct from the baseline family (ASC, ESC, Parallel-Probe):

Trend-Based Stopping via EMA Momentum: Rather than relying on instantaneous confidence, CMC maintains an exponential moving average (EMA) of answer-pool confidence. Termination only occurs when the EMA is both sufficiently high and not actively declining, robustly avoiding premature halting due to transient confidence spikes.
Coupled Width-Depth Control: Widening (branch spawning) and deepening (branch extension) decisions are linked through the EMA delta. Stagnation or regression in confidence triggers further exploration (widening), whereas strong momentum suppresses new branch allocation, establishing a closed feedback loop that adaptively matches compute to problem uncertainty.
Alignment-Aware Depth Allocation: At each round, branches aligned with the current consensus ("winner" answer) receive enhanced probe budgets, focusing computation on promising lines of reasoning, but the allocation remains adaptive and non-uniform.
Conservative Branch Abandonment: Branches are abandoned only after deviating from consensus for multiple rounds, and a core set of active hypotheses is always preserved, preventing brittle convergence.

Together, these design elements yield a highly coordinated controller whose level of coupling would be difficult to reach through manual heuristic design.
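The first two mechanisms, trend-based stopping and coupled width-depth control, can be sketched in a few lines. This is not the discovered controller's code: the function names, the smoothing factor `alpha`, the thresholds, and the use of the majority-answer share as "answer-pool confidence" are all assumptions chosen for illustration.

```python
from collections import Counter


def pool_confidence(answers):
    """Share of the most common answer in the current answer pool (assumed definition)."""
    if not answers:
        return 0.0
    return Counter(answers).most_common(1)[0][1] / len(answers)


def decide(ema_prev, answers, stop_threshold=0.8, alpha=0.5, stall_eps=0.01):
    """Return (action, new_ema) from the EMA of pool confidence and its change."""
    conf = pool_confidence(answers)
    ema = alpha * conf + (1 - alpha) * ema_prev   # exponential moving average
    delta = ema - ema_prev                        # momentum signal
    if ema >= stop_threshold and delta >= 0:
        action = "STOP"      # confidence is high and not declining
    elif delta <= stall_eps:
        action = "WIDEN"     # stagnation or regression: spawn new branches
    else:
        action = "DEEPEN"    # strong momentum: extend existing branches instead
    return action, ema


# A single unanimous round after low confidence does not trigger a stop.
spike = decide(0.2, ["42", "42", "42", "42"])
# Confidence collapsing from a high EMA triggers widening.
regress = decide(0.9, ["42", "7", "13", "5"])
# Sustained agreement crosses the threshold while still rising.
settled = decide(0.8, ["42", "42", "42", "42"])
print(spike[0], regress[0], settled[0])
```

Using the same `delta` for both the stop test and the widen test is what couples the two decisions in this sketch: one signal both gates termination and steers budget between width and depth.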

Experimental Evaluation

AutoTTS is instantiated over Qwen3 models (0.6B–8B) on challenging math reasoning benchmarks (AIME24, AIME25, HMMT25) and a non-math task (GPQA-Diamond). The replay-based discovery and evaluation regime enables affordable and thorough parameter sweeps: the entire discovery loop costs only $39.9 and 160 minutes.

Numerical Performance:

The discovered controller improves the accuracy-token Pareto frontier across all model scales and datasets. For example, with beta=0.5, CMC reduces token consumption by approximately 69.5% compared to SC@64, maintaining comparable accuracy (e.g., 45.3 vs 45.2 averaged across four models).
At higher budget settings (beta=1.0), CMC further surpasses all baselines on peak accuracy in 5/8 settings.
Generalization is strong: controllers discovered solely on AIME24 outperform hand-crafted TTS algorithms on held-out AIME25 and HMMT25, and transfer well to DeepSeek-R1-Distill-Llama-8B and GPQA-Diamond.
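For readers less familiar with the terms used in these results, the sketch below shows how an accuracy-token Pareto frontier and a relative token reduction can be computed. The helper names are hypothetical and the numbers are made up for illustration; they are not results from the paper.

```python
def pareto_frontier(points):
    """Keep (tokens, accuracy) points that no other point beats on both axes."""
    frontier = []
    best_acc = float("-inf")
    # Scan from cheapest to most expensive; a point survives only if it is
    # strictly more accurate than every cheaper (or equally cheap) point.
    for tokens, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:
            frontier.append((tokens, acc))
            best_acc = acc
    return frontier


def token_reduction(baseline_tokens, new_tokens):
    """Fraction of tokens saved relative to a baseline."""
    return 1.0 - new_tokens / baseline_tokens


# Illustrative (tokens in thousands, accuracy in %) pairs, not paper data.
points = [(100, 40.0), (30, 40.0), (50, 38.0), (60, 45.0), (120, 44.0)]
frontier = pareto_frontier(points)
saving = token_reduction(1000, 305)
print(frontier, round(saving, 3))
```

A controller "improves the frontier" when its points dominate baseline points, that is, when it reaches the same accuracy with fewer tokens or higher accuracy at the same token count.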

Ablation studies show that omitting beta parameterization leads to severe generalization failures: token usage drops, but accuracy collapses, indicating that AutoTTS's search-space formulation is essential for robust discovery.

Analysis and Implications

CMC realizes a controller family whose mechanisms are non-trivially different from all prior hand-crafted and agent-generated baselines; in particular, the use of momentum-aware stopping, feedback-coupled compute allocation, and monotonic, explainable hyperparameter schedules yields strategies with robust generalization properties. Ablations further show that both execution trace feedback and strict single-parameter scheduling contribute materially to cross-benchmark robustness.

Practically, this framework changes the locus of effort from brittle, case-by-case design of TTS strategies, to the construction of richly instrumented, reusable algorithmic discovery environments. One-time search costs are amortized across model families and tasks, and the environment specification itself becomes the target of continual scientific improvement. This is a marked advancement for automated, unsupervised algorithm design.

Theoretically, casting TTS as controller synthesis, and resolving it via agentic discovery, provides a clear interface between environment abstraction and agent intelligence. The framework readily accommodates further extensions to richer computation spaces (e.g., integration with tree search, verifier-guided reasoning, or interactive verification), supporting the development of controllers with higher degrees of reasoning complexity.

Future Directions

Potential directions include:

Expanding the action/state space abstraction in the discovery environment, supporting hierarchical and multistage TTS strategies.
Generalizing the approach to domains with interaction, black-box feedback, or test-time RL adaptation.
Exploring the performance of open-source (vs. closed-source) coding LLMs as discoverers to promote reproducibility and democratization.

Conclusion

This work introduces a scalable, environment-driven agentic discovery paradigm for TTS, demonstrating that construction of the right environment enables LLM agents to autonomously synthesize controllers that systematically outperform strong hand-designed baselines. By parameterizing the search over a single monotonic beta and tightly integrating behavioral feedback, such strategies generalize across model scale, domain, and task—all at negligible marginal computational cost. This paradigm invites future TTS research to focus on environment design, heralding substantial increases in the efficiency and accessibility of LLM inference (2605.08083).

Explain it Like I'm 14

What this paper is about (big idea)

This paper shows a new way to make large language models (LLMs) think better at the moment they answer a question, without retraining them. The trick is to spend extra “thinking time” more smartly. Instead of people hand‑coding lots of rules (like “try 10 answers, then stop”), the authors build a safe testing “sandbox” where an AI agent can automatically discover better rules on its own. They call this system AutoTTS (Automatic Test‑Time Scaling).

Think of it like coaching a student during a test: you can decide whether they should try more different approaches (wider) or go further with one promising approach (deeper). AutoTTS lets an AI coach learn the best ways to do that.

What questions the paper asks

The paper focuses on simple, practical questions:

Can we automatically discover good “when to branch, continue, prune, or stop” rules for an AI’s test‑time thinking, instead of hand‑crafting them?
Can we make this discovery process cheap and fast?
Will the discovered rules work well on new problems, different datasets, and different model sizes?

How they did it (in everyday language)

Here’s the core idea, step by step:

The “width vs. depth” view

Width = how many different solution attempts you try in parallel (like trying multiple ways to solve a math problem).
Depth = how far you continue each attempt (how many steps you let each attempt develop).

Many existing methods are just different ways of moving through this width–depth space (for example, starting wide, then pruning to the best path and going deeper).

The sandbox (offline replay environment)

Before the search begins, the team asks the underlying LLM to pre‑generate many “thought paths” for each question and break them into chunks (like recording many partial tries in advance).
This creates a “replay library” of what would happen if you continued this path, or peeked at its current answer.
Inside the sandbox, the AI coach (called a “controller”) can:
- start a new attempt (branch),
- continue an existing attempt (go deeper),
- peek at a path’s current guess (probe),
- stop a bad path (prune),
- and finally decide the overall answer (vote/aggregate).
Because all the paths are pre‑recorded, testing a new rule is fast and costs almost nothing—no repeated calls to the big model.

One simple dial (beta parameterization)

To keep the search manageable, every candidate controller must expose only one adjustable knob, called beta (β).
Turning β up or down makes the controller more or less “spendy” with its token budget (how much thinking it does).
Internally, the controller turns this single knob into all its thresholds (e.g., when to probe, prune, or stop), so the search doesn’t get lost in too many settings.

Rich feedback, not just a score

After trying a controller, the system doesn’t only record the final accuracy and cost; it also logs a detailed “execution trace” (which paths it explored, where it stopped, etc.).
An explorer AI (a coding assistant) reads these traces and the history of past attempts, then edits the controller’s code to fix mistakes—much like a coach reviewing game footage and tweaking strategies.

Discovery loop

Repeat for a few rounds: propose a controller → test it in the sandbox → read feedback → improve the code.
Finally, pick the controller that gives the best accuracy for the amount of tokens spent (best “bang for your buck”).

What they found (main results)

Better accuracy–cost trade-off: The discovered controllers consistently beat strong hand‑made strategies across math benchmarks (AIME24 for search; AIME25 and HMMT25 for held‑out testing) and different model sizes (Qwen3 0.6B, 1.7B, 4B, 8B).
Big token savings at similar accuracy: In one example setting, the discovered controller used about 69% fewer tokens than a classic method while keeping accuracy about the same.
Higher peak performance: It didn’t just save tokens—it sometimes reached higher best accuracy than the hand‑crafted methods.
Generalizes beyond the training setup: The controllers also worked well on:
- a different model family (DeepSeek‑R1‑Distill‑Llama‑8B),
- and a different task (GPQA Diamond, not just math).
Cheap and fast to discover: Finding these strategies cost about $39.9 and took around 160 minutes, because evaluation used the pre‑recorded replay data.

Ablation (what mattered most):

The single β knob was crucial. Without it, the search overfits and chooses brittle, overly aggressive rules that don’t generalize.
Detailed execution traces helped the coding agent diagnose why a rule failed and how to fix it. Using only final scores made learning much weaker.

Why this matters (impact and takeaway)

Shifts human effort: Instead of people writing lots of fragile, case‑by‑case rules, researchers design a good discovery environment (the sandbox), and let an AI iterate to find strong strategies. This can scale faster and adapt to new models and tasks.
Saves compute at answer time: Better “thinking management” means fewer tokens spent for similar or better accuracy—good for speed, cost, and energy use.
Practical today: The whole discovery process is inexpensive and quick thanks to the replay idea.
Future directions: Today’s sandbox focuses on width and depth. Adding richer actions (like tree search or verifier‑guided checks) could unlock even better strategies. Also, testing with more open‑source coding agents would make the pipeline more accessible.

In short: AutoTTS shows that with the right sandbox, AIs can learn how to guide other AIs’ test‑time thinking, automatically finding strategies that are both smarter and cheaper.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, intended to guide future research:

Scope of control space: The current environment only supports width–depth actions (branch, continue, probe, prune, stop). It does not model richer structures such as tree search, iterative self-reflection, verifier/tool use, retrieval, debate, or cross-branch communication—how to extend the replay MDP and action/state design to support these remains open.
Offline replay fidelity: Because controllers are evaluated on pre-collected trajectories, controller actions cannot influence token generation (no closed-loop coupling between decisions and model behavior). The gap is to quantify and reduce the sim-to-real gap between offline replay performance and online deployment where actions affect the generations.
Decoding assumptions: All replay data are collected with a fixed temperature (0.7) and chunk size (500 tokens). The sensitivity of discovered strategies to temperature, sampling method (e.g., nucleus/beam), and probe granularity (interval length) is unmeasured.
Probe cost realism: Experiments often treat probing as free (K_probe = 0). Real systems incur nontrivial costs (latency, context growth, logprob reads). Systematically measuring performance under realistic probe costs and varying K_probe is missing.
Token-only cost metric: Evaluation uses total tokens, not latency, throughput, or hardware utilization. The impact on wall-clock time under parallelism constraints, batching effects, and GPU scheduling remains unexplored.
Environment construction cost: The stated discovery cost excludes the (potentially large) one-time cost of collecting 128 trajectories per (model, problem) pair across four models. Quantifying and amortizing this upfront cost, and its scaling with dataset/model size, is not provided.
Replay dataset size/design: There is no study of how the number of pre-sampled trajectories (N=128), trajectory diversity, or de-duplication affects discovery quality, variance, or generalization.
Selection objective: Controllers are selected by highest accuracy on Q_search, not via a Pareto-aware criterion. Whether multi-objective selection (e.g., hypervolume) or constrained optimization (fixed budget/latency) yields better frontiers is open.
Beta parameterization trade-offs: Constraining controllers to a single monotone scalar β may prevent discovering non-monotonic, instance-specific, or multi-regime policies. Understanding when β is too restrictive and how to safely relax it (e.g., low-dimensional vectors, learned schedules) is an open design question.
Budget calibration: The mapping from β to actual token usage is not calibrated to hit target budgets. Methods for per-instance budget control (e.g., satisfying hard caps or SLAs) and calibration procedures for β→cost are missing.
Overfitting diagnostics: While β reduces hyperparameter overfitting, there is limited analysis of controller robustness to shifts in task distribution, replay pool composition, or model family beyond small-scale tests.
Statistical robustness: Results lack confidence intervals, variance bars, or significance tests across the 64 replay resamplings. Establishing statistical confidence and reporting variability would strengthen claims.
Per-instance adaptivity: The paper does not analyze how allocated tokens correlate with instance difficulty, nor whether allocation is calibrated or fair (e.g., avoiding under-allocation on hard instances). Difficulty-aware evaluation is absent.
Failure modes and safety: There is no analysis of worst-case behaviors (e.g., catastrophic early stopping or pruning the only correct branch), nor guardrails to prevent unacceptable accuracy drops under budget pressure.
Aggregation design: Although arbitrary aggregators Agg are allowed, the space of aggregation rules is not systematically explored (e.g., verifier-guided selection, confidence-weighted voting, learned aggregators). Ablations on aggregator choice are missing.
Component ablations of the discovered controller: The paper identifies mechanisms (EMA momentum, shared evidence signal, alignment-aware depth, conservative abandonment) but does not ablate these components to quantify each one’s contribution or interactions.
Generalization breadth: Evaluation is concentrated on math (AIME24/25, HMMT25) and one non-math dataset (GPQA). Transfer to coding, planning, multi-hop QA, tool-using tasks, multimodal reasoning, and multilingual settings remains untested.
Model scale/family coverage: Tests span Qwen3 up to 8B and one DeepSeek-R1-distilled model. Performance and discovery stability on larger models (e.g., 14B–70B, MoE), different architectures, and proprietary models remain unknown.
Online deployment studies: There is no end-to-end online experiment where the controller runs with live generation calls. Measuring true runtime, KV-cache memory effects, batching, and end-user latency under load is needed.
Robustness to prompt/domain shifts: The environment fixes prompts and decoding settings. How controllers behave under prompt variations, domain shifts, or instruction changes is unexamined.
Replay granularity and signals: The environment records only intermediate answers at fixed intervals. Leveraging richer signals (hidden-state probes, entropy/logprob, uncertainty estimates) or adaptive probing granularity is left open.
Learning-based discovery alternatives: The paper uses an LLM coding agent to propose controllers. Comparing against or combining with program synthesis, evolutionary search, or offline RL over the replay MDP is an open benchmark question.
Reproducibility across agents: Discovery uses a frontier proprietary coding agent (Claude Code). Whether open-source agents can match performance, and how sensitive outcomes are to agent choice, seed, or prompt variations, is not established.
Controller portability: Updating the base LLM or task typically requires rebuilding the replay environment. Methods to reuse, adapt, or incrementally update replay data and controllers across model upgrades are not proposed.
Evaluation metrics: Beyond accuracy–tokens, metrics like calibration, reliability under budget constraints, Pareto hypervolume, and area under the frontier are not reported; their inclusion could standardize comparisons.
Ethical/compliance considerations: While broader impacts are discussed, there is no analysis of how dynamic TTS controllers might interact with safety filters, hallucination risk under pruning, or compliance constraints at test time.
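As one possible starting point for the budget-calibration gap listed above, a monotone β→cost mapping could in principle be inverted by bisection. The sketch below is an illustration under stated assumptions, not a method from the paper: `calibrate_beta` and the linear `toy_cost` are hypothetical, β is assumed to lie in [0, 1], and the measured cost is assumed to be non-decreasing in β.

```python
def calibrate_beta(cost_fn, budget, lo=0.0, hi=1.0, iters=30):
    """Largest beta in [lo, hi] whose cost stays within budget (assumes monotone cost_fn)."""
    if cost_fn(lo) > budget:
        raise ValueError("budget is below the cost of the most frugal setting")
    if cost_fn(hi) <= budget:
        return hi
    for _ in range(iters):
        mid = (lo + hi) / 2
        if cost_fn(mid) <= budget:
            lo = mid   # mid is affordable: search the more liberal half
        else:
            hi = mid   # mid is too expensive: search the more frugal half
    return lo


def toy_cost(beta):
    """Hypothetical average tokens per problem, measured on replay data."""
    return 2000 + 30000 * beta


beta = calibrate_beta(toy_cost, budget=17000)
print(beta, toy_cost(beta))
```

In practice `cost_fn` would be an average over replayed problems, so this would calibrate the mean budget rather than enforce a hard per-instance cap.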

Practical Applications

Below is a distilled set of practical applications enabled by the paper’s AutoTTS framework—an environment-driven approach that discovers test-time scaling (TTS) controllers via offline replay, beta parameterization (a single “budget knob”), and execution-trace feedback. Each item notes target sectors, potential tools/products/workflows, and feasibility assumptions.

Immediate Applications

Production cost–accuracy optimization for reasoning LLMs
- Sectors: software, cloud/ML platforms, finance (FinOps), enterprise AI
- What it looks like: run a one-time “Controller Discovery” job on your model+task using offline replay; deploy the discovered controller to production to prune/stop/branch adaptively; expose the beta knob to meet per-request cost/latency SLAs; monitor accuracy–token scaling curves
- Tools/workflows: MLOps pipeline steps for (1) trajectory collection, (2) discovery rounds, (3) controller registry with versions, (4) A/B and canary rollouts, (5) dashboards of scaling curves and execution traces
- Assumptions/dependencies: pre-collection of task- and model-specific trajectories/probes; production stack supports fixed-interval decoding and probing; distribution shift is manageable; probe read cost is negligible or accounted for
“Fast vs. Thorough” user-facing knob in assistants and copilots
- Sectors: productivity apps, education/tutoring, customer support, software engineering
- What it looks like: surface beta as a simple UI control that tunes width/depth during inference to prioritize speed or accuracy per request type (e.g., “quick answer” vs. “best-effort solution”)
- Tools/workflows: controller integration in inference middleware (e.g., vLLM/TGI plugin); per-session beta policies; UX prompts explaining quality–latency trade-offs
- Assumptions/dependencies: user contexts reliably map to budget preferences; controller generalizes across typical user tasks; monitoring catches under-budgeting failures
Inference FinOps: token budget reduction without retraining
- Sectors: finance/FinOps, cloud cost management, platform engineering
- What it looks like: per-workload discovered controllers that reduce tokens 50–70% at comparable accuracy (as in math/GPQA results); budget-aware routing that increases beta on “critical queries”
- Tools/workflows: cost dashboards tied to beta settings; workload tagging to assign discovered controllers; policy-based beta assignment (e.g., P1 incidents use higher beta)
- Assumptions/dependencies: similar difficulty distribution between discovery corpus and production; cost savings outweigh the one-time discovery/collection overhead
SLA-driven inference control (latency/energy/CPU-GPU quotas)
- Sectors: cloud, on-device/edge, mobile apps
- What it looks like: schedules that set beta based on current latency target or device power state (e.g., “low-power mode” on mobile uses lower beta)
- Tools/workflows: dynamic policy engine reading queue latency or device telemetry; fallback triggers (raise beta if early uncertainty persists)
- Assumptions/dependencies: reliable latency predictors; controller’s monotone budget mapping (via beta) holds under production conditions
Offline replay for safe A/B testing of TTS strategies
- Sectors: enterprise AI, research platforms
- What it looks like: simulate cost/accuracy curves for multiple controllers using stored branch/probe matrices—no repeated LLM calls—before any prod exposure
- Tools/workflows: replay harness + execution trace viewers; auto-reporting of Pareto frontiers; regression alerts on held-out sets
- Assumptions/dependencies: replay dataset is representative; consistent sampling parameters (temperature, interval length) with target runtime
Domain-tuned controllers for reasoning-heavy verticals
- Sectors: education (math tutors), legal analysis, quantitative research
- What it looks like: collect trajectories on domain benchmarks (e.g., bar exams, finance Q&A), discover bespoke controllers that learn when to branch deeper vs. prune
- Tools/workflows: benchmark-specific replay sets; controller catalogs per domain; policy gating (e.g., conservative branch abandonment in high-risk queries)
- Assumptions/dependencies: adequate, diverse replay data; validation on held-out corpora; guardrails to mitigate domain-specific failure modes
Academic test harness for reproducible TTS research
- Sectors: academia, open-source
- What it looks like: use the replay MDP and execution-trace history to compare new controllers under identical conditions; publish scaling curves and traces for reproducibility
- Tools/workflows: open datasets of replay matrices; standardized evaluation scripts; shared controller interfaces
- Assumptions/dependencies: community adoption of replay-based evaluation; transparent reporting of sampling configs
Controller-assisted on-device LLMs
- Sectors: mobile, embedded/edge AI
- What it looks like: apply discovered controllers to cap branches/depth on-device, reducing battery and memory pressure while keeping acceptable answer quality
- Tools/workflows: lightweight probing intervals; beta presets mapped to device mode (battery saver vs. performance)
- Assumptions/dependencies: feasible interval probing on-device; limited memory footprint for controller logic; acceptable accuracy under tighter budgets

Long-Term Applications

Generalized controller discovery beyond width–depth
- Sectors: software, reasoning research, robotics planning
- What it looks like: environments that support tree search, verifier-guided refinement, hidden-state confidence probes, and tool-use orchestration; agents discover richer control programs
- Tools/workflows: extended action/state spaces; verifier APIs; hidden-state probing APIs; mixed discrete-continuous control search
- Assumptions/dependencies: access to model internals (logits/hidden states); robust verifiers; greater engineering to keep search tractable
AutoTTS-as-a-Service (managed discovery platform)
- Sectors: cloud, enterprise AI
- What it looks like: hosted service for trajectory collection, discovery, and validation; delivery of signed controller artifacts with SLAs and monitoring hooks
- Tools/workflows: multi-tenant replay data vaults; governance around data privacy/IP; automated drift detection prompting re-discovery
- Assumptions/dependencies: security and compliance around storing trajectories; standard interfaces to integrate controllers into varied runtimes
Regulatory auditing and efficiency standards for LLM inference
- Sectors: public policy, standards bodies, energy/sustainability
- What it looks like: audit using replay environments to verify claimed efficiency/accuracy trade-offs; publish standardized “tokens-per-accuracy” metrics and carbon per query
- Tools/workflows: certified replay test suites; reporting templates; third-party verification labs
- Assumptions/dependencies: policy appetite for efficiency labeling; representative audit datasets; cooperation from model vendors
Carbon-aware and risk-aware compute governance at inference time
- Sectors: energy, sustainability, finance (risk), cloud
- What it looks like: adjust beta dynamically using grid carbon intensity, cost spikes, or task risk scores; escalate compute only when uncertainty or risk justifies it
- Tools/workflows: carbon/risk signals feeding a policy engine; audit logs of compute decisions; cost–risk trade-off analytics
- Assumptions/dependencies: reliable carbon and risk signals; clear policies for when to spend compute; organizational buy-in
Integrated TTS controller–hardware co-design
- Sectors: semiconductor, cloud infrastructure
- What it looks like: compilers/runtimes that map discovered control patterns to hardware scheduling (batching, memory prefetch for likely-to-continue branches)
- Tools/workflows: controller-aware schedulers; profiling tools that visualize branch survival and depth progression; kernel-level optimizations for probing
- Assumptions/dependencies: access to low-level schedulers; stable controller interfaces; hardware vendor collaboration
Safety-critical deployments with formalized compute policies
- Sectors: healthcare, legal, finance, autonomous systems
- What it looks like: controllers that encode conservative branch abandonment, alignment-aware depth allocation, and verifier gates; formal testing on domain hazards
- Tools/workflows: hazard libraries; simulation benches via replay; signed policy packs for audits
- Assumptions/dependencies: rigorous domain validation and regulation compliance; robust verifiers; acceptance of test-time compute policies in safety cases
Multi-agent orchestration and tool-use controllers
- Sectors: enterprise automation, knowledge work, DevOps
- What it looks like: controllers that decide not only width/depth for language reasoning but when to call tools, retrieve documents, or delegate to agents under a unified budget
- Tools/workflows: orchestration frameworks with budget APIs; tool success-probes as signals; cross-agent execution traces for discovery
- Assumptions/dependencies: unified observability across tools; consistent cost accounting; new replay schemas spanning tool calls
Standardized replay datasets and leaderboards across domains
- Sectors: academia, open benchmarks
- What it looks like: large, diverse replay banks (math, coding, legal, biomedical) enabling community controller discovery and fair comparisons
- Tools/workflows: dataset governance, licensing; reproducible pipelines; multi-domain Pareto leaderboards
- Assumptions/dependencies: data sharing agreements; privacy-preserving collection; sustained curation effort
Real-time planning controllers for robotics and IoT
- Sectors: robotics, industrial automation, smart devices
- What it looks like: discovered policies that throttle planning depth based on evidence trends and alignment with goals; on-line budget adaptation for tight control loops
- Tools/workflows: simulation-to-replay pipelines; policy verification under latency limits; hardware-in-the-loop tests
- Assumptions/dependencies: low-latency probing; stable alignment signals; safety certification pathways
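
One of the applications listed above proposes adjusting beta dynamically from grid carbon intensity, cost spikes, or task risk scores. A minimal sketch of such a policy follows. It is illustrative only and not taken from the paper: the name `choose_beta`, the normalized signal ranges, the weights, and the convention that a larger beta means more compute are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    # All three signals are assumed to be pre-normalized to [0, 1].
    carbon_intensity: float  # 1.0 = most carbon-intensive grid conditions
    cost_pressure: float     # 1.0 = strongest price spike
    task_risk: float         # 1.0 = highest-stakes task


def clamp(x: float, lo: float = 0.0, hi: float = 1.0) -> float:
    """Restrict x to the closed interval [lo, hi]."""
    return max(lo, min(hi, x))


def choose_beta(sig: Signals, beta_min: float = 0.0, beta_max: float = 1.0) -> float:
    """Map external signals to a beta in [beta_min, beta_max].

    Assumption: larger beta means the controller spends more compute.
    """
    # Average the two "spend less" signals into one pressure term.
    pressure = 0.5 * clamp(sig.carbon_intensity) + 0.5 * clamp(sig.cost_pressure)
    risk = clamp(sig.task_risk)
    # Compute scales with risk; pressure can remove at most half of it,
    # so a high-risk task never drops to the minimum budget.
    level = risk * (1.0 - 0.5 * pressure)
    return beta_min + level * (beta_max - beta_min)


# High risk under maximum carbon and cost pressure keeps half the range.
print(choose_beta(Signals(carbon_intensity=1.0, cost_pressure=1.0, task_risk=1.0)))  # 0.5
```

Capping the pressure discount at one half is a design choice made for this example; a real deployment would set that floor from its own policy and log each decision for audit.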

Notes on feasibility and transferability across all applications:

The method benefits most when tasks tolerate multi-branch sampling and intermediate probes (width–depth). Extensions are needed for tasks without natural branching or with strict single-pass constraints.
Discovery quality depends on the representativeness of the replay dataset and matching runtime settings (e.g., temperature, interval size).
Results are strongest on math and show promising but limited transfer to non-math tasks; domain-specific validation is necessary.
The paper relied on a strong coding agent (Claude Code). Comparable performance with open-source agents remains a research question.


Glossary

Ablation study: An experiment that removes or alters components of a system to measure their impact on performance or efficiency. "We further conduct an ablation study to examine the key design choices including beta-controlled search space and history design, in our discovery framework."
Accuracy-cost Pareto frontier: The set of non-dominated solutions that optimize accuracy for a given computation cost (tokens) and vice versa. "the discovered controller improves the accuracy-cost Pareto frontier over hand-crafted baselines"
Accuracy-efficiency frontier: A curve showing the best achievable accuracy for different levels of computational efficiency. "the discovered controller consistently achieves a stronger accuracy-efficiency frontier"
Admissible action set: The set of actions that are allowed from a given state in the control environment. "the admissible action set is $A(s_t) = \{\text{BRANCH} : \mathrm{Cost}(s_t) + 1 \le B\} \cup \{\text{CONTINUE}(i) : \ldots\} \cup \{\text{PROBE}(i) : \ldots\} \cup \{\text{PRUNE}(i) : \ldots\} \cup \{\text{ANSWER}\}$."
Adaptive-Consistency (ASC): A test-time strategy that adaptively samples and stops when a confidence threshold is met. "ASC [2]: A parallel sampling approach that samples trajectories one by one and stop until reaching a pre-defined threshold."
Agentic discovery: An LLM-driven process where an agent iteratively proposes and refines algorithms using feedback from executions. "fine-grained execution feedback improves agentic discovery for harness engineering."
Aggregation rule: A function that aggregates information from explored branches/states to produce the final answer. "An aggregation rule Agg takes a state as the input and outputs the final answer."
Beta parameterization: Constraining a controller to expose a single scalar β that deterministically sets all internal hyperparameters to make search tractable. "we introduce beta parameterization"
Branch prefixes: Partial sequences generated along a branch up to certain probe intervals. "This directly instantiates the branch prefixes Zi,1, Zi,2, ..."
Code-defined policy: A controller implemented in code that maps states (and β) to actions. "find a code-defined policy π that maps a state and a parameter β to a distribution over admissible atomic actions"
Controller synthesis: Automatically constructing a decision policy (controller) over a defined control space. "we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals"
Early-stopping self-consistency (ESC): A chunk-based approach that stops sampling when intermediate answers converge. "ESC [3]: A chunk-based hybrid approach that generates trajectories in parallel and terminates early when answer stability is detected within a sliding window."
EMA momentum: Using an exponential moving average trend signal to decide when to halt further computation. "trend-based stopping via EMA momentum"
Execution traces: Detailed logs of the sequence of decisions taken by a controller during evaluation. "receives feedback from scaling curves and execution traces"
Meta-hyper-parameter: A higher-level parameter controlling lower-level hyperparameters of the algorithm. "Here, β is a meta-hyper-parameter that is used to control all hyper-parameter used in the algorithm."
Offline replay environment: An evaluation setup that uses pre-collected trajectories and probe signals to avoid live LLM calls. "we construct an offline replay environment that moves all LLM calls prior to the discovery process"
Parallel-Probe: A parallel reasoning method that probes branches to inform stopping, pruning, and continuation decisions. "Parallel-Probe [6]: A recent efficient parallel reasoning approach that leverages cross-branch information to dynamically decide when to stop reasoning, prune unpromising branches, or continue computation."
Probe signals: Intermediate answers observed at specific intervals without necessarily advancing generation. "intermediate probe signals"
Reasoning trajectories: Sequences of generated reasoning steps (e.g., chain-of-thought) for a given problem. "we pre-collect N independent reasoning trajectories from the base LLM"
Replay MDP: A Markov Decision Process built from pre-collected data enabling deterministic, replay-based evaluation. "instantiating it over a replay MDP built from pre-collected trajectories"
Scaling curve: A plot showing performance (e.g., accuracy) as a function of inference budget across settings of a tradeoff parameter. "we sweep across multiple betas and record the resulting scaling curve"
Self-Consistency (SC@64): A method that samples many reasoning paths and selects the majority-voted answer. "Self-Consistency (SC@64) [1]: A vanilla parallel reasoning approach that first samples 64 reasoning trajectories and performs majority voting"
Test-time scaling (TTS): Improving model performance by allocating more computation at inference time. "Test-time scaling (TTS) has become an effective approach for improving LLM performance by allocating additional computation during inference."
Width-depth space: A control space where width is the number of explored branches and depth is how far each branch is developed. "A simple example is the width-depth space, where width denotes how many reasoning branches are explored and depth denotes how far each branch is developed"
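
The glossary entries for the scaling curve and the accuracy-cost Pareto frontier fit together: sweeping beta yields a set of (cost, accuracy) points, and the frontier is the non-dominated subset of those points. The sketch below shows one common way to extract that subset. The function name `pareto_frontier` and the sample numbers are made up for illustration and are not results from the paper.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost in tokens, accuracy)


def pareto_frontier(points: List[Point]) -> List[Point]:
    """Return the non-dominated points, sorted by increasing cost.

    A point is dominated if another point has cost <= its cost and
    accuracy >= its accuracy, with at least one strict inequality.
    """
    # Sort by cost ascending; break ties by accuracy descending so the
    # best point at each cost level is visited first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    frontier: List[Point] = []
    best_acc = float("-inf")
    for cost, acc in ordered:
        # Keep a point only if it beats every cheaper or equal-cost point.
        if acc > best_acc:
            frontier.append((cost, acc))
            best_acc = acc
    return frontier


# Hypothetical scaling-curve points from a beta sweep (not real data).
sweep = [(1000, 0.60), (2000, 0.70), (2500, 0.68), (4000, 0.75), (4000, 0.72)]
print(pareto_frontier(sweep))  # [(1000, 0.6), (2000, 0.7), (4000, 0.75)]
```

One controller improves the frontier over another when its points push this set toward higher accuracy at equal or lower cost.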

