LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
Abstract: Test-time scaling (TTS) has become an effective approach for improving LLM performance by allocating additional computation during inference. However, existing TTS strategies are largely hand-crafted: researchers manually design reasoning patterns and tune heuristics by intuition, leaving much of the computation-allocation space unexplored. We propose an environment-driven framework, AutoTTS, that changes what researchers design: from individual TTS heuristics to environments where TTS strategies can be discovered automatically. The key to AutoTTS lies in environment construction: the discovery environment must make the control space tractable and provide cheap, frequent feedback for TTS search. As a concrete instantiation, we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals, where controllers decide when to branch, continue, probe, prune, or stop and can be evaluated cheaply without repeated LLM calls. We further introduce beta parameterization to make the search tractable and fine-grained execution trace feedback to improve discovery efficiency by helping the agent diagnose why a TTS program fails. Experiments on mathematical reasoning benchmarks show that the discovered strategies improve the overall accuracy-cost tradeoff over strong manually designed baselines. The discovered strategies generalize to held-out benchmarks and model scales, while the entire discovery costs only $39.9 and 160 minutes. Our data and code will be open-sourced at https://github.com/zhengkid/AutoTTS.
Explain it Like I'm 14
What this paper is about (big idea)
This paper shows a new way to make LLMs think better at the moment they answer a question, without retraining them. The trick is to spend extra "thinking time" more smartly. Instead of people hand-coding lots of rules (like "try 10 answers, then stop"), the authors build a safe testing "sandbox" where an AI agent can automatically discover better rules on its own. They call this system AutoTTS (Automatic Test-Time Scaling).
Think of it like coaching a student during a test: you can decide whether they should try more different approaches (wider) or go further with one promising approach (deeper). AutoTTS lets an AI coach learn the best ways to do that.
What questions the paper asks
The paper focuses on simple, practical questions:
- Can we automatically discover good "when to branch, continue, prune, or stop" rules for an AI's test-time thinking, instead of hand-crafting them?
- Can we make this discovery process cheap and fast?
- Will the discovered rules work well on new problems, different datasets, and different model sizes?
How they did it (in everyday language)
Here's the core idea, step by step:
The "width vs. depth" view
- Width = how many different solution attempts you try in parallel (like trying multiple ways to solve a math problem).
- Depth = how far you continue each attempt (how many steps you let each attempt develop).
Many existing methods are just different ways of moving through this width-depth space (for example, starting wide, then pruning to the best path and going deeper).
The sandbox (offline replay environment)
- Before the search begins, the team asks the underlying LLM to pre-generate many "thought paths" for each question and break them into chunks (like recording many partial tries in advance).
- This creates a "replay library" of what would happen if you continued a path or peeked at its current answer.
- Inside the sandbox, the AI coach (called a "controller") can:
- start a new attempt (branch),
- continue an existing attempt (go deeper),
- peek at a path's current guess (probe),
- stop a bad path (prune),
- and finally decide the overall answer (vote/aggregate).
- Because all the paths are pre-recorded, testing a new rule is fast and costs almost nothing: no repeated calls to the big model.
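The replay idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the `ReplaySandbox` class, its method names, and the toy `library` are hypothetical, cost is counted in chunks rather than tokens, and probing is treated as free.

```python
from collections import Counter


class ReplaySandbox:
    """Toy offline replay for one question (hypothetical names, not the paper's code).

    library[i][d] is the pre-recorded intermediate answer of path i after
    d + 1 chunks, so no LLM call is needed while a controller is evaluated.
    """

    def __init__(self, library):
        self.library = library
        self.depth = {}      # active path id -> chunks consumed so far
        self.pruned = set()  # paths that were stopped
        self.cost = 0        # chunks "generated"; probing is assumed free

    def branch(self):
        """Start the next unused pre-recorded path and consume its first chunk."""
        i = len(self.depth) + len(self.pruned)
        if i >= len(self.library):
            return None
        self.depth[i] = 1
        self.cost += 1
        return i

    def cont(self, i):
        """Go one chunk deeper on path i, if recorded chunks remain."""
        if i in self.depth and self.depth[i] < len(self.library[i]):
            self.depth[i] += 1
            self.cost += 1

    def probe(self, i):
        """Peek at the current guess of active path i."""
        return self.library[i][self.depth[i] - 1]

    def prune(self, i):
        """Stop a path; it no longer votes."""
        del self.depth[i]
        self.pruned.add(i)

    def answer(self):
        """Majority vote over the current guesses of surviving paths."""
        votes = Counter(self.probe(i) for i in self.depth)
        return votes.most_common(1)[0][0] if votes else None


# Toy replay library: three recorded paths, three chunks each.
library = [["7", "42", "42"], ["13", "13", "13"], ["42", "42", "42"]]
env = ReplaySandbox(library)
a, b, c = env.branch(), env.branch(), env.branch()
env.cont(a)    # path a's guess changes from "7" to "42"
env.prune(b)   # drop the path that disagrees
result = env.answer()
```

Because every method only reads from `library`, a new controller can be scored by replaying its decisions, which is what makes evaluation cheap in this setup.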
One simple dial (beta parameterization)
- To keep the search manageable, every candidate controller must expose only one adjustable knob, called beta (β).
- Turning β up or down makes the controller more or less "spendy" with its token budget (how much thinking it does).
- Internally, the controller turns this single knob into all its thresholds (e.g., when to probe, prune, or stop), so the search doesn't get lost in too many settings.
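One way to picture the single knob is a function that derives every internal setting from β. The sketch below is an assumption-laden illustration: the `Thresholds` fields, the numeric ranges, and the convention that β lies in [0, 1] with larger values spending more are all hypothetical choices, not values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    max_branches: int      # width cap
    max_depth: int         # depth cap, in chunks
    stop_agreement: float  # vote share needed before stopping early


def thresholds_from_beta(beta: float) -> Thresholds:
    """Map one knob to every internal setting (illustrative ranges only).

    Each setting is monotone in beta, so a larger beta never spends less.
    """
    beta = min(max(beta, 0.0), 1.0)  # clamp to the assumed [0, 1] range
    return Thresholds(
        max_branches=2 + round(beta * 62),  # 2 .. 64 branches
        max_depth=1 + round(beta * 19),     # 1 .. 20 chunks
        stop_agreement=0.5 + 0.45 * beta,   # 0.50 .. 0.95 agreement
    )


frugal = thresholds_from_beta(0.0)
spendy = thresholds_from_beta(1.0)
```

Sweeping β over a grid then yields one accuracy-versus-tokens point per value, instead of requiring a separate search over three independent thresholds.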
Rich feedback, not just a score
- After trying a controller, the system doesn't only record the final accuracy and cost; it also logs a detailed "execution trace" (which paths it explored, where it stopped, etc.).
- An explorer AI (a coding assistant) reads these traces and the history of past attempts, then edits the controller's code to fix mistakes, much like a coach reviewing game footage and tweaking strategies.
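To make "rich feedback" concrete, here is a small sketch of what a trace record and a diagnostic summary might look like. The `TraceEvent` fields and the `summarize` helper are hypothetical; the paper's actual trace format is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TraceEvent:
    step: int
    action: str                     # "branch", "continue", "probe", "prune", or "answer"
    path: Optional[int]             # which path the action touched, if any
    observed: Optional[str] = None  # probed or final answer, if any


def summarize(trace, final_answers, truth):
    """Turn a raw trace into hints an explorer agent could read.

    final_answers[i] is the answer path i reaches when fully replayed,
    which the offline library makes available for diagnosis.
    """
    pruned = [e.path for e in trace if e.action == "prune"]
    wrongly_pruned = [p for p in pruned if final_answers[p] == truth]
    return {
        "num_actions": len(trace),
        "pruned_paths": pruned,
        "pruned_but_correct": wrongly_pruned,
    }


trace = [
    TraceEvent(0, "branch", 0),
    TraceEvent(1, "branch", 1),
    TraceEvent(2, "probe", 0, observed="7"),
    TraceEvent(3, "prune", 0),
    TraceEvent(4, "answer", None, observed="13"),
]
report = summarize(trace, final_answers={0: "42", 1: "13"}, truth="42")
```

A final score alone would only say this run was wrong; the summary shows that the controller pruned the one path that would have reached the correct answer, which points at the pruning rule as the thing to fix.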
Discovery loop
- Repeat for a few rounds: propose a controller → test it in the sandbox → read feedback → improve the code.
- Finally, pick the controller that gives the best accuracy for the amount of tokens spent (best "bang for your buck").
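The loop itself is simple once the two expensive pieces are abstracted away. In this sketch, `propose` stands in for the coding agent and `evaluate` for the replay sandbox; both are hypothetical callables, the toy versions below are made up for illustration, and the selection rule (highest accuracy, ties broken by fewer tokens) is a simplification.

```python
def discover(propose, evaluate, rounds):
    """Propose -> test -> read feedback -> improve, for a fixed number of rounds.

    propose(history) returns a candidate controller; evaluate(controller)
    returns (accuracy_percent, tokens, trace). Both are supplied by the caller.
    """
    history = []
    for _ in range(rounds):
        controller = propose(history)
        accuracy, tokens, trace = evaluate(controller)
        history.append({"controller": controller, "accuracy": accuracy,
                        "tokens": tokens, "trace": trace})
    # Simplified selection: highest accuracy, ties broken by fewer tokens.
    best = max(history, key=lambda h: (h["accuracy"], -h["tokens"]))
    return best, history


def toy_propose(history):
    """Toy agent: the 'controller' is just an integer effort level."""
    if not history:
        return 1
    last = history[-1]
    # React to the trace: spend more only if the last run stopped too early.
    if last["trace"] == "stopped too early":
        return last["controller"] + 1
    return last["controller"]


def toy_evaluate(k):
    """Toy sandbox: accuracy saturates at 90% once the effort level reaches 3."""
    accuracy = min(90, 60 + 10 * k)
    trace = "stopped too early" if k < 3 else "ok"
    return accuracy, 100 * k, trace


best, history = discover(toy_propose, toy_evaluate, rounds=5)
```

In the toy run the proposer raises the effort level until the trace stops reporting early stopping, then holds steady, so the selected controller is the cheapest one that reaches the saturated accuracy.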
What they found (main results)
- Better accuracy-cost trade-off: The discovered controllers consistently beat strong hand-made strategies across math benchmarks (AIME24 for search; AIME25 and HMMT25 for held-out testing) and different model sizes (Qwen3 0.6B, 1.7B, 4B, 8B).
- Big token savings at similar accuracy: In one example setting, the discovered controller used about 69% fewer tokens than a classic method while keeping accuracy about the same.
- Higher peak performance: It didn't just save tokens; it sometimes reached a higher peak accuracy than the hand-crafted methods.
- Generalizes beyond the training setup: The controllers also worked well on:
- a different model family (DeepSeek-R1-Distill-Llama-8B),
- and a different task (GPQA Diamond, not just math).
- Cheap and fast to discover: Finding these strategies cost about $39.9 and took around 160 minutes, because evaluation used the pre-recorded replay data.
Ablation (what mattered most):
- The single β knob was crucial. Without it, the search overfits and chooses brittle, overly aggressive rules that don't generalize.
- Detailed execution traces helped the coding agent diagnose why a rule failed and how to fix it. Using only final scores made learning much weaker.
Why this matters (impact and takeaway)
- Shifts human effort: Instead of people writing lots of fragile, case-by-case rules, researchers design a good discovery environment (the sandbox), and let an AI iterate to find strong strategies. This can scale faster and adapt to new models and tasks.
- Saves compute at answer time: Better "thinking management" means fewer tokens spent for similar or better accuracy, which is good for speed, cost, and energy use.
- Practical today: The whole discovery process is inexpensive and quick thanks to the replay idea.
- Future directions: Today's sandbox focuses on width and depth. Adding richer actions (like tree search or verifier-guided checks) could unlock even better strategies. Also, testing with more open-source coding agents would make the pipeline more accessible.
In short: AutoTTS shows that with the right sandbox, AIs can learn how to guide other AIs' test-time thinking, automatically finding strategies that are both smarter and cheaper.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a concise, actionable list of what remains missing, uncertain, or unexplored in the paper, intended to guide future research:
- Scope of control space: The current environment only supports width-depth actions (branch, continue, probe, prune, stop). It does not model richer structures such as tree search, iterative self-reflection, verifier/tool use, retrieval, debate, or cross-branch communication. How to extend the replay MDP and action/state design to support these remains open.
- Offline replay fidelity: Because controllers are evaluated on pre-collected trajectories, controller actions cannot influence token generation (no closed-loop coupling between decisions and model behavior). The gap is to quantify and reduce the sim-to-real gap between offline replay performance and online deployment where actions affect the generations.
- Decoding assumptions: All replay data are collected with a fixed temperature (0.7) and chunk size (500 tokens). The sensitivity of discovered strategies to temperature, sampling method (e.g., nucleus/beam), and probe granularity (interval length) is unmeasured.
- Probe cost realism: Experiments often treat probing as free (Kprobe=0). Real systems incur nontrivial costs (latency, context growth, logprob reads). Systematically measuring performance under realistic probe costs and varying Kprobe is missing.
- Token-only cost metric: Evaluation uses total tokens, not latency, throughput, or hardware utilization. The impact on wall-clock time under parallelism constraints, batching effects, and GPU scheduling remains unexplored.
- Environment construction cost: The stated discovery cost excludes the (potentially large) one-time cost of collecting 128 trajectories per (model, problem) pair across four models. Quantifying and amortizing this upfront cost, and its scaling with dataset/model size, is not provided.
- Replay dataset size/design: There is no study of how the number of pre-sampled trajectories (N=128), trajectory diversity, or de-duplication affects discovery quality, variance, or generalization.
- Selection objective: Controllers are selected by highest accuracy on Qsearch, not via a Pareto-aware criterion. Whether multi-objective selection (e.g., hypervolume) or constrained optimization (fixed budget/latency) yields better frontiers is open.
- Beta parameterization trade-offs: Constraining controllers to a single monotone scalar β may prevent discovering non-monotonic, instance-specific, or multi-regime policies. Understanding when β is too restrictive and how to safely relax it (e.g., low-dimensional vectors, learned schedules) is an open design question.
- Budget calibration: The mapping from β to actual token usage is not calibrated to hit target budgets. Methods for per-instance budget control (e.g., satisfying hard caps or SLAs) and calibration procedures for the β-to-cost mapping are missing.
- Overfitting diagnostics: While β reduces hyperparameter overfitting, there is limited analysis of controller robustness to shifts in task distribution, replay pool composition, or model family beyond small-scale tests.
- Statistical robustness: Results lack confidence intervals, variance bars, or significance tests across the 64 replay resamplings. Establishing statistical confidence and reporting variability would strengthen claims.
- Per-instance adaptivity: The paper does not analyze how allocated tokens correlate with instance difficulty, nor whether allocation is calibrated or fair (e.g., avoiding under-allocation on hard instances). Difficulty-aware evaluation is absent.
- Failure modes and safety: There is no analysis of worst-case behaviors (e.g., catastrophic early stopping or pruning the only correct branch), nor guardrails to prevent unacceptable accuracy drops under budget pressure.
- Aggregation design: Although arbitrary aggregators Agg are allowed, the space of aggregation rules is not systematically explored (e.g., verifier-guided selection, confidence-weighted voting, learned aggregators). Ablations on aggregator choice are missing.
- Component ablations of the discovered controller: The paper identifies mechanisms (EMA momentum, shared evidence signal, alignment-aware depth, conservative abandonment) but does not ablate these components to quantify each one's contribution or interactions.
- Generalization breadth: Evaluation is concentrated on math (AIME24/25, HMMT25) and one non-math dataset (GPQA). Transfer to coding, planning, multi-hop QA, tool-using tasks, multimodal reasoning, and multilingual settings remains untested.
- Model scale/family coverage: Tests span Qwen3 up to 8B and one DeepSeek-R1-distilled model. Performance and discovery stability on larger models (e.g., 14B-70B, MoE), different architectures, and proprietary models remain unknown.
- Online deployment studies: There is no end-to-end online experiment where the controller runs with live generation calls. Measuring true runtime, cache/kv memory effects, batching, and end-user latency under load is needed.
- Robustness to prompt/domain shifts: The environment fixes prompts and decoding settings. How controllers behave under prompt variations, domain shifts, or instruction changes is unexamined.
- Replay granularity and signals: The environment records only intermediate answers at fixed intervals. Leveraging richer signals (hidden-state probes, entropy/logprob, uncertainty estimates) or adaptive probing granularity is left open.
- Learning-based discovery alternatives: The paper uses an LLM coding agent to propose controllers. Comparing against or combining with program synthesis, evolutionary search, or offline RL over the replay MDP is an open benchmark question.
- Reproducibility across agents: Discovery uses a frontier proprietary coding agent (Claude Code). Whether open-source agents can match performance, and how sensitive outcomes are to agent choice, seed, or prompt variations, is not established.
- Controller portability: Updating the base LLM or task typically requires rebuilding the replay environment. Methods to reuse, adapt, or incrementally update replay data and controllers across model upgrades are not proposed.
- Evaluation metrics: Beyond accuracy versus tokens, metrics like calibration, reliability under budget constraints, Pareto hypervolume, and area under the frontier are not reported; their inclusion could standardize comparisons.
- Ethical/compliance considerations: While broader impacts are discussed, there is no analysis of how dynamic TTS controllers might interact with safety filters, hallucination risk under pruning, or compliance constraints at test time.
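Several items above (selection objective, evaluation metrics) refer to a Pareto-aware view of accuracy and tokens. As a minimal sketch of what that means, the hypothetical helper below keeps only the non-dominated (tokens, accuracy) points from a set of runs; the numbers are made up for illustration and are not results from the paper.

```python
def pareto_frontier(points):
    """Return the non-dominated (tokens, accuracy) points, sorted by tokens.

    A point is dominated if another point uses no more tokens and reaches
    no less accuracy, and is strictly better on at least one of the two.
    """
    frontier = []
    best_acc = float("-inf")
    # Sort by tokens ascending; for equal tokens, higher accuracy first.
    for tokens, acc in sorted(points, key=lambda p: (p[0], -p[1])):
        if acc > best_acc:  # strictly better than everything cheaper
            frontier.append((tokens, acc))
            best_acc = acc
    return frontier


# Illustrative (tokens, accuracy) pairs only.
runs = [(100, 0.60), (200, 0.70), (250, 0.65), (400, 0.70), (500, 0.80)]
frontier = pareto_frontier(runs)
```

Here the runs at 250 and 400 tokens drop out because the 200-token run is at least as accurate and cheaper. Selecting by highest accuracy alone would always return the last frontier point and ignore the rest of the curve.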
Practical Applications
Below is a distilled set of practical applications enabled by the paper's AutoTTS framework, an environment-driven approach that discovers test-time scaling (TTS) controllers via offline replay, beta parameterization (a single "budget knob"), and execution-trace feedback. Each item notes target sectors, potential tools/products/workflows, and feasibility assumptions.
Immediate Applications
- Production cost-accuracy optimization for reasoning LLMs
- Sectors: software, cloud/ML platforms, finance (FinOps), enterprise AI
- What it looks like: run a one-time "Controller Discovery" job on your model+task using offline replay; deploy the discovered controller to production to prune/stop/branch adaptively; expose the beta knob to meet per-request cost/latency SLAs; monitor accuracy-token scaling curves
- Tools/workflows: MLOps pipeline steps for (1) trajectory collection, (2) discovery rounds, (3) controller registry with versions, (4) A/B and canary rollouts, (5) dashboards of scaling curves and execution traces
- Assumptions/dependencies: pre-collection of task- and model-specific trajectories/probes; production stack supports fixed-interval decoding and probing; distribution shift is manageable; probe read cost is negligible or accounted for
- "Fast vs. Thorough" user-facing knob in assistants and copilots
- Sectors: productivity apps, education/tutoring, customer support, software engineering
- What it looks like: surface beta as a simple UI control that tunes width/depth during inference to prioritize speed or accuracy per request type (e.g., "quick answer" vs. "best-effort solution")
- Tools/workflows: controller integration in inference middleware (e.g., vLLM/TGI plugin); per-session beta policies; UX prompts explaining quality-latency trade-offs
- Assumptions/dependencies: user contexts reliably map to budget preferences; controller generalizes across typical user tasks; monitoring catches under-budgeting failures
- Inference FinOps: token budget reduction without retraining
- Sectors: finance/FinOps, cloud cost management, platform engineering
- What it looks like: per-workload discovered controllers that reduce tokens 50-70% at comparable accuracy (as in math/GPQA results); budget-aware routing that increases beta on "critical queries"
- Tools/workflows: cost dashboards tied to beta settings; workload tagging to assign discovered controllers; policy-based beta assignment (e.g., P1 incidents use higher beta)
- Assumptions/dependencies: similar difficulty distribution between discovery corpus and production; cost savings outweigh the one-time discovery/collection overhead
- SLA-driven inference control (latency/energy/CPU-GPU quotas)
- Sectors: cloud, on-device/edge, mobile apps
- What it looks like: schedules that set beta based on current latency target or device power state (e.g., "low-power mode" on mobile uses lower beta)
- Tools/workflows: dynamic policy engine reading queue latency or device telemetry; fallback triggers (raise beta if early uncertainty persists)
- Assumptions/dependencies: reliable latency predictors; controller's monotone budget mapping (via beta) holds under production conditions
- Offline replay for safe A/B testing of TTS strategies
- Sectors: enterprise AI, research platforms
- What it looks like: simulate cost/accuracy curves for multiple controllers using stored branch/probe matrices, with no repeated LLM calls, before any prod exposure
- Tools/workflows: replay harness + execution trace viewers; auto-reporting of Pareto frontiers; regression alerts on held-out sets
- Assumptions/dependencies: replay dataset is representative; consistent sampling parameters (temperature, interval length) with target runtime
- Domain-tuned controllers for reasoning-heavy verticals
- Sectors: education (math tutors), legal analysis, quantitative research
- What it looks like: collect trajectories on domain benchmarks (e.g., bar exams, finance Q&A), discover bespoke controllers that learn when to branch deeper vs. prune
- Tools/workflows: benchmark-specific replay sets; controller catalogs per domain; policy gating (e.g., conservative branch abandonment in high-risk queries)
- Assumptions/dependencies: adequate, diverse replay data; validation on held-out corpora; guardrails to mitigate domain-specific failure modes
- Academic test harness for reproducible TTS research
- Sectors: academia, open-source
- What it looks like: use the replay MDP and execution-trace history to compare new controllers under identical conditions; publish scaling curves and traces for reproducibility
- Tools/workflows: open datasets of replay matrices; standardized evaluation scripts; shared controller interfaces
- Assumptions/dependencies: community adoption of replay-based evaluation; transparent reporting of sampling configs
- Controller-assisted on-device LLMs
- Sectors: mobile, embedded/edge AI
- What it looks like: apply discovered controllers to cap branches/depth on-device, reducing battery and memory pressure while keeping acceptable answer quality
- Tools/workflows: lightweight probing intervals; beta presets mapped to device mode (battery saver vs. performance)
- Assumptions/dependencies: feasible interval probing on-device; limited memory footprint for controller logic; acceptable accuracy under tighter budgets
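Several of the items above come down to a small policy that picks beta per request. The sketch below shows one hypothetical shape for such a policy; the priority labels, numeric beta values, and the `choose_beta` function are illustrative assumptions, and it presumes beta lies in [0, 1] with higher values spending more compute.

```python
def choose_beta(priority: str, low_power: bool, uncertain: bool = False) -> float:
    """Pick a budget knob for one request (all values are illustrative).

    priority  -- workload tag, e.g. "P1" for incidents, "bulk" for batch jobs
    low_power -- device or cluster is in a power- or latency-constrained mode
    uncertain -- early probes disagreed, so a fallback may raise the budget
    """
    base = {"P1": 0.9, "normal": 0.5, "bulk": 0.2}.get(priority, 0.5)
    if low_power:
        base = min(base, 0.3)        # cap spend in constrained mode
    if uncertain:
        base = min(1.0, base + 0.2)  # fallback trigger: spend a bit more
    return base


incident = choose_beta("P1", low_power=False)
battery_saver = choose_beta("normal", low_power=True)
escalated = choose_beta("normal", low_power=True, uncertain=True)
```

A policy like this is only meaningful if the controller's token usage really is monotone in beta under production conditions, which the list above flags as an assumption.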
Long-Term Applications
- Generalized controller discovery beyond width-depth
- Sectors: software, reasoning research, robotics planning
- What it looks like: environments that support tree search, verifier-guided refinement, hidden-state confidence probes, and tool-use orchestration; agents discover richer control programs
- Tools/workflows: extended action/state spaces; verifier APIs; hidden-state probing APIs; mixed discrete-continuous control search
- Assumptions/dependencies: access to model internals (logits/hidden states); robust verifiers; greater engineering to keep search tractable
- AutoTTS-as-a-Service (managed discovery platform)
- Sectors: cloud, enterprise AI
- What it looks like: hosted service for trajectory collection, discovery, and validation; delivery of signed controller artifacts with SLAs and monitoring hooks
- Tools/workflows: multi-tenant replay data vaults; governance around data privacy/IP; automated drift detection prompting re-discovery
- Assumptions/dependencies: security and compliance around storing trajectories; standard interfaces to integrate controllers into varied runtimes
- Regulatory auditing and efficiency standards for LLM inference
- Sectors: public policy, standards bodies, energy/sustainability
- What it looks like: audit using replay environments to verify claimed efficiency/accuracy trade-offs; publish standardized "tokens-per-accuracy" metrics and carbon per query
- Tools/workflows: certified replay test suites; reporting templates; third-party verification labs
- Assumptions/dependencies: policy appetite for efficiency labeling; representative audit datasets; cooperation from model vendors
- Carbon-aware and risk-aware compute governance at inference time
- Sectors: energy, sustainability, finance (risk), cloud
- What it looks like: adjust beta dynamically using grid carbon intensity, cost spikes, or task risk scores; escalate compute only when uncertainty or risk justifies it
- Tools/workflows: carbon/risk signals feeding a policy engine; audit logs of compute decisions; cost-risk trade-off analytics
- Assumptions/dependencies: reliable carbon and risk signals; clear policies for when to spend compute; organizational buy-in
- Integrated TTS controller-hardware co-design
- Sectors: semiconductor, cloud infrastructure
- What it looks like: compilers/runtimes that map discovered control patterns to hardware scheduling (batching, memory prefetch for likely-to-continue branches)
- Tools/workflows: controller-aware schedulers; profiling tools that visualize branch survival and depth progression; kernel-level optimizations for probing
- Assumptions/dependencies: access to low-level schedulers; stable controller interfaces; hardware vendor collaboration
- Safety-critical deployments with formalized compute policies
- Sectors: healthcare, legal, finance, autonomous systems
- What it looks like: controllers that encode conservative branch abandonment, alignment-aware depth allocation, and verifier gates; formal testing on domain hazards
- Tools/workflows: hazard libraries; simulation benches via replay; signed policy packs for audits
- Assumptions/dependencies: rigorous domain validation and regulation compliance; robust verifiers; acceptance of test-time compute policies in safety cases
- Multi-agent orchestration and tool-use controllers
- Sectors: enterprise automation, knowledge work, DevOps
- What it looks like: controllers that decide not only width/depth for language reasoning but when to call tools, retrieve documents, or delegate to agents under a unified budget
- Tools/workflows: orchestration frameworks with budget APIs; tool success-probes as signals; cross-agent execution traces for discovery
- Assumptions/dependencies: unified observability across tools; consistent cost accounting; new replay schemas spanning tool calls
- Standardized replay datasets and leaderboards across domains
- Sectors: academia, open benchmarks
- What it looks like: large, diverse replay banks (math, coding, legal, biomedical) enabling community controller discovery and fair comparisons
- Tools/workflows: dataset governance, licensing; reproducible pipelines; multi-domain Pareto leaderboards
- Assumptions/dependencies: data sharing agreements; privacy-preserving collection; sustained curation effort
- Real-time planning controllers for robotics and IoT
- Sectors: robotics, industrial automation, smart devices
- What it looks like: discovered policies that throttle planning depth based on evidence trends and alignment with goals; on-line budget adaptation for tight control loops
- Tools/workflows: simulation-to-replay pipelines; policy verification under latency limits; hardware-in-the-loop tests
- Assumptions/dependencies: low-latency probing; stable alignment signals; safety certification pathways
Notes on feasibility and transferability across all applications:
- The method benefits most when tasks tolerate multi-branch sampling and intermediate probes (width-depth). Extensions are needed for tasks without natural branching or with strict single-pass constraints.
- Discovery quality depends on the representativeness of the replay dataset and matching runtime settings (e.g., temperature, interval size).
- Results are strongest on math and show promising but limited transfer to non-math tasks; domain-specific validation is necessary.
- The paper relied on a strong coding agent (Claude Code). Comparable performance with open-source agents remains a research question.
Glossary
- Ablation study: An experiment that removes or alters components of a system to measure their impact on performance or efficiency. "We further conduct an ablation study to examine the key design choices including beta-controlled search space and history design, in our discovery framework."
- Accuracy-cost Pareto frontier: The set of non-dominated solutions that optimize accuracy for a given computation cost (tokens) and vice versa. "the discovered controller improves the accuracy-cost Pareto frontier over hand-crafted baselines"
- Accuracy-efficiency frontier: A curve showing the best achievable accuracy for different levels of computational efficiency. "the discovered controller consistently achieves a stronger accuracy-efficiency frontier"
- Admissible action set: The set of actions that are allowed from a given state in the control environment. "the admissible action set is $A(s_t) = \{\text{BRANCH} : \mathrm{Cost}(s_t) + 1 \le B\} \cup \{\text{CONTINUE}(i) : \ldots\} \cup \{\text{PROBE}(i) : \ldots\} \cup \{\text{PRUNE}(i) : \ldots\} \cup \{\text{ANSWER}\}$."
- Adaptive-Consistency (ASC): A test-time strategy that adaptively samples and stops when a confidence threshold is met. "ASC [2]: A parallel sampling approach that samples trajectories one by one and stop until reaching a pre-defined threshold."
- Agentic discovery: An LLM-driven process where an agent iteratively proposes and refines algorithms using feedback from executions. "fine-grained execution feedback improves agentic discovery for harness engineering."
- Aggregation rule: A function that aggregates information from explored branches/states to produce the final answer. "An aggregation rule Agg takes a state as the input and outputs the final answer."
- Beta parameterization: Constraining a controller to expose a single scalar β that deterministically sets all internal hyperparameters to make search tractable. "we introduce beta parameterization"
- Branch prefixes: Partial sequences generated along a branch up to certain probe intervals. "This directly instantiates the branch prefixes $Z_{i,1}, Z_{i,2}, \ldots$"
- Code-defined policy: A controller implemented in code that maps states (and β) to actions. "find a code-defined policy π that maps a state and a parameter β to a distribution over admissible atomic actions"
- Controller synthesis: Automatically constructing a decision policy (controller) over a defined control space. "we formulate width-depth TTS as controller synthesis over pre-collected reasoning trajectories and probe signals"
- Early-stopping self-consistency (ESC): A chunk-based approach that stops sampling when intermediate answers converge. "ESC [3]: A chunk-based hybrid approach that generates trajectories in parallel and terminates early when answer stability is detected within a sliding window."
- EMA momentum: Using an exponential moving average trend signal to decide when to halt further computation. "trend-based stopping via EMA momentum"
- Execution traces: Detailed logs of the sequence of decisions taken by a controller during evaluation. "receives feedback from scaling curves and execution traces"
- Meta-hyper-parameter: A higher-level parameter controlling lower-level hyperparameters of the algorithm. "Here, β is a meta-hyper-parameter that is used to control all hyper-parameter used in the algorithm."
- Offline replay environment: An evaluation setup that uses pre-collected trajectories and probe signals to avoid live LLM calls. "we construct an offline replay environment that moves all LLM calls prior to the discovery process"
- Parallel-Probe: A parallel reasoning method that probes branches to inform stopping, pruning, and continuation decisions. "Parallel-Probe [6]: A recent efficient parallel reasoning approach that leverages cross-branch information to dynamically decide when to stop reasoning, prune unpromising branches, or continue computation."
- Probe signals: Intermediate answers observed at specific intervals without necessarily advancing generation. "intermediate probe signals"
- Reasoning trajectories: Sequences of generated reasoning steps (e.g., chain-of-thought) for a given problem. "we pre-collect N independent reasoning trajectories from the base LLM"
- Replay MDP: A Markov Decision Process built from pre-collected data enabling deterministic, replay-based evaluation. "instantiating it over a replay MDP built from pre-collected trajectories"
- Scaling curve: A plot showing performance (e.g., accuracy) as a function of inference budget across settings of a tradeoff parameter. "we sweep across multiple betas and record the resulting scaling curve"
- Self-Consistency (SC@64): A method that samples many reasoning paths and selects the majority-voted answer. "Self-Consistency (SC@64) [1]: A vanilla parallel reasoning approach that first samples 64 reasoning trajectories and performs majority voting"
- Test-time scaling (TTS): Improving model performance by allocating more computation at inference time. "Test-time scaling (TTS) has become an effective approach for improving LLM performance by allocating additional computation during inference."
- Width-depth space: A control space where width is the number of explored branches and depth is how far each branch is developed. "A simple example is the width-depth space, where width denotes how many reasoning branches are explored and depth denotes how far each branch is developed"
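Two of the baseline ideas in the glossary, majority voting and stopping once answers agree, can be illustrated in a few lines. This is a simplified sketch: `adaptive_vote` uses a plain vote-share threshold as a stand-in for an adaptive stopping rule, and the function names, threshold, and sample answers are hypothetical rather than taken from the cited methods.

```python
from collections import Counter


def majority_vote(answers):
    """Self-consistency: return the most common final answer."""
    return Counter(answers).most_common(1)[0][0]


def adaptive_vote(answers, threshold=0.8, min_samples=3):
    """Consume answers one by one; stop once the leader's vote share
    reaches `threshold` (a simplified stand-in for adaptive stopping).

    Returns (answer, number_of_samples_used).
    """
    seen = Counter()
    for n, ans in enumerate(answers, start=1):
        seen[ans] += 1
        leader, count = seen.most_common(1)[0]
        if n >= min_samples and count / n >= threshold:
            return leader, n
    return majority_vote(answers), len(answers)


# Illustrative final answers from eight sampled paths.
samples = ["42", "42", "42", "13", "42", "7", "42", "42"]
full = majority_vote(samples)         # uses all eight samples
early, used = adaptive_vote(samples)  # stops as soon as agreement is high
```

Both rules return the same answer on this toy input, but the adaptive rule stops after three samples, which is the kind of width saving that the hand-crafted baselines aim for.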