SCATR: Simple Calibrated Test-Time Ranking

Published 16 Apr 2026 in cs.LG and cs.AI | (2604.16535v2)

Abstract: Test-time scaling (TTS) improves LLMs by allocating additional compute at inference time. In practice, TTS is often achieved through parallel scaling: generating multiple candidate responses and selecting the best via a Best-of-N (BoN) strategy. Its effectiveness therefore hinges on the scoring function. Learned scorers such as process reward models (PRMs) can be strong, but they are expensive to train and run. Lightweight confidence heuristics based on token log-probabilities are much cheaper, yet we find that they often perform substantially worse. To improve on lightweight confidence heuristics without incurring the full cost of stronger learned scorers, we introduce SCATR, a simple and efficient BoN ranking method that learns a lightweight scorer from a small calibration set using hidden representations from the base model. Across coding and mathematical reasoning benchmarks, SCATR improves over prior confidence-based baselines by up to 9%. Relative to LoRA fine-tuning on the same calibration data, it achieves comparable accuracy with up to 8000x fewer trainable parameters and much lower compute, reducing training and inference latency by up to 150x and 1000x, respectively. SCATR is also competitive with strong PRM baselines, and in several settings improves accuracy by up to 7.8% on math and 4.2% on coding while enabling up to 1000x faster inference. Overall, SCATR offers a strong accuracy-efficiency trade-off for scalable test-time selection.

Abstract PDF Upgrade to Chat

Authors (6)

Summary

The paper introduces SCaTR, a novel framework that uses penultimate-layer embeddings to calibrate and enhance best-of-N candidate selection in LLMs with up to 9.1% accuracy improvement in math and 6.1% in coding.
It employs a two-phase approach with a calibration phase training a shallow MLP, achieving comparable performance to fine-tuning with 700–1,000× fewer parameters and significant speedups.
Empirical results show that SCaTR reliably outperforms traditional token-based confidence heuristics, closing up to 76.8% of the gap towards oracle selection across diverse domains.

SCaTR: Simple Calibrated Test-Time Ranking for Best-of- $N$ Selection

Motivation and Problem Statement

Test-time scaling (TTS) has become a primary methodology for boosting LLM output quality by allocating additional compute during inference, typically via parallel rollout: generating $N$ candidate responses and selecting the best. The effectiveness of parallel TTS hinges on the scoring function used for Best-of- $N$ (BoN) selection. While learned selectors like process reward models (PRMs) deliver strong performance, their deployment cost is prohibitive; conversely, cheap confidence heuristics based on token probabilities are computationally efficient but empirically unreliable, particularly in open-ended domains (code, free-form generation). Existing confidence-based metrics consistently perform close to random selection, failing to reliably discriminate correct from incorrect solutions (Figure 1).

Figure 1: Tail-aggregated token uncertainty metrics versus random and SCaTR: confidence-based metrics provide limited signal for effective response selection.

Recent evidence suggests that hidden states, rather than superficial token-level signals, capture richer information about the underlying reasoning trajectory and output quality. The key challenge addressed by this paper is whether a lightweight, calibratable scoring model leveraging hidden representations can robustly outperform confidence heuristics, match strong learned selectors, and maintain high efficiency.

Methodology: SCaTR Framework

SCaTR proposes a two-phase approach:

Calibration Phase: For a given LLM, candidate responses to prompts are generated and labeled for correctness (via domain-specific evaluators). The penultimate layer's hidden representation for the final non-padding token of each response is extracted. A shallow MLP scoring function is trained to predict correctness from these representations, using just a few hundred calibration problems per model.
Inference Phase: At test time, the LLM generates multiple candidates. For each, the scoring model evaluates the penultimate-layer embedding and assigns a probability of correctness. The best candidate is selected either by maximizing the score (code/free-form), or by weighted majority vote (definite-answer domains, e.g. math).

This design enables model- and domain-specific calibration with minimal computational overhead—training the lightweight scorer is orders of magnitude faster than fine-tuning or PRM training, and inference incurs negligible latency.

Figure 2: SCaTR pipeline: generation of candidate responses, extraction of penultimate-layer embeddings, scoring via calibration-trained MLP, and best-of- $N$ selection.

Empirical Results

Accuracy Versus Baselines

SCaTR is evaluated across multiple LLMs (OLMo-2, Qwen3, GPT-OSS) ranging from 1.7B to 30B parameters, and both coding and mathematical reasoning benchmarks. Across all settings, SCaTR substantially outperforms confidence-based heuristics, achieving up to 9.1% improvement in accuracy for math and 6.1% for coding.

Figure 3: SCaTR consistently matches or exceeds the strongest confidence-based selection, yielding up to 9.1-point improvement on mathematical reasoning and 6.1-point gain on coding.

SCaTR bridges a substantial fraction of the gap toward oracle selection (choosing any correct candidate), closing up to 76.8% of the gap in the best cases.

Efficiency and Accuracy Trade-Off

Relative to LoRA fine-tuning on the same data, SCaTR attains comparable performance with up to 8,000 $\times$ fewer trainable parameters, and training/inference speedups of up to 3,500 $\times$ and 1,000 $\times$ , respectively. Compared with PRMs (e.g., ReasonFlux-1.5B), SCaTR matches or outperforms them in accuracy on several datasets, but requires 700–1,000 $\times$ fewer parameters and delivers orders-of-magnitude faster inference (Figure 4).

Figure 4: SCaTR remains competitive across rollout budget sizes and steadily improves as more candidates are available.

Absolute accuracy gains relative to PRMs reach up to 7.8 points on math and 4.2 points on coding benchmarks. When combined with weighted majority vote, SCaTR further improves performance if a canonical answer is present.

Robustness and Ablation Studies

Hidden Representation Choice

Performance ablation over layer selection shows penultimate-layer embeddings are robust, with low variance (Figure 5). Other deep layer embeddings also deliver comparable results, suggesting flexibility in representational choice.

Figure 5: Left: Penultimate-layer embeddings exhibit low variance and robust performance. Right: Larger calibration set improves accuracy and reduces variance.

Calibration Set Size

Reduced calibration size degrades accuracy gracefully; on larger datasets, SCaTR's performance plateaus, indicating data efficiency.

Scoring Architecture

More sophisticated scoring models (ensemble, transformer-based, multi-layer) do not afford consistent gains over the simple penultimate-layer MLP, validating the simplicity and efficiency of SCaTR’s default design.

Uncertainty Metrics: Empirical Failure

Extensive comparative analyses establish that confidence-based metrics (mean, median, variance, probability gap, entropy) aggregated by various strategies—over sequence, tail, or low-confidence token groups—consistently fail to yield discriminatory power, and never reliably outperform random selection (Figures 7, 8, 9).

Figure 6: Confidence-based metrics behave similarly to random selection and remain far below oracle performance.

Figure 7: Different aggregation strategies yield similar performance, offering no clear criterion for confidence-based selection.

Figure 8: SCaTR outperforms confidence-based uncertainty metrics across all models in the coding domain.

Implications and Future Directions

The results validate that lightweight calibration models trained on hidden states provide scalable, efficient response selection for TTS, enabling practical deployment in compute-sensitive scenarios. This is of immediate significance for open-weight LLMs, where access to hidden representations is feasible. The findings challenge the utility of fixed confidence heuristics and reinforce the value of model-internal signals for downstream selection tasks.

Given its efficiency, SCaTR opens the door for domain-adaptive and model-specific selection pipelines requiring minimal supervision. Extensions to multi-turn reasoning, use of richer internal signals (token-level states, attention maps), or hybrid approaches combining lightweight scorers with PRMs for fallback in hard cases are promising future avenues.

Conclusion

SCaTR establishes a strong accuracy-efficiency tradeoff for scalable test-time candidate selection in LLMs. By exploiting hidden representations and lightweight calibration, it significantly outperforms confidence-based heuristics and matches strong learned selectors while maximizing efficiency. The approach is robust, data-efficient, and model-adaptive, and its empirical findings have implications for both theory and practical deployment of scalable inference pipelines. Future work will explore its extension to multi-turn and complex reasoning, as well as richer utilization of internal model dynamics.

Markdown Report Issue