
Adaptive Temperature Scaling (ATS)

Updated 23 January 2026
  • Adaptive Temperature Scaling is a calibration technique that replaces a single global temperature with input-specific adjustments to better calibrate prediction confidences.
  • It leverages uncertainty proxies such as logit margins and predictive entropy to tailor temperature values per sample or context for improved reliability.
  • ATS has demonstrated robust performance across diverse applications including image classification, continual learning, and language modeling in data-scarce regimes.

Adaptive Temperature Scaling (ATS) is a post-hoc calibration paradigm that replaces the single global temperature of classical temperature scaling with a learned, context-dependent temperature. By assigning either per-sample or contextually-adaptive rescaling parameters to the logits of a neural network classifier or LLM, ATS enables fine-grained adjustment of calibrated confidences, addressing systematic over- or under-confidence that global approaches cannot rectify. ATS is now a central technique for robust uncertainty quantification, especially in data-scarce regimes, continual learning, conformal prediction, and large language modeling.

1. Principles of Adaptive Temperature Scaling

The core principle of ATS is to generalize the standard temperature scaling transformation. Instead of applying a fixed scalar $T$ to all logit vectors $z(x) \in \mathbb{R}^K$,

$$p_k(x; T) = \frac{\exp(z_k(x)/T)}{\sum_{j=1}^K \exp(z_j(x)/T)},$$

ATS introduces a temperature function $T(\cdot)$ that depends explicitly on each input, its features, or its context:

$$p_k(x; T(x)) = \frac{\exp(z_k(x)/T(x))}{\sum_{j=1}^K \exp(z_j(x)/T(x))},$$

or, equivalently, $T$ may be parameterized by prediction-specific quantities (e.g., logits, features, uncertainty metrics, or latent representations).

This functional flexibility enables ATS to:

  • Adapt calibration corrections at the level of individual predictions or meaningful data contexts.
  • Preserve the $\arg\max$ structure, thus maintaining the original model's accuracy.
  • Exploit domain-specific uncertainty proxies—such as logit margin, entropy, or feature-space distances—to provide robust temperature assignments.
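As a concrete illustration of the transformation above, here is a minimal NumPy sketch of a per-sample temperature softmax; the temperature vector stands in for the output of any $T(x)$ head (the function name is ours, for illustration):

```python
import numpy as np

def adaptive_softmax(logits, temperatures):
    """Apply a per-sample temperature T(x) before the softmax.

    logits: (N, K) array of raw scores z(x).
    temperatures: (N,) array of positive per-sample temperatures T(x).
    """
    scaled = logits / temperatures[:, None]      # z_k(x) / T(x)
    scaled -= scaled.max(axis=1, keepdims=True)  # numerical stability
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

logits = np.array([[2.0, 1.0, 0.1],
                   [4.0, 0.5, 0.2]])
temps = np.array([0.8, 2.5])  # hypothetical outputs of a temperature head
probs = adaptive_softmax(logits, temps)

# Any positive temperature preserves the arg-max, hence the accuracy.
assert (probs.argmax(axis=1) == logits.argmax(axis=1)).all()
```

Because division by a positive scalar is monotone, the predicted class never changes; only the sharpness of the distribution does.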

2. Methodological Variants

Expressive Parameterizations

ATS methods span a spectrum of parameterization complexity, from two-parameter scalar functions of a single uncertainty proxy to multi-layer neural networks that consume full logit or feature vectors.

Uncertainty Proxies and Features

  • Logit gap / margin: The difference between the highest and second-highest logits captures decision-boundary uncertainty and yields a robust scalar input to ATS heads (Guo et al., 30 Jun 2025).
  • Predictive entropy: The entropy of the predicted class distribution informs confidence mismatches (Balanya et al., 2022).
  • Prototype-based distances: In continual learning, distances to feature-space prototypes reflect task proximity for batch-level temperature adaptation (Serra et al., 25 Sep 2025).
  • Latent representations: Conditional VAEs or other feature models can provide class-likelihood signals leveraged for temperature prediction (Joy et al., 2022).
  • LLM hidden states: In LLMs, per-token hidden states parameterize token-wise temperatures through calibration-specific heads (Xie et al., 2024).
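The first two proxies in the list can be computed directly from the logits. A minimal NumPy sketch (the helper names are ours, not from any cited paper):

```python
import numpy as np

def logit_gap(logits):
    """Margin between the top-1 and top-2 logits (larger = more certain)."""
    top2 = np.sort(logits, axis=1)[:, -2:]
    return top2[:, 1] - top2[:, 0]

def predictive_entropy(logits):
    """Entropy of the softmax distribution (larger = less certain)."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=1)
```

Either scalar can then be fed to a small temperature head, as in the variants catalogued below.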

Optimization Objectives

Temperature functions are fit on a held-out calibration set by minimizing a calibration-oriented loss, such as negative log-likelihood, a Brier-style objective, or soft-binned ECE (SoftECE), depending on the variant.

3. Algorithms and Representative Implementations

Per-Sample and Per-Context Architecture Table

| ATS Variant | Temperature Parameterization | Input Signal |
|---|---|---|
| SMART (Guo et al., 30 Jun 2025) | Small 1-layer MLP | Logit gap $\Delta = z_{(1)} - z_{(2)}$ |
| PTS (Tomani et al., 2021) | 3-layer NN | Sorted top-$k$ logits $\mathbf{z}^s(x)$ |
| ETS (Balanya et al., 2022) | $\mathrm{softplus}(w \log \bar{H}(z) + b)$ | Normalized predictive entropy |
| DATS (Serra et al., 25 Sep 2025) | $T(d_c) = T_{\rm base} + w d_c$ | Prototype-based class distance |
| ADATS (Joy et al., 2022) | 2-layer MLP | Per-class VAE log-likelihoods $\log p_\lambda(z \mid y_i)$ |
| ATS-CP (Kotelevskii et al., 21 May 2025) | Numerical bisection s.t. conformal coverage | Nonconformity scores per label |
| LLM ATS (Xie et al., 2024) | Causal Transformer head | Token hidden state $h_i$ (per-token) |

SMART (Guo et al., 30 Jun 2025) deploys a low-variance, margin-aware variant in which a one-hidden-layer MLP maps the logit gap $\Delta_i$ to the temperature $T_i$. This network is trained with the SoftECE loss, which adaptively bins predicted confidences, thus balancing bias and variance with minimal parameterization (typically fewer than $50$ total parameters).
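The soft-binning idea behind SoftECE can be sketched as follows; the Gaussian-kernel bin membership and the bandwidth are illustrative assumptions on our part, not the exact SMART objective:

```python
import numpy as np

def soft_ece(confidences, correct, n_bins=10, bandwidth=0.05):
    """Soft-binned ECE: replaces hard bin assignment with smooth kernel
    weights around bin centres, giving a differentiable calibration loss.
    Sketch of the idea only, not the exact SMART formulation."""
    centers = (np.arange(n_bins) + 0.5) / n_bins
    # (N, B) soft membership of each confidence to each bin
    w = np.exp(-0.5 * ((confidences[:, None] - centers[None, :]) / bandwidth) ** 2)
    w /= w.sum(axis=1, keepdims=True)
    mass = w.sum(axis=0)                       # soft sample count per bin
    avg_conf = (w * confidences[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
    avg_acc = (w * correct[:, None]).sum(axis=0) / np.maximum(mass, 1e-12)
    return np.sum(mass / mass.sum() * np.abs(avg_conf - avg_acc))
```

Because bin membership varies smoothly with confidence, this loss has usable gradients with respect to the temperature head's parameters, unlike hard-binned ECE.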

In PTS (Tomani et al., 2021), a 3-layer fully-connected network consumes sorted top-$k$ logits and outputs a positive scalar temperature. This architecture enables expressive, nonlinear mappings from the logit profile to the temperature, trained via a Brier-style or cross-entropy objective. ETS (Balanya et al., 2022) applies a simple two-parameter function of the log-entropy, offering superior robustness in data-scarce regimes.
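The two-parameter ETS form from the table admits a direct sketch; the normalization of the entropy by $\log K$ and the small epsilon terms are our assumptions:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def ets_temperature(logits, w, b):
    """Entropy-based two-parameter temperature, sketching the ETS form
    T(x) = softplus(w * log(Hbar(z)) + b), where Hbar is the softmax
    entropy normalized by its maximum value log K."""
    z = logits - logits.max(axis=1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    H = -(p * np.log(p + 1e-12)).sum(axis=1)
    Hbar = H / np.log(logits.shape[1])  # normalize to (0, 1]
    return softplus(w * np.log(Hbar + 1e-12) + b)
```

With only two learnable scalars $(w, b)$, there is essentially nothing to overfit, which is why this form remains stable when the calibration set is tiny.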

ATS-CP (Kotelevskii et al., 21 May 2025) addresses the conformal prediction setting by searching for a per-input temperature $\tau^*(x)$ that guarantees calibrated probability mass on conformal sets.
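The per-input bisection search can be illustrated as follows, under the simplifying assumption that the target set contains the top-ranked classes, so that its softmax mass decreases monotonically in $\tau$; the actual ATS-CP procedure operates on nonconformity scores:

```python
import numpy as np

def set_mass(logits, label_set, tau):
    """Softmax probability mass assigned to `label_set` at temperature tau."""
    z = logits / tau
    z -= z.max()
    p = np.exp(z) / np.exp(z).sum()
    return p[label_set].sum()

def bisect_temperature(logits, label_set, target, lo=1e-3, hi=100.0, iters=60):
    """Bisection for a tau with set_mass(tau) == target.  Assumes the set
    contains the top-ranked classes, so the mass is decreasing in tau."""
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if set_mass(logits, label_set, mid) > target:
            lo = mid  # mass too high -> raise temperature
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

For example, with logits `[3.0, 1.0, 0.0]` the top class holds about 84% of the mass at $\tau = 1$; bisection finds the larger $\tau$ at which its mass drops to a requested 70%.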

For LLMs, LLM-specific ATS (Xie et al., 2024) attaches a lightweight, single-layer causal Transformer block to map the per-token hidden state $h_i$ to a log-temperature, enabling token-level calibration after RLHF finetuning.

4. Calibration Metrics, Bias–Variance Trade-offs, and Empirical Results

Calibration Metrics

  • Expected Calibration Error (ECE): Aggregates absolute difference between average confidence and accuracy in confidence bins.
  • Adaptive ECE (AdaECE), SoftECE: Adaptive or soft binning variants that address pitfalls in standard binning, especially in small or imbalanced datasets (Guo et al., 30 Jun 2025, Balanya et al., 2022).
  • Negative Log-Likelihood (NLL): Measures overall log-probability assignment to the true class.
  • Brier Score: Mean squared error between predicted probability vector and one-hot target.
  • Maximum Calibration Error (MCE): Worst-case binwise gap.
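The standard binned ECE from the list above can be computed in a few lines; equal-width bins and this particular weighting are the usual convention:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width-binned ECE: bin-frequency-weighted mean of the absolute
    gap between average accuracy and average confidence per bin."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(correct[mask].mean() - confidences[mask].mean())
    return ece
```

The adaptive and soft variants (AdaECE, SoftECE) change only the binning step, equalizing bin mass or smoothing bin membership, while keeping this same accuracy-vs-confidence gap.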

Bias–Variance Considerations

  • Global TS: High bias, low variance; fails to correct heterogeneity in miscalibration.
  • Expressive ATS (PTS, high-capacity NNs): Low bias, higher variance; risk of overfitting with small calibration sets.
  • SMART: Low-dimensional (1D input), margin-aware, soft-binned; achieves robust bias–variance tradeoff, minimal overfitting (Guo et al., 30 Jun 2025).
  • ETS: Extreme parameter parsimony enables generalization in scarce-data settings (Balanya et al., 2022).

Empirical Performance

ATS variants consistently improve calibration under diverse test conditions:

| Method | CIFAR-10, ResNet-50 ECE | CIFAR-100, ResNet-50 ECE | ImageNet-1K ECE (val size = 50) |
|---|---|---|---|
| TS | 1.38% | 5.61% | 2.17% |
| PTS | 1.10% | 1.96% | 0.95% |
| CTS | 0.83% | 3.67% | — |
| Spline | 1.52% | 3.48% | 0.62% |
| SMART | 0.85% | 1.37% | 0.61% |

ATS methods preserve top-1 accuracy while decreasing calibration error, outperforming global TS by a factor of 2–5× depending on data and architecture (Guo et al., 30 Jun 2025, Tomani et al., 2021). Under data scarcity ($N < 100$), SMART and ETS maintain stable ECE, contrasting with the dramatic variance-driven degradation seen in over-parameterized neural ATS (Guo et al., 30 Jun 2025, Balanya et al., 2022).

In language modeling, token-wise ATS reduces ECE by 10–50% over the best global TS methods on multi-choice and QA tasks, with no deterioration of RLHF-induced performance (Xie et al., 2024).

5. Domains of Application and Specialized Contexts

Neural Classification

ATS is now standard for post-hoc calibration of deep image classifiers under i.i.d., shift, corruption, and long-tail scenarios, as well as in deep ensemble settings (Guo et al., 30 Jun 2025, Tomani et al., 2021, Joy et al., 2022).

Continual and Incremental Learning

Distance-Aware Temperature Scaling (DATS) uses class prototype distances to adapt temperature by batch, solving the problem of calibration drift and oscillating ECE in class-incremental streams without known task ID at inference (Serra et al., 25 Sep 2025). This yields substantial improvements in both average and worst-case calibration.
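A sketch of the DATS temperature rule $T(d_c) = T_{\rm base} + w\,d_c$; the Euclidean distance metric and the batch-mean aggregation here are illustrative assumptions:

```python
import numpy as np

def dats_temperature(features, prototypes, t_base=1.0, w=0.1):
    """Distance-aware batch temperature, sketching T(d_c) = T_base + w * d_c.
    d_c is the mean distance of batch features to their nearest class
    prototype; a batch far from all known classes gets a higher temperature
    (softer, less confident predictions)."""
    # (N, C) Euclidean distances from each feature to each prototype
    d = np.linalg.norm(features[:, None, :] - prototypes[None, :, :], axis=2)
    d_c = d.min(axis=1).mean()  # batch-level proximity to known classes
    return t_base + w * d_c
```

Because the rule depends only on distances to stored prototypes, it requires no task identity at inference time, which is the key requirement in class-incremental streams.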

Distribution-Free Conformal Prediction

ATS-CP leverages input-dependent temperature selection to enforce coverage constraints on conformal sets, offering the first principled approach to assign calibrated probabilities while preserving conformal guarantees (Kotelevskii et al., 21 May 2025).

LLMs

ATS is employed to recalibrate post-RLHF LLMs at the token level, restoring reliable confidence estimates despite non-uniform miscalibration induced by reward optimization (Xie et al., 2024).

6. Limitations and Future Directions

ATS inherits the structural limitation that it only adjusts confidence, not the predicted class ranking. All accuracy-preserving calibration is fundamentally a correction to softmax scale, so errors in class ranking from the base model are untouched (Tomani et al., 2021, Guo et al., 30 Jun 2025). The power of the temperature mapping must be balanced against the size of the calibration set: excess capacity can overfit (high variance), insufficient capacity fails to capture real miscalibration (high bias).

Current research directions include:

  • Improved theoretical generalization bounds on held-out calibration error.
  • Uncertainty-aware regularization and curriculum calibration strategies.
  • Joint learning of features and temperatures, allowing calibration-aware representation learning.
  • Extensions to structured prediction, regression, and dense prediction tasks.
  • Data-driven binning schemes and uncertainty proxies beyond logit margin and entropy.

A plausible implication is that lightweight, adaptive approaches (e.g., SMART, ETS) will serve as first-choice calibration layers for safety-critical and data-constrained environments, while expressive neural ATS will remain state-of-the-art in rich-data, high-capacity domains.

7. Summary and Comparative Table

| Main ATS Approach | Param. Count (order) | Input Signal | Robustness (Low Data) | Empirical ECE (CIFAR-10/100) |
|---|---|---|---|---|
| Global TS | 1 | — | High | 1.38% / 5.61% (Guo et al., 30 Jun 2025) |
| PTS (Tomani et al., 2021) | ≈91 | Sorted logits | Moderate | 1.10% / 1.96% |
| ETS (Balanya et al., 2022) | 2 | Entropy | High | 1.34% (examples) |
| SMART (Guo et al., 30 Jun 2025) | 49 | Logit gap | Very high | 0.85% / 1.37% |
| ADATS (Joy et al., 2022) | $10^2$–$10^3$ | VAE logliks | High | 0.76% / 2.95% |
| DATS (Serra et al., 25 Sep 2025) | 2 | Proto. distance | High | 20–35% ↓ ECE over TS |

The empirical consensus is that ATS—implemented via unsupervised uncertainty proxies and small parametric functions—enables strong, efficient, and data-efficient calibration across a range of domains, with minimal computational burden and high robustness in limited-data regimes. For further details and open-source implementations, see SMART (Guo et al., 30 Jun 2025), ADATS (Joy et al., 2022), PTS (Tomani et al., 2021), and DATS (Serra et al., 25 Sep 2025).
