Adaptive Temperature Scaling (ATS)
- Adaptive Temperature Scaling is a calibration technique that replaces a single global temperature with input-specific adjustments to improve the calibration of prediction confidences.
- It leverages uncertainty proxies such as logit margins and predictive entropy to tailor temperature values per sample or context for improved reliability.
- ATS has demonstrated robust performance across diverse applications including image classification, continual learning, and language modeling in data-scarce regimes.
Adaptive Temperature Scaling (ATS) is a post-hoc calibration paradigm that replaces the single global temperature of classical temperature scaling with a learned, context-dependent temperature. By assigning per-sample or contextually adaptive rescaling parameters to the logits of a neural network classifier or LLM, ATS enables fine-grained adjustment of calibrated confidences, addressing systematic over- or under-confidence that global approaches cannot rectify. ATS is now a central technique for robust uncertainty quantification, especially in data-scarce regimes, continual learning, conformal prediction, and large language modeling.
1. Principles of Adaptive Temperature Scaling
The core principle of ATS is to generalize the standard temperature scaling transformation. Instead of applying a fixed scalar $T > 0$ to all logit vectors $\mathbf{z}$, i.e. $\hat{p} = \mathrm{softmax}(\mathbf{z}/T)$,
ATS introduces a temperature function that depends explicitly on each input, its features, or its context: $\hat{p}(x) = \mathrm{softmax}\left(\mathbf{z}(x)/T(x)\right)$, where $T(\cdot)$ may equivalently be parameterized by prediction-specific quantities (e.g., logits, features, uncertainty metrics, or latent representations).
This functional flexibility enables ATS to:
- Adapt calibration corrections at the level of individual predictions or meaningful data contexts.
- Preserve the ranking of the logits (division by a positive scalar is monotone), thus maintaining the original model's accuracy.
- Exploit domain-specific uncertainty proxies—such as logit margin, entropy, or feature-space distances—to provide robust temperature assignments.
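To make the contrast concrete, here is a minimal NumPy sketch of global versus adaptive temperature scaling; the `toy_temp` function is an illustrative stand-in, not any published parameterization.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def global_ts(logits, T):
    """Classical temperature scaling: one scalar T for every sample."""
    return softmax(logits / T)

def adaptive_ts(logits, temp_fn):
    """Adaptive temperature scaling: T depends on each logit vector.

    `temp_fn` is any per-sample mapping z -> T(z) > 0 (hypothetical here);
    dividing by a positive scalar preserves the argmax, hence accuracy.
    """
    T = np.apply_along_axis(temp_fn, -1, logits)   # shape (N,)
    return softmax(logits / T[:, None])

def toy_temp(z):
    """Illustrative entropy-flavoured temperature: hotter when uncertain."""
    p = softmax(z)
    H = -(p * np.log(p + 1e-12)).sum()
    return 1.0 + H

logits = np.array([[4.0, 1.0, 0.0], [1.2, 1.0, 0.9]])
probs = adaptive_ts(logits, toy_temp)
```

Because each per-sample temperature is positive, the argmax of every row of `probs` matches the argmax of the raw logits, so accuracy is unchanged while the confidence profile is reshaped.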
2. Methodological Variants
Expressive Parameterizations
ATS methods span a spectrum of complexity:
- Low-dimensional parametric forms: Simple functions of entropy (Balanya et al., 2022), logit margin (Guo et al., 30 Jun 2025), or other low-dimensional summary statistics.
- Neural architectures: Prediction-specific temperatures via small neural networks over logits or features (Tomani et al., 2021, Joy et al., 2022, Xie et al., 2024).
- Task- or context-aware mappings: Temperatures assigned by explicit class, task, prototype distance, or batch-level statistics (Serra et al., 25 Sep 2025).
Uncertainty Proxies and Features
- Logit gap / margin: The difference between the highest and second-highest logits captures decision boundary uncertainty and yields robust, scalar input to ATS heads (Guo et al., 30 Jun 2025).
- Predictive entropy: The entropy of the predicted class distribution informs confidence mismatches (Balanya et al., 2022).
- Prototype-based distances: In continual learning, distances to feature-space prototypes reflect task proximity for batch-level temperature adaptation (Serra et al., 25 Sep 2025).
- Latent representations: Conditional VAEs or other feature models can provide class-likelihood signals leveraged for temperature prediction (Joy et al., 2022).
- LLM hidden states: In LLMs, per-token hidden states parameterize token-wise temperatures through calibration-specific heads (Xie et al., 2024).
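The proxies above are all cheap to compute from logits or features. A hedged sketch of the three most common ones (function names are illustrative):

```python
import numpy as np

def logit_margin(z):
    """Gap between the highest and second-highest logit: a
    decision-boundary uncertainty proxy (small gap = uncertain)."""
    top2 = np.sort(z, axis=-1)[..., -2:]      # two largest, ascending
    return top2[..., 1] - top2[..., 0]

def predictive_entropy(z):
    """Entropy of softmax(z): a confidence-mismatch proxy."""
    z = z - z.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(p + 1e-12)).sum(axis=-1)

def prototype_distance(feat, prototypes):
    """Distance from a feature vector to the nearest class prototype,
    as used for batch-level adaptation in continual learning."""
    return np.linalg.norm(prototypes - feat, axis=-1).min()
```

A confident logit vector yields a large margin and low entropy; an ambiguous one the reverse, which is exactly the signal the temperature heads consume.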
Optimization Objectives
- Negative Log-Likelihood (NLL): Classical likelihood-based objective, widely used for ATS (Tomani et al., 2021, Balanya et al., 2022, Xie et al., 2024).
- Brier Score: Squared error between predicted probabilities and true labels (Serra et al., 25 Sep 2025, Tomani et al., 2021).
- Expected Calibration Error (ECE) and Variants: Soft binning and adaptive binning extensions (SoftECE, AdaECE) provide stability and gradient smoothness in limited data (Guo et al., 30 Jun 2025, Balanya et al., 2022).
- Conformal Coverage Constraints: In conformal prediction, ATS finds per-input temperatures to meet quantile-based coverage requirements (Kotelevskii et al., 21 May 2025).
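For orientation, the NLL objective that many ATS methods optimize reduces, in the global-temperature special case, to a one-dimensional fit. A minimal sketch, with grid search standing in for the usual gradient-based optimization:

```python
import numpy as np

def nll(logits, labels, T):
    """Average negative log-likelihood of the true class at temperature T."""
    z = logits / T
    z = z - z.max(axis=1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 200)):
    """Grid-search the scalar temperature minimizing held-out NLL."""
    return grid[np.argmin([nll(logits, labels, T) for T in grid])]
```

On an overconfident model (logits systematically too sharp relative to accuracy), the fitted temperature comes out above 1, softening the probabilities; ATS objectives generalize this by making $T$ a function of the input rather than a single scalar.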
3. Algorithms and Representative Implementations
Per-Sample and Per-Context Architecture Table
| ATS Variant | Temperature Parameterization | Input Signal |
|---|---|---|
| SMART (Guo et al., 30 Jun 2025) | Small 1-layer MLP on logit gap | Logit gap (margin) |
| PTS (Tomani et al., 2021) | 3-layer NN on sorted top-$k$ logits | Sorted top-$k$ logits |
| ETS (Balanya et al., 2022) | Normalized predictive entropy | Predictive entropy |
| DATS (Serra et al., 25 Sep 2025) | Prototype-based class distance | Feature-prototype distances |
| ADATS (Joy et al., 2022) | 2-layer MLP on per-class VAE logliks | Per-class VAE log-likelihoods |
| ATS-CP (Kotelevskii et al., 21 May 2025) | Numerical bisection s.t. conformal coverage | Nonconformity scores per label |
| LLM ATS (Xie et al., 2024) | Causal Transformer head on token hidden state | Token hidden state (per-token) |
SMART (Guo et al., 30 Jun 2025) deploys a low-variance, margin-aware variant in which a one-hidden-layer MLP maps the logit gap (the difference between the top two logits) to the temperature $T$. This network is trained with the SoftECE loss, which adaptively bins predicted confidences, thus balancing bias and variance with minimal parameterization (on the order of tens of parameters in total).
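A forward-pass sketch of a margin-to-temperature head in the spirit of SMART; the hidden width, activation, and softplus output are assumptions, and the weights here are random rather than trained with SoftECE.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical SMART-style head: 1 -> H -> 1 MLP on the scalar logit gap.
# The softplus output guarantees a strictly positive temperature.
H = 16  # hidden width (assumed; the paper's exact size may differ)
W1, b1 = rng.normal(size=(H, 1)) * 0.1, np.zeros(H)
W2, b2 = rng.normal(size=(1, H)) * 0.1, np.zeros(1)

def smart_temperature(logits):
    """Map a logit vector's margin to a positive per-sample temperature."""
    top = np.sort(logits)
    gap = top[-1] - top[-2]                    # margin proxy
    h = np.tanh(W1 @ np.array([gap]) + b1)     # hidden layer
    t = (W2 @ h + b2)[0]
    return float(np.log1p(np.exp(t)) + 1e-3)   # softplus, strictly > 0
```

The one-dimensional input is what keeps the head's variance low: whatever the trained weights, the temperature can only depend on the margin, not on the full logit profile.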
In PTS (Tomani et al., 2021), a 3-layer fully-connected network consumes the sorted top-$k$ logits and outputs a positive scalar temperature. This architecture enables expressive, nonlinear mappings from the logit profile to the temperature, trained via a Brier-style or cross-entropy objective. ETS (Balanya et al., 2022) applies a simple two-parameter function of the log-entropy, offering superior robustness in data-scarce regimes.
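One plausible realization of such a two-parameter entropy map, making the log-temperature affine in the log of normalized entropy; the exact functional form in Balanya et al. may differ, so this is an illustrative sketch only.

```python
import numpy as np

def normalized_entropy(z):
    """Entropy of softmax(z), normalized to [0, 1] by log(num_classes)."""
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(z)))

def ets_temperature(z, w, b):
    """Hypothetical two-parameter entropy-based temperature:
    log T = w * log(normalized entropy) + b. Exponentiating keeps T > 0,
    so the argmax (and hence accuracy) is preserved."""
    return float(np.exp(w * np.log(normalized_entropy(z) + 1e-12) + b))
```

With only two scalars to fit, a head of this kind can be estimated reliably from very small calibration sets, which is the source of the robustness claimed above.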
ATS-CP (Kotelevskii et al., 21 May 2025) addresses the conformal prediction setting by searching for a per-input temperature that guarantees calibrated probability mass on conformal sets.
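The per-input search can be sketched as a bisection on the probability mass assigned to a candidate label set; this illustrates the idea only and is not the paper's exact procedure.

```python
import numpy as np

def set_mass(z, S, T):
    """Probability mass that softmax(z / T) places on the label set S."""
    zT = z / T
    zT = zT - zT.max()
    p = np.exp(zT) / np.exp(zT).sum()
    return p[list(S)].sum()

def bisect_temperature(z, S, target, lo=1e-2, hi=1e2, iters=60):
    """Find T with set_mass(z, S, T) ~= target by bisection.

    Assumes the mass is monotone decreasing in T on [lo, hi], which holds
    when S contains the top-scoring labels: raising T flattens the
    distribution toward uniform, draining mass from the top set.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if set_mass(z, S, mid) > target:
            lo = mid   # still too confident -> heat up further
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

Because the search is independent per input, each example receives exactly the temperature needed for its conformal set to carry the desired probability mass.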
For LLMs, LLM-specific ATS (Xie et al., 2024) attaches a lightweight, single-layer causal Transformer block that maps each token's hidden state to a log-temperature, enabling token-level calibration after RLHF finetuning.
4. Calibration Metrics, Bias–Variance Trade-offs, and Empirical Results
Calibration Metrics
- Expected Calibration Error (ECE): Aggregates absolute difference between average confidence and accuracy in confidence bins.
- Adaptive ECE (AdaECE), SoftECE: Adaptive or soft binning variants that address pitfalls in standard binning, especially in small or imbalanced datasets (Guo et al., 30 Jun 2025, Balanya et al., 2022).
- Negative Log-Likelihood (NLL): Measures overall log-probability assignment to the true class.
- Brier Score: Mean squared error between predicted probability vector and one-hot target.
- Maximum Calibration Error (MCE): Worst-case binwise gap.
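A reference implementation of the standard equal-width binned ECE, the metric that the adaptive and soft-binned variants above refine:

```python
import numpy as np

def ece(confidences, correct, n_bins=15):
    """Expected Calibration Error with equal-width confidence bins.

    `confidences` are the predicted max-class probabilities and
    `correct` is 1.0 where the prediction was right, 0.0 otherwise.
    Each bin contributes |avg confidence - accuracy| weighted by its size.
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    n = len(confidences)
    err = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            err += mask.sum() / n * gap
    return err
```

The hard bin boundaries are what make this estimator noisy on small calibration sets; SoftECE and AdaECE replace them with soft or adaptive assignments to recover smooth gradients.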
Bias–Variance Considerations
- Global TS: High bias, low variance; fails to correct heterogeneity in miscalibration.
- Expressive ATS (PTS, high-capacity NNs): Low bias, higher variance; risk of overfitting with small calibration sets.
- SMART: Low-dimensional (1D input), margin-aware, soft-binned; achieves robust bias–variance tradeoff, minimal overfitting (Guo et al., 30 Jun 2025).
- ETS: Extreme parameter parsimony enables generalization in scarce-data settings (Balanya et al., 2022).
Empirical Performance
ATS variants consistently improve calibration under diverse test conditions:
| Method | CIFAR-10, ResNet-50 ECE | CIFAR-100, ResNet-50 ECE | ImageNet-1K ECE (val size = 50) |
|---|---|---|---|
| TS | 1.38% | 5.61% | 2.17% |
| PTS | 1.10% | 1.96% | 0.95% |
| CTS | 0.83% | 3.67% | – |
| Spline | 1.52% | 3.48% | 0.62% |
| SMART | 0.85% | 1.37% | 0.61% |
ATS methods preserve top-1 accuracy while decreasing calibration error, outperforming global TS by a factor of $2$–$5$× depending on data and architecture (Guo et al., 30 Jun 2025, Tomani et al., 2021). Under data scarcity (e.g., calibration sets of only tens of samples), SMART and ETS maintain stable ECE, in contrast with the dramatic variance-driven degradation seen in over-parameterized neural ATS (Guo et al., 30 Jun 2025, Balanya et al., 2022).
In language modeling, token-wise ATS reduces ECE by 10–50% over the best global TS methods on multi-choice and QA tasks, with no deterioration of RLHF-induced performance (Xie et al., 2024).
5. Domains of Application and Specialized Contexts
Neural Classification
ATS is now standard for post-hoc calibration of deep image classifiers under i.i.d., shift, corruption, and long-tail scenarios, as well as in deep ensemble settings (Guo et al., 30 Jun 2025, Tomani et al., 2021, Joy et al., 2022).
Continual and Incremental Learning
Distance-Aware Temperature Scaling (DATS) uses class prototype distances to adapt temperature by batch, solving the problem of calibration drift and oscillating ECE in class-incremental streams without known task ID at inference (Serra et al., 25 Sep 2025). This yields substantial improvements in both average and worst-case calibration.
Distribution-Free Conformal Prediction
ATS-CP leverages input-dependent temperature selection to enforce coverage constraints on conformal sets, offering the first principled approach to assign calibrated probabilities while preserving conformal guarantees (Kotelevskii et al., 21 May 2025).
LLMs
ATS is employed to recalibrate post-RLHF LLMs at the token level, restoring reliable confidence estimates despite non-uniform miscalibration induced by reward optimization (Xie et al., 2024).
6. Limitations and Future Directions
ATS inherits the structural limitation that it only adjusts confidence, not the predicted class ranking. All accuracy-preserving calibration is fundamentally a correction to softmax scale, so errors in class ranking from the base model are untouched (Tomani et al., 2021, Guo et al., 30 Jun 2025). The capacity of the temperature mapping must be balanced against the size of the calibration set: excess capacity overfits (high variance), while insufficient capacity fails to capture real miscalibration (high bias).
Current research directions include:
- Improved theoretical generalization bounds on held-out calibration error.
- Uncertainty-aware regularization and curriculum calibration strategies.
- Joint learning of features and temperatures, allowing calibration-aware representation learning.
- Extensions to structured prediction, regression, and dense prediction tasks.
- Data-driven binning schemes and uncertainty proxies beyond logit margin and entropy.
A plausible implication is that lightweight, adaptive approaches (e.g., SMART, ETS) will serve as first-choice calibration layers for safety-critical and data-constrained environments, while expressive neural ATS will remain state-of-the-art in rich-data, high-capacity domains.
7. Summary and Comparative Table
| Main ATS Approach | Param. Count (order) | Input Signal | Robustness (Low Data) | Empirical ECE (CIFAR-10/100) |
|---|---|---|---|---|
| Global TS | 1 | – | High | 1.38% / 5.61% (Guo et al., 30 Jun 2025) |
| PTS (Tomani et al., 2021) | ≈91 | Sorted logits | Moderate | 1.10% / 1.96% |
| ETS (Balanya et al., 2022) | 2 | Entropy | High | 1.34% (examples) |
| SMART (Guo et al., 30 Jun 2025) | 49 | Logit gap | Very high | 0.85% / 1.37% |
| ADATS (Joy et al., 2022) | $10^2$–$10^3$ | VAE logliks | High | 0.76% / 2.95% |
| DATS (Serra et al., 25 Sep 2025) | 2 | Proto. distance | High | 20–35% ↓ ECE over TS |
The empirical consensus is that ATS—implemented via unsupervised uncertainty proxies and small parametric functions—enables strong, efficient, and data-efficient calibration across a range of domains, with minimal computational burden and high robustness in limited-data regimes. For further details and open-source implementations, see SMART (Guo et al., 30 Jun 2025), ADATS (Joy et al., 2022), PTS (Tomani et al., 2021), and DATS (Serra et al., 25 Sep 2025).