
Class Adaptive Conformal Training

Updated 21 January 2026
  • Class Adaptive Conformal Training (CaCT) is a framework that integrates conformal prediction with per-class augmented Lagrangian updates to tailor prediction set sizes to class heterogeneity while ensuring coverage guarantees.
  • It employs learned classwise penalties to optimize both efficiency and uniformity, addressing challenges in long-tailed and imbalanced class distributions.
  • Empirical studies show that CaCT achieves smaller prediction sets and minimal coverage gaps compared to global-penalty methods, enhancing robustness and calibration.

Class Adaptive Conformal Training (CaCT) is a framework for uncertainty quantification in deep neural networks that combines conformal prediction theory with an augmented Lagrangian optimization scheme to produce class-conditionally efficient prediction sets while maintaining formal coverage guarantees. CaCT generalizes previous conformal training techniques by explicitly learning per-class penalties and constraints, thereby adapting prediction set sizes to the heterogeneity of class distributions, including in long-tailed regimes, and outperforming global-penalty approaches in efficiency, coverage uniformity, and robustness (Marani et al., 14 Jan 2026).

1. Background and Motivation

Conformal Prediction (CP) is a wrapper algorithm that augments a base classifier with a prediction set $\mathcal{C}(x)\subseteq\{1,\ldots,K\}$ for $K$-class classification, such that for $(X,Y)\sim\mathcal{P}$, the marginal coverage

$$\Pr(Y\in\mathcal{C}(X))\ge 1-\alpha$$

is guaranteed for a user-specified tolerance $\alpha$. When applied post hoc to a trained classifier, CP grants formal calibration but cannot influence the underlying model's representations or class scores to directly optimize prediction set sizes (efficiency) or tailor behavior for specific classes. Conformal training methods such as ConfTr (Stutz et al., 2021) address this by differentiating through the conformalization procedure and penalizing average set size during training.

However, previous methods typically impose a single scalar efficiency penalty (e.g., $\lambda$) shared across all classes. In practice, especially in large-scale or long-tailed settings, class frequencies and difficulties differ substantially; a global constraint often misallocates the efficiency budget, leading to oversized prediction sets for tail classes or accidental undercoverage. CaCT addresses this by introducing classwise constraints and per-class adaptivity, learning a separate penalty $\lambda_k$ for each class $k$ through an augmented Lagrangian approach, thus enabling fine-grained set-size control without distributional assumptions (Marani et al., 14 Jan 2026).

2. Formulation: Augmented Lagrangian for Classwise Set-Size Constraints

Let $\mathcal{D}=\{(X_i, Y_i)\}_{i=1}^n$ with $Y_i\in\{1,\ldots,K\}$, and let $\pi_\theta(x)\in\Delta_K$ be the classifier parameterized by $\theta$. Non-conformity scores $S_\theta(x, y)$ (e.g., THR, APS, RAPS) are employed to compute calibrated thresholds on a held-out set, yielding the prediction set

$$\mathcal{C}_\theta(x) = \{y : S_\theta(x, y) \le \widehat{q}_\theta\}$$

where $\widehat{q}_\theta$ is the empirical quantile. Define the per-class average set size

$$\widehat{d}_k(\theta) = \frac{1}{n_k} \sum_{i : Y_i = k} |\mathcal{C}_\theta(X_i)|$$

for the $n_k$ instances of class $k$. The target is

$$\min_\theta\ \mathcal{L}_{\mathrm{cls}}(\theta) \quad \text{s.t. } \widehat{d}_k(\theta) \le \eta \;\; \forall k$$

for a target size $\eta$.
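As a concrete illustration of the classwise quantities above, the following minimal NumPy sketch computes $\widehat{d}_k$ and the constraint residuals $z_k = \widehat{d}_k - \eta$; the array names and toy values are hypothetical:

```python
import numpy as np

def classwise_set_sizes(set_sizes, labels, num_classes):
    """Per-class average prediction-set size d_k = mean of |C(x_i)| over class k."""
    d = np.zeros(num_classes)
    for k in range(num_classes):
        mask = labels == k
        d[k] = set_sizes[mask].mean() if mask.any() else 0.0
    return d

# Toy example: 6 samples, 3 classes, with |C(x_i)| already computed
sizes = np.array([1, 3, 2, 2, 1, 4], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])
d_hat = classwise_set_sizes(sizes, labels, num_classes=3)   # [2.0, 2.0, 2.5]
eta = 1.5
residuals = d_hat - eta   # z_k > 0 means class k violates the size budget
```

A positive residual flags a class whose sets exceed the budget $\eta$ and should therefore attract a larger penalty.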

CaCT introduces dual variables $\lambda_k>0$ and penalty weights $\rho_k>0$ per class. Using the Powell-Hestenes-Rockafellar (PHR) penalty, the augmented Lagrangian objective at iteration $j$ is

$$\mathcal{L}_{\mathrm{CaCT}}^{(j)}(\theta) = \mathcal{L}_{\mathrm{cls}}(\theta) + \sum_{k=1}^K P(z_k(\theta), \lambda_k^{(j)}, \rho_k^{(j)})$$

where $z_k(\theta) = \widehat{d}_k(\theta) - \eta$ and the penalty $P$ is defined by

$$P_{\mathrm{PHR}}(z, \lambda, \rho) = \begin{cases} \lambda z + \frac{1}{2}\rho z^2 & \text{if } \lambda+\rho z\ge 0 \\ -\dfrac{\lambda^2}{2\rho} & \text{otherwise} \end{cases}$$

The dual update

$$\lambda_k^{(j+1)} = \partial_z P\big(z_k(\theta^{(j)}), \lambda_k^{(j)}, \rho_k^{(j)}\big)$$

increases $\lambda_k$ when constraints are violated ($z_k>0$). The penalty parameters $\rho_k^{(j)}$ are increased by a factor $\beta>1$ when constraints stop improving.

This framework ensures class-specific control over set size, automates tuning of efficiency penalties, and is agnostic to the choice of non-conformity score.
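The PHR penalty and its dual update follow directly from the formulas above. The sketch below is a minimal NumPy illustration with hypothetical toy values; in practice the residuals $z_k$ would come from the training loop:

```python
import numpy as np

def phr_penalty(z, lam, rho):
    """PHR penalty for the constraint z <= 0: quadratic where the multiplier
    estimate lam + rho*z is nonnegative, constant (in z) otherwise."""
    active = lam + rho * z >= 0
    return np.where(active, lam * z + 0.5 * rho * z ** 2, -lam ** 2 / (2 * rho))

def dual_update(z, lam, rho):
    """lambda <- dP/dz = max(0, lam + rho*z): grows only for violated
    constraints (z_k > 0), leaving satisfied classes unpenalized."""
    return np.maximum(0.0, lam + rho * z)

z = np.array([0.5, -0.2])        # class 0 violates its budget, class 1 does not
lam = np.array([1e-6, 1e-6])     # initial duals
rho = np.array([1.0, 1.0])       # initial penalty weights
pen = phr_penalty(z, lam, rho)
lam_next = dual_update(z, lam, rho)
```

Note that `dual_update` is exactly $\partial_z P$: on the active branch the derivative is $\lambda + \rho z$, and on the inactive branch the penalty is constant in $z$, so the new multiplier is zero.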

3. Algorithmic Structure

The CaCT optimization consists of an inner (SGD-based model update) and an outer (dual/penalty update) loop.

  1. Initialization: set $\lambda_k \approx 10^{-6}$ and $\rho_k \approx 1$ for all $k$; choose $\beta \approx 1.2$.
  2. For each outer iteration $j=0,\ldots,T-1$:
    • Inner loop:
      • Sample a minibatch $B$ and split it into $B_\text{cal}$ and $B_\text{pred}$.
      • Compute the smooth quantile $\widehat{q}_\theta$ on $B_\text{cal}$ (using sigmoid or NeuralSort relaxations for differentiability).
      • Construct prediction sets (soft during training) on $B_\text{pred}$.
      • Compute the classwise average set sizes $\widehat{d}_k$ on $B_\text{pred}$.
      • Form the total loss

$$\mathcal{L}_{\rm tot} = \mathcal{L}_{\rm cls}(\theta; B_{\rm pred}) + \sum_{k=1}^K P(\widehat{d}_k - \eta, \lambda_k^{(j)}, \rho_k^{(j)})$$

      • Backpropagate and update $\theta$ via SGD.
    • Outer loop (every few epochs):
      • Evaluate $\widehat{d}_k$ on a validation set.
      • Update $\lambda_k$ using the gradient of $P$.
      • If $\widehat{d}_k$ fails to decrease, increase $\rho_k$ by the factor $\beta$.

During training, smoothed surrogate set sizes and quantiles enable gradient-based optimization. At test time, exact (non-differentiable) split conformal prediction is used for final prediction sets.
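The sigmoid smoothing of the set-size indicator can be illustrated as follows. This minimal NumPy sketch relaxes only the indicator $\mathbf{1}[S(x,y) \le \widehat{q}_\theta]$ given a threshold; smoothing the quantile itself (e.g., via NeuralSort) is omitted, and the score matrix and threshold are hypothetical toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_set_sizes(scores, q, T=0.05):
    """Soft prediction-set size: sum_y sigmoid((q - S(x, y)) / T).
    As T -> 0 this approaches the hard count |{y : S(x, y) <= q}|,
    while remaining differentiable in the scores (and hence in theta)."""
    return sigmoid((q - scores) / T).sum(axis=1)

# Toy non-conformity scores: 2 samples x 3 classes, hypothetical threshold
scores = np.array([[0.1, 0.6, 0.9],
                   [0.2, 0.3, 0.8]])
q = 0.5
sizes_soft = soft_set_sizes(scores, q, T=0.01)   # close to the hard sizes
sizes_hard = (scores <= q).sum(axis=1)           # exact counts: [1, 2]
```

With a sharp temperature (small $T$) the soft sizes track the hard counts closely, which is consistent with the reported efficiency gains from sharp smoothing.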

4. Coverage Guarantees and Theoretical Properties

CaCT maintains the formal guarantees of conformal prediction:

  • Marginal coverage holds by construction, since final test-time prediction sets are exact conformal sets calibrated on held-out data:

$$\Pr(Y\in\mathcal{C}(X))\ge 1-\alpha$$

This remains valid under i.i.d. or exchangeability assumptions between calibration and test data.

  • Class-conditional coverage is achieved by applying label-conditional (Mondrian) conformal predictors at test time, for which class-specific thresholds are calibrated. Because CaCT learns per-class set-size behaviors, it supports this extension and ensures:

$$\Pr(Y\in\mathcal{C}(X)\mid Y=k)\ge 1-\alpha \quad \forall k$$

  • Smoothing of indicators and quantiles is used only during training to provide gradients. These approximations are eliminated at test time, so formal properties are identical to standard CP (Marani et al., 14 Jan 2026).
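The exact test-time procedure with label-conditional (Mondrian) calibration can be sketched as follows; this is a minimal NumPy illustration under the standard split-conformal quantile rule, with hypothetical toy scores:

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_labels, num_classes, alpha=0.1):
    """Per-class (Mondrian) conformal thresholds from held-out calibration data.
    q_k is the ceil((n_k + 1) * (1 - alpha))-th smallest non-conformity score
    S(x_i, y_i) among calibration examples with y_i = k."""
    q = np.full(num_classes, np.inf)   # no calibration data -> always include
    for k in range(num_classes):
        s = np.sort(cal_scores[cal_labels == k])
        n_k = len(s)
        if n_k > 0:
            rank = min(int(np.ceil((n_k + 1) * (1 - alpha))), n_k)
            q[k] = s[rank - 1]
    return q

def predict_sets(test_scores, q):
    """C(x) = {k : S(x, k) <= q_k} with class-specific thresholds."""
    return [set(np.where(row <= q)[0]) for row in test_scores]

# Toy calibration data: class 0 has scores 0.1..1.0, class 1 constant at 0.5
cal_scores = np.concatenate([np.arange(1, 11) / 10.0, np.full(5, 0.5)])
cal_labels = np.array([0] * 10 + [1] * 5)
q = mondrian_thresholds(cal_scores, cal_labels, num_classes=2, alpha=0.2)
sets = predict_sets(np.array([[0.85, 0.6]]), q)
```

Because each class is calibrated on its own score distribution, the per-class coverage guarantee holds for every class, regardless of how the training stage shaped the scores.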

5. Empirical Evaluation

Experiments encompass both balanced and long-tailed benchmarks in vision (MNIST, CIFAR-10, CIFAR-100, ImageNet, and their long-tailed variants with imbalance factor $\gamma$), and text (20 Newsgroups, with bag-of-words or BERT embeddings). Set construction uses THR, APS, or RAPS scores.

Baseline comparisons include cross-entropy, Focal Loss, ConfTr (Stutz et al., 2021), CUT, InfoCTr, and DPSM. Evaluated metrics comprise the average set size $S = \mathbb{E}[|\mathcal{C}(X)|]$, marginal coverage $C$, the class-conditional coverage gap

$$\text{CG} = \frac{1}{K}\sum_{k=1}^{K} \left|\Pr(Y\in\mathcal{C}(X)\mid Y=k) - (1-\alpha)\right|$$

and top-$k$ accuracy.
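The coverage-gap metric can be computed as in this minimal NumPy sketch; the per-sample coverage indicators and labels are hypothetical toy values:

```python
import numpy as np

def coverage_gap(covered, labels, num_classes, alpha=0.1):
    """CG = mean over classes k of |Pr(Y in C(X) | Y = k) - (1 - alpha)|,
    estimated from per-sample indicators covered[i] = 1[y_i in C(x_i)]."""
    gaps = []
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            gaps.append(abs(covered[mask].mean() - (1 - alpha)))
    return float(np.mean(gaps))

# Toy example: each class covered 4 out of 5 times at alpha = 0.1
covered = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
cg = coverage_gap(covered, labels, num_classes=2, alpha=0.1)   # 0.1
```

A CG of zero means every class is covered at exactly the nominal rate $1-\alpha$; averaging absolute deviations penalizes over- and under-coverage symmetrically.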

Key findings:

  • CaCT+ALM consistently yields the smallest average set size at target $\alpha = 0.1$, while maintaining $C \approx 1-\alpha$.
  • Minimal coverage gap: in long-tailed settings (e.g., CIFAR100-LT, ImageNet-LT), CaCT balances marginal and class-conditional coverage where prior methods over- or under-cover tail classes.
  • Robustness to hyperparameters: performance is stable for $\gamma\in\{0.5,0.8,1.0\}$, with sharp smoothing (small $T$) enabling greater efficiency.
  • Ablation findings: the PHR penalty gives the best efficiency and convergence, and CaCT-trained models generalize across $\alpha$ at test time.

Representative results (CIFAR100-LT, $\gamma=0.1$) for THR, $\alpha=0.1$:

| Method       | Set Size | Coverage Gap (CG) |
|--------------|----------|-------------------|
| ConfTr       | ~7.8     | ~6.3%             |
| CUT          | ~4.8     | ~5.1%             |
| InfoCTr-Fano | ~4.8     | ~4.96%            |
| DPSM         | ~4.2     | ~4.91%            |
| CaCT+ALM     | ~3.0     | ~4.40%            |

6. Advantages, Limitations, and Computational Aspects

Advantages:

  • Class-conditional adaptivity: per-class penalties $\lambda_k$ allow explicit allocation of the efficiency budget according to class difficulty and frequency.
  • Scalability: no hand-tuning of $O(K)$ penalty weights is required, making the approach applicable to large $K$ (e.g., ImageNet).
  • Long-tail robustness: Graceful adaptation for minority classes, avoiding under- or over-coverage.
  • Score-agnostic: Compatible with any non-conformity score (THR, APS, RAPS).

Limitations:

  • Implementation complexity: Requires ALM dual/penalty updates and careful penalty scheduling.
  • Training-only smoothing: introduces additional hyperparameters (e.g., the smoothing temperature $T$), and convergence may require monitoring, since ALM's behavior in non-convex deep networks is characterized empirically rather than theoretically.

Hyperparameters:

  • Duals: $\lambda_k^{(0)}\approx 10^{-6}$, $\rho_k^{(0)}\approx 1$.
  • Update rate: $\beta\approx 1.2$; update duals every 5–10 epochs.
  • Target size: $\eta\approx 1$.
  • Smoothing: $T = 0.01$–$0.1$.
  • Learning rate and batch size follow standard deep learning practice.

Computational cost:

  • Inner step (SGD through model and set-size loss): comparable to ConfTr/CUT.
  • Outer dual update: $O(K)$ per update (negligible for $K\le 10^3$).
  • Total wall-time increase: ~10–20% (Marani et al., 14 Jan 2026).

7. Relationship to Prior Conformal Training Methods

CaCT generalizes and subsumes earlier frameworks such as ConfTr (Stutz et al., 2021), which already differentiates through the conformalization process in batch and permits shaping the inefficiency via re-weighting schemes or classification-on-sets losses. However, in ConfTr, adaptation is limited to either manual weighting or fixed matrix designs, whereas CaCT automates the discovery of optimal per-class penalties.

In empirical comparisons, CaCT+ALM achieves lower coverage gaps and smaller set sizes than ConfTr, CUT, DPSM, and Fano-bound-based InfoCTr, especially for long-tailed or imbalanced regimes. The augmented Lagrangian machinery ensures classwise constraint satisfaction efficiently, which is unattainable with scalar penalty-based objectives. The framework operates under the same minimal exchangeability assumption required for validity of conformal predictors at test time.

For a comprehensive technical exposition, refer to (Marani et al., 14 Jan 2026) and (Stutz et al., 2021).
