
Class Adaptive Conformal Training

Updated 21 January 2026
  • Class Adaptive Conformal Training (CaCT) is a framework that integrates conformal prediction with per-class augmented Lagrangian updates to tailor prediction set sizes to class heterogeneity while ensuring coverage guarantees.
  • It employs learned classwise penalties to optimize both efficiency and uniformity, addressing challenges in long-tailed and imbalanced class distributions.
  • Empirical studies show that CaCT achieves smaller prediction sets and minimal coverage gaps compared to global-penalty methods, enhancing robustness and calibration.

Class Adaptive Conformal Training (CaCT) is a framework for uncertainty quantification in deep neural networks that combines conformal prediction theory with an augmented Lagrangian optimization scheme to produce class-conditionally efficient prediction sets while maintaining formal coverage guarantees. CaCT generalizes previous conformal training techniques by explicitly learning per-class penalties and constraints, thereby adapting prediction set sizes to the heterogeneity of class distributions, including in long-tailed regimes, and outperforming global-penalty approaches in efficiency, coverage uniformity, and robustness (Marani et al., 14 Jan 2026).

1. Background and Motivation

Conformal Prediction (CP) is a wrapper algorithm that augments a base classifier with a prediction set $\mathcal{C}(x)\subseteq\{1,\ldots,K\}$ for $K$-class classification, such that for $(X,Y)\sim\mathcal{P}$, the marginal coverage

$$\Pr(Y\in\mathcal{C}(X))\ge 1-\alpha$$

is guaranteed for a user-specified tolerance $\alpha$. When applied post hoc to a trained classifier, CP grants formal calibration but cannot influence the underlying model's representations or class scores to directly optimize prediction set sizes (efficiency) or tailor behavior for specific classes. Conformal training methods such as ConfTr (Stutz et al., 2021) address this by differentiating through the conformalization procedure and penalizing average set size during training.

However, previous methods typically impose a single scalar efficiency penalty (e.g., $\lambda$) shared across all classes. In practice, especially in large-scale or long-tailed settings, class frequencies and difficulties differ substantially; a global constraint often misallocates the efficiency budget, leading to oversized prediction sets for tail classes or accidental undercoverage. CaCT addresses this by introducing classwise constraints and per-class adaptivity, learning a separate penalty $\lambda_k$ for each class $k$ through an augmented Lagrangian approach, thus enabling fine-grained set-size control without distributional assumptions (Marani et al., 14 Jan 2026).

2. Formulation: Augmented Lagrangian for Classwise Set-Size Constraints

Let $\mathcal{D}=\{(X_i, Y_i)\}_{i=1}^n$ with $Y_i\in\{1,\ldots,K\}$, and let $\pi_\theta(x)\in\Delta_K$ be the classifier parameterized by $\theta$. Non-conformity scores $S_\theta(x, y)$ (e.g., THR, APS, RAPS) are employed to compute calibrated thresholds on a held-out set, yielding the prediction set

$$\mathcal{C}_\theta(x) = \{y : S_\theta(x, y) \le \widehat{q}_\theta\}$$

where $\widehat{q}_\theta$ is the empirical quantile. Define the per-class average set size

$$\widehat{d}_k(\theta) = \frac{1}{n_k} \sum_{i : Y_i = k} |\mathcal{C}_\theta(X_i)|$$

for the $n_k$ instances of class $k$. The target is

$$\min_\theta\ \mathcal{L}_{\mathrm{cls}}(\theta) \quad \text{s.t. } \widehat{d}_k(\theta) \le \eta \;\; \forall k$$

for a target size $\eta$.
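As a concrete illustration of the classwise quantities above, the following minimal NumPy sketch computes $\widehat{d}_k$ and the constraint residuals $z_k = \widehat{d}_k - \eta$; the array names and toy values are hypothetical:

```python
import numpy as np

def classwise_set_sizes(set_sizes, labels, num_classes):
    """Per-class average prediction-set size d_k = mean of |C(x_i)| over class k."""
    d = np.zeros(num_classes)
    for k in range(num_classes):
        mask = labels == k
        d[k] = set_sizes[mask].mean() if mask.any() else 0.0
    return d

# Toy example: 6 samples, 3 classes, with |C(x_i)| already computed
sizes = np.array([1, 3, 2, 2, 1, 4], dtype=float)
labels = np.array([0, 0, 1, 1, 2, 2])
d_hat = classwise_set_sizes(sizes, labels, num_classes=3)   # [2.0, 2.0, 2.5]
eta = 1.5
residuals = d_hat - eta   # z_k > 0 means class k violates the size budget
```

A positive residual flags a class whose sets exceed the budget $\eta$ and should therefore attract a larger penalty.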

CaCT introduces dual variables $\lambda_k>0$ and penalty weights $\rho_k>0$ per class. Using the Powell-Hestenes-Rockafellar (PHR) penalty, the augmented Lagrangian objective at iteration $j$ is

$$\mathcal{L}_{\mathrm{CaCT}}^{(j)}(\theta) = \mathcal{L}_{\mathrm{cls}}(\theta) + \sum_{k=1}^K P(z_k(\theta), \lambda_k^{(j)}, \rho_k^{(j)})$$

where $z_k(\theta) = \widehat{d}_k(\theta) - \eta$ and the penalty $P$ is defined by

$$P_{\mathrm{PHR}}(z, \lambda, \rho) = \begin{cases} \lambda z + \frac{1}{2}\rho z^2 & \text{if } \lambda+\rho z\ge 0 \\ -\dfrac{\lambda^2}{2\rho} & \text{otherwise} \end{cases}$$

The dual update

$$\lambda_k^{(j+1)} = \partial_z P\big(z_k(\theta^{(j)}), \lambda_k^{(j)}, \rho_k^{(j)}\big)$$

increases $\lambda_k$ when constraints are violated ($z_k>0$). The penalty parameters $\rho_k^{(j)}$ are increased by a factor $\beta>1$ when constraints stop improving.

This framework ensures class-specific control over set size, automates tuning of efficiency penalties, and is agnostic to the choice of non-conformity score.
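The PHR penalty and its dual update follow directly from the formulas above. The sketch below is a minimal NumPy illustration with hypothetical toy values; in practice the residuals $z_k$ would come from the training loop:

```python
import numpy as np

def phr_penalty(z, lam, rho):
    """PHR penalty for the constraint z <= 0: quadratic where the multiplier
    estimate lam + rho*z is nonnegative, constant (in z) otherwise."""
    active = lam + rho * z >= 0
    return np.where(active, lam * z + 0.5 * rho * z ** 2, -lam ** 2 / (2 * rho))

def dual_update(z, lam, rho):
    """lambda <- dP/dz = max(0, lam + rho*z): grows only for violated
    constraints (z_k > 0), leaving satisfied classes unpenalized."""
    return np.maximum(0.0, lam + rho * z)

z = np.array([0.5, -0.2])        # class 0 violates its budget, class 1 does not
lam = np.array([1e-6, 1e-6])     # initial duals
rho = np.array([1.0, 1.0])       # initial penalty weights
pen = phr_penalty(z, lam, rho)
lam_next = dual_update(z, lam, rho)
```

Note that `dual_update` is exactly $\partial_z P$: on the active branch the derivative is $\lambda + \rho z$, and on the inactive branch the penalty is constant in $z$, so the new multiplier is zero.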

3. Algorithmic Structure

The CaCT optimization consists of an inner (SGD-based model update) and an outer (dual/penalty update) loop.

  1. Initialization: set $\lambda_k \approx 10^{-6}$ and $\rho_k \approx 1$ for all $k$; choose $\beta \approx 1.2$.
  2. For each outer iteration $j=0,\ldots,T-1$:
    • Inner loop:
      • Sample a minibatch $B$ and split it into $B_\text{cal}$ and $B_\text{pred}$.
      • Compute the smooth quantile $\widehat{q}_\theta$ on $B_\text{cal}$ (using sigmoid or NeuralSort relaxations for differentiability).
      • Construct prediction sets (soft during training) on $B_\text{pred}$.
      • Compute the classwise average set sizes $\widehat{d}_k$ on $B_\text{pred}$.
      • Form the total loss

$$\mathcal{L}_{\rm tot} = \mathcal{L}_{\rm cls}(\theta; B_{\rm pred}) + \sum_{k=1}^K P(\widehat{d}_k - \eta, \lambda_k^{(j)}, \rho_k^{(j)})$$

      • Backpropagate and update $\theta$ via SGD.
    • Outer loop (every few epochs):
      • Evaluate $\widehat{d}_k$ on a validation set.
      • Update $\lambda_k$ using the gradient of $P$.
      • If $\widehat{d}_k$ fails to decrease, increase $\rho_k$ by the factor $\beta$.

During training, smoothed surrogate set sizes and quantiles enable gradient-based optimization. At test time, exact (non-differentiable) split conformal prediction is used for final prediction sets.
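The sigmoid smoothing of the set-size indicator can be illustrated as follows. This minimal NumPy sketch relaxes only the indicator $\mathbf{1}[S(x,y) \le \widehat{q}_\theta]$ given a threshold; smoothing the quantile itself (e.g., via NeuralSort) is omitted, and the score matrix and threshold are hypothetical toy values:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_set_sizes(scores, q, T=0.05):
    """Soft prediction-set size: sum_y sigmoid((q - S(x, y)) / T).
    As T -> 0 this approaches the hard count |{y : S(x, y) <= q}|,
    while remaining differentiable in the scores (and hence in theta)."""
    return sigmoid((q - scores) / T).sum(axis=1)

# Toy non-conformity scores: 2 samples x 3 classes, hypothetical threshold
scores = np.array([[0.1, 0.6, 0.9],
                   [0.2, 0.3, 0.8]])
q = 0.5
sizes_soft = soft_set_sizes(scores, q, T=0.01)   # close to the hard sizes
sizes_hard = (scores <= q).sum(axis=1)           # exact counts: [1, 2]
```

With a sharp temperature (small $T$) the soft sizes track the hard counts closely, which is consistent with the reported efficiency gains from sharp smoothing.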

4. Coverage Guarantees and Theoretical Properties

CaCT maintains the formal guarantees of conformal prediction:

  • Marginal coverage holds by construction, since final test-time prediction sets are exact conformal sets calibrated on held-out data:

$$\Pr(Y\in\mathcal{C}(X))\ge 1-\alpha$$

This remains valid under i.i.d. or exchangeability assumptions between calibration and test data.

  • Class-conditional coverage is achieved by applying label-conditional (Mondrian) conformal predictors at test time, for which class-specific thresholds are calibrated. Because CaCT learns per-class set-size behaviors, it supports this extension and ensures:

$$\Pr(Y\in\mathcal{C}(X)\mid Y=k)\ge 1-\alpha \quad \forall k$$

  • Smoothing of indicators and quantiles is used only during training to provide gradients. These approximations are eliminated at test time, so formal properties are identical to standard CP (Marani et al., 14 Jan 2026).
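The exact test-time procedure with label-conditional (Mondrian) calibration can be sketched as follows; this is a minimal NumPy illustration under the standard split-conformal quantile rule, with hypothetical toy scores:

```python
import numpy as np

def mondrian_thresholds(cal_scores, cal_labels, num_classes, alpha=0.1):
    """Per-class (Mondrian) conformal thresholds from held-out calibration data.
    q_k is the ceil((n_k + 1) * (1 - alpha))-th smallest non-conformity score
    S(x_i, y_i) among calibration examples with y_i = k."""
    q = np.full(num_classes, np.inf)   # no calibration data -> always include
    for k in range(num_classes):
        s = np.sort(cal_scores[cal_labels == k])
        n_k = len(s)
        if n_k > 0:
            rank = min(int(np.ceil((n_k + 1) * (1 - alpha))), n_k)
            q[k] = s[rank - 1]
    return q

def predict_sets(test_scores, q):
    """C(x) = {k : S(x, k) <= q_k} with class-specific thresholds."""
    return [set(np.where(row <= q)[0]) for row in test_scores]

# Toy calibration data: class 0 has scores 0.1..1.0, class 1 constant at 0.5
cal_scores = np.concatenate([np.arange(1, 11) / 10.0, np.full(5, 0.5)])
cal_labels = np.array([0] * 10 + [1] * 5)
q = mondrian_thresholds(cal_scores, cal_labels, num_classes=2, alpha=0.2)
sets = predict_sets(np.array([[0.85, 0.6]]), q)
```

Because each class is calibrated on its own score distribution, the per-class coverage guarantee holds for every class, regardless of how the training stage shaped the scores.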

5. Empirical Evaluation

Experiments encompass both balanced and long-tailed benchmarks in vision (MNIST, CIFAR-10, CIFAR-100, ImageNet, and their long-tailed variants with imbalance factor $\gamma$), and text (20 Newsgroups, with bag-of-words or BERT embeddings). Set construction uses THR, APS, or RAPS scores.

Baseline comparisons include cross-entropy, Focal Loss, ConfTr (Stutz et al., 2021), CUT, InfoCTr, and DPSM. Evaluated metrics comprise the average set size $S = \mathbb{E}[|\mathcal{C}(X)|]$, marginal coverage $C$, the class-conditional coverage gap

$$\text{CG} = \frac{1}{K}\sum_{k=1}^{K} \left|\Pr(Y\in\mathcal{C}(X)\mid Y=k) - (1-\alpha)\right|$$

and top-$k$ accuracy.
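The coverage-gap metric can be computed as in this minimal NumPy sketch; the per-sample coverage indicators and labels are hypothetical toy values:

```python
import numpy as np

def coverage_gap(covered, labels, num_classes, alpha=0.1):
    """CG = mean over classes k of |Pr(Y in C(X) | Y = k) - (1 - alpha)|,
    estimated from per-sample indicators covered[i] = 1[y_i in C(x_i)]."""
    gaps = []
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            gaps.append(abs(covered[mask].mean() - (1 - alpha)))
    return float(np.mean(gaps))

# Toy example: each class covered 4 out of 5 times at alpha = 0.1
covered = np.array([1, 1, 1, 0, 1, 1, 0, 1, 1, 1], dtype=float)
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])
cg = coverage_gap(covered, labels, num_classes=2, alpha=0.1)   # 0.1
```

A CG of zero means every class is covered at exactly the nominal rate $1-\alpha$; averaging absolute deviations penalizes over- and under-coverage symmetrically.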

Key findings:

  • CaCT+ALM consistently yields the smallest average set size at target $\alpha = 0.1$, while maintaining $C \approx 1-\alpha$.
  • Minimal coverage gap: in long-tailed settings (e.g., CIFAR100-LT, ImageNet-LT), CaCT balances marginal and class-conditional coverage where prior methods over- or under-cover tail classes.
  • Robustness to hyperparameters: performance is stable for $\gamma\in\{0.5,0.8,1.0\}$, with sharp smoothing (small $T$) enabling greater efficiency.
  • Ablation findings: the PHR penalty gives the best efficiency and convergence, and CaCT-trained models generalize across $\alpha$ at test time.

Representative results (CIFAR100-LT, $\gamma=0.1$) for THR, $\alpha=0.1$:

| Method       | Set Size | Coverage Gap (CG) |
|--------------|----------|-------------------|
| ConfTr       | ~7.8     | ~6.3%             |
| CUT          | ~4.8     | ~5.1%             |
| InfoCTr-Fano | ~4.8     | ~4.96%            |
| DPSM         | ~4.2     | ~4.91%            |
| CaCT+ALM     | ~3.0     | ~4.40%            |

6. Advantages, Limitations, and Computational Aspects

Advantages:

  • Class-conditional adaptivity: per-class penalties $\lambda_k$ allow explicit allocation of the efficiency budget according to class difficulty and frequency.
  • Scalability: no hand-tuning of $O(K)$ penalty weights is required, making the approach applicable to large $K$ (e.g., ImageNet).
  • Long-tail robustness: Graceful adaptation for minority classes, avoiding under- or over-coverage.
  • Score-agnostic: Compatible with any non-conformity score (THR, APS, RAPS).

Limitations:

  • Implementation complexity: Requires ALM dual/penalty updates and careful penalty scheduling.
  • Training-only smoothing: introduces additional hyperparameters (e.g., the smoothing temperature $T$), and convergence may require monitoring, since ALM's behavior in non-convex deep networks is characterized empirically rather than theoretically.

Hyperparameters:

  • Duals: $\lambda_k^{(0)}\approx 10^{-6}$, $\rho_k^{(0)}\approx 1$.
  • Update rate: $\beta\approx 1.2$; update duals every 5–10 epochs.
  • Target size: $\eta\approx 1$.
  • Smoothing: $T = 0.01$–$0.1$.
  • Learning rate and batch size follow standard deep learning practice.

Computational cost:

  • Inner step (SGD through model and set-size loss): comparable to ConfTr/CUT.
  • Outer dual update: $O(K)$ per update (negligible for $K\le 10^3$).
  • Total wall-time increase: ~10–20% (Marani et al., 14 Jan 2026).

7. Relationship to Prior Conformal Training Methods

CaCT generalizes and subsumes earlier frameworks such as ConfTr (Stutz et al., 2021), which already differentiates through the conformalization process in batch and permits shaping the inefficiency via re-weighting schemes or classification-on-sets losses. However, in ConfTr, adaptation is limited to either manual weighting or fixed matrix designs, whereas CaCT automates the discovery of optimal per-class penalties.

In empirical comparisons, CaCT+ALM achieves lower coverage gaps and smaller set sizes than ConfTr, CUT, DPSM, and Fano-bound-based InfoCTr, especially for long-tailed or imbalanced regimes. The augmented Lagrangian machinery ensures classwise constraint satisfaction efficiently, which is unattainable with scalar penalty-based objectives. The framework operates under the same minimal exchangeability assumption required for validity of conformal predictors at test time.

For a comprehensive technical exposition, refer to (Marani et al., 14 Jan 2026) and (Stutz et al., 2021).
