Class Adaptive Conformal Training
- Class Adaptive Conformal Training (CaCT) is a framework that integrates conformal prediction with per-class augmented Lagrangian updates to tailor prediction set sizes to class heterogeneity while ensuring coverage guarantees.
- It employs learned classwise penalties to optimize both efficiency and uniformity, addressing challenges in long-tailed and imbalanced class distributions.
- Empirical studies show that CaCT achieves smaller prediction sets and minimal coverage gaps compared to global-penalty methods, enhancing robustness and calibration.
Class Adaptive Conformal Training (CaCT) is a framework for uncertainty quantification in deep neural networks that combines conformal prediction theory with an augmented Lagrangian optimization scheme to produce class-conditionally efficient prediction sets while maintaining formal coverage guarantees. CaCT generalizes previous conformal training techniques by explicitly learning per-class penalties and constraints, thereby adapting prediction set sizes to the heterogeneity of class distributions, including in long-tailed regimes, and outperforming global-penalty approaches in efficiency, coverage uniformity, and robustness (Marani et al., 14 Jan 2026).
1. Background and Motivation
Conformal Prediction (CP) is a wrapper algorithm that augments a base classifier with a prediction set $C(x) \subseteq \{1, \dots, K\}$ for $K$-class classification, such that for an exchangeable test pair $(X, Y)$, marginal coverage
$$\mathbb{P}\big(Y \in C(X)\big) \ge 1 - \alpha$$
is guaranteed for a user-specified tolerance $\alpha \in (0, 1)$. CP, when applied post-hoc to a trained classifier, grants formal calibration but cannot influence the underlying model’s representations or class scores to directly optimize prediction set sizes (efficiency) or tailor behavior for specific classes. Conformal training methodologies such as ConfTr (Stutz et al., 2021) address this by differentiating through the conformalization procedure and penalizing average set size during training.
However, previous methods typically impose a single scalar efficiency penalty (a global weight $\lambda$) shared across all classes. In practice, especially in large-scale or long-tailed settings, class frequencies and difficulties differ substantially; a global constraint often misallocates the efficiency budget, leading to oversized prediction sets for tail classes or accidental undercoverage. CaCT addresses this by introducing classwise constraints and per-class adaptivity, learning a separate penalty for each class through an augmented Lagrangian approach, thus enabling fine-grained set-size control without distributional assumptions (Marani et al., 14 Jan 2026).
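To make the post-hoc baseline concrete, the following is a minimal sketch of split conformal prediction with a THR-style non-conformity score $s(x, y) = 1 - p_y(x)$; the function name, array interface, and toy data are illustrative assumptions, not taken from the referenced papers.

```python
import numpy as np

def split_conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split conformal prediction with the THR score s(x, y) = 1 - p_y(x)."""
    n = len(y_cal)
    # Non-conformity score of each calibration example's true label.
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # Finite-sample-corrected empirical quantile level.
    level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    q_hat = np.quantile(scores, level, method="higher")
    # A label enters the set when its score clears the threshold.
    return [np.where(1.0 - p <= q_hat)[0] for p in probs_test]
```

Because calibration happens only after training, nothing in this procedure can shrink the sets by reshaping the classifier's scores, which is exactly the gap conformal training targets.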
2. Formulation: Augmented Lagrangian for Classwise Set-Size Constraints
Let $\mathcal{X}$ be the input space, $\mathcal{Y} = \{1, \dots, K\}$ the label space, and let $f_\theta$ be the classifier parameterized by $\theta$. Non-conformity scores $s_\theta(x, y)$ (e.g., THR, APS, RAPS) are employed to compute calibrated thresholds on a held-out set, yielding the prediction set
$$C_\theta(x) = \{\, y \in \mathcal{Y} : s_\theta(x, y) \le \hat{q} \,\},$$
where $\hat{q}$ is the empirical $(1 - \alpha)$ quantile of the calibration scores. Define the per-class average set size
$$S_k(\theta) = \frac{1}{n_k} \sum_{i : y_i = k} |C_\theta(x_i)|$$
for the $n_k$ instances in class $k$. The target is
$$S_k(\theta) \le \kappa \quad \text{for all } k \in \mathcal{Y},$$
for some target size $\kappa$.
CaCT introduces dual variables $\lambda_k \ge 0$ and penalty weights $\rho_k > 0$ per class. Using the Powell–Hestenes–Rockafellar (PHR) penalty, the augmented Lagrangian objective at iteration $t$ is:
$$\mathcal{L}^{(t)}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta) + \sum_{k=1}^{K} P\big(c_k(\theta);\, \lambda_k^{(t)}, \rho_k^{(t)}\big),$$
where $c_k(\theta) = S_k(\theta) - \kappa$ and the penalty is defined by
$$P(c; \lambda, \rho) = \begin{cases} \lambda c + \frac{\rho}{2} c^2, & \text{if } \lambda + \rho c \ge 0, \\ -\frac{\lambda^2}{2\rho}, & \text{otherwise.} \end{cases}$$
The dual update
$$\lambda_k^{(t+1)} = \max\big(0,\; \lambda_k^{(t)} + \rho_k^{(t)} c_k(\theta^{(t)})\big)$$
increases $\lambda_k$ when constraints are violated ($S_k(\theta) > \kappa$). Penalty parameters $\rho_k$ are increased by a factor $\gamma > 1$ when constraints stop improving.
This framework ensures class-specific control over set size, automates tuning of efficiency penalties, and is agnostic to the choice of non-conformity score.
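As a concrete reference, the PHR penalty and its multiplier update can be written in a few lines; the scalar interface and function names here are illustrative, with $c$ standing for a constraint residual such as $S_k(\theta) - \kappa$.

```python
def phr_penalty(c, lam, rho):
    """PHR (Powell-Hestenes-Rockafellar) penalty for the constraint c <= 0.

    Linear-quadratic while the multiplier term is active (lam + rho * c >= 0),
    constant once the constraint is comfortably satisfied, so the penalty
    remains smooth in c.
    """
    if lam + rho * c >= 0.0:
        return lam * c + 0.5 * rho * c ** 2
    return -lam ** 2 / (2.0 * rho)


def dual_update(lam, rho, c):
    """Projected gradient-ascent step on the multiplier."""
    return max(0.0, lam + rho * c)
```

A violated constraint ($c > 0$) both incurs a growing penalty and pushes the multiplier upward at the next outer step, which is how per-class pressure accumulates on persistently oversized classes.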
3. Algorithmic Structure
The CaCT optimization consists of an inner (SGD-based model update) and an outer (dual/penalty update) loop.
- Initialization: Set initial $\lambda_k^{(0)}$ and $\rho_k^{(0)}$ for all $k$; choose the target size $\kappa$.
- For each outer iteration $t$:
- Inner loop:
- Sample a minibatch $B$, split into $B_{\mathrm{cal}}$ and $B_{\mathrm{pred}}$.
- Compute the smooth quantile $\hat{q}$ on $B_{\mathrm{cal}}$ (using sigmoid or NeuralSort relaxations for differentiability).
- Construct prediction sets $C_\theta(x)$ (soft during training) on $B_{\mathrm{pred}}$.
- Compute classwise average set sizes $S_k(\theta)$ on $B_{\mathrm{pred}}$.
- Form the total loss:
$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{CE}}(\theta) + \sum_{k} P\big(S_k(\theta) - \kappa;\, \lambda_k, \rho_k\big).$$
- Backpropagate and update $\theta$ via SGD.
- Outer loop (every few epochs):
- Evaluate $S_k(\theta)$ on a validation set.
- Update duals: $\lambda_k \leftarrow \max\big(0,\; \lambda_k + \rho_k (S_k(\theta) - \kappa)\big)$.
- If the constraint violation fails to decrease, increase $\rho_k$ by the factor $\gamma$.
During training, smoothed surrogate set sizes and quantiles enable gradient-based optimization. At test time, exact (non-differentiable) split conformal prediction is used for final prediction sets.
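A minimal numpy sketch of the training-time relaxation: the hard membership indicator $1[s \le \hat{q}]$ is replaced by a sigmoid, making the surrogate set size differentiable in both the scores and the quantile. The temperature parameter and names are illustrative assumptions.

```python
import numpy as np

def soft_set_sizes(scores, q_hat, temperature=0.1):
    """Differentiable surrogate for the per-example set size |C(x)|.

    scores: (n, K) array of non-conformity scores for every candidate label.
    Each hard indicator 1[s <= q_hat] is relaxed to sigmoid((q_hat - s) / T),
    so the summed soft membership is smooth in scores and q_hat.
    """
    membership = 1.0 / (1.0 + np.exp(-(q_hat - scores) / temperature))
    return membership.sum(axis=1)
```

As the temperature shrinks, the surrogate approaches the exact set size, consistent with the observation that sharp smoothing tends to improve efficiency; at test time the relaxation is discarded entirely.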
4. Coverage Guarantees and Theoretical Properties
CaCT maintains the formal guarantees of conformal prediction:
- Marginal coverage holds by construction, since final test-time prediction sets are exact conformal sets calibrated on held-out data:
$$\mathbb{P}\big(Y_{\mathrm{test}} \in C(X_{\mathrm{test}})\big) \ge 1 - \alpha.$$
This remains valid under i.i.d. or exchangeability assumptions between calibration and test data.
- Class-conditional coverage is achieved by applying label-conditional (Mondrian) conformal predictors at test time, for which class-specific thresholds $\hat{q}_k$ are calibrated. Because CaCT learns per-class set-size behaviors, it supports this extension and ensures:
$$\mathbb{P}\big(Y \in C(X) \mid Y = k\big) \ge 1 - \alpha \quad \text{for each } k \in \mathcal{Y}.$$
- Smoothing of indicators and quantiles is used only during training to provide gradients. These approximations are eliminated at test time, so formal properties are identical to standard CP (Marani et al., 14 Jan 2026).
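The label-conditional (Mondrian) calibration mentioned above computes one threshold per class from that class's own calibration examples; the sketch below uses the THR score, and all names are illustrative assumptions.

```python
import numpy as np

def mondrian_thresholds(probs_cal, y_cal, alpha=0.1):
    """One conformal threshold per class, calibrated within that class only."""
    K = probs_cal.shape[1]
    q_hats = np.ones(K)  # classes without calibration data keep a permissive threshold
    for k in range(K):
        mask = y_cal == k
        n_k = int(mask.sum())
        if n_k == 0:
            continue
        scores = 1.0 - probs_cal[mask, k]
        level = min(np.ceil((n_k + 1) * (1 - alpha)) / n_k, 1.0)
        q_hats[k] = np.quantile(scores, level, method="higher")
    return q_hats

def mondrian_sets(probs_test, q_hats):
    # Label y enters the set when its score clears its own class threshold.
    return [np.where(1.0 - p <= q_hats)[0] for p in probs_test]
```

Class-conditional validity then follows from exchangeability within each class, at the cost of needing enough calibration examples per class, which is precisely the regime where per-class set-size control matters most.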
5. Empirical Evaluation
Experiments encompass both balanced and long-tailed benchmarks in vision (MNIST, CIFAR-10, CIFAR-100, ImageNet, and their long-tailed variants with a fixed imbalance factor), and text (20 Newsgroups, with bag-of-words or BERT embeddings). Set construction uses THR, APS, or RAPS scores.
Baseline comparisons include cross-entropy, Focal Loss, ConfTr (Stutz et al., 2021), CUT, InfoCTr, and DPSM. Evaluated metrics comprise the average set size $\mathbb{E}[|C(X)|]$, marginal coverage $\mathbb{P}(Y \in C(X))$, the class-conditional coverage gap
$$\mathrm{CG} = \frac{1}{K} \sum_{k=1}^{K} \big| \mathbb{P}(Y \in C(X) \mid Y = k) - (1 - \alpha) \big|,$$
and top-1 accuracy.
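These quantities can be computed directly from predicted sets; the sketch below assumes the coverage gap is the mean absolute deviation of per-class coverage from the nominal level, and all names are illustrative.

```python
import numpy as np

def cp_metrics(pred_sets, y_true, alpha):
    """Average set size, marginal coverage, and class-conditional coverage gap."""
    covered = np.array([y in s for s, y in zip(pred_sets, y_true)])
    avg_size = float(np.mean([len(s) for s in pred_sets]))
    marginal = float(covered.mean())
    # Mean absolute deviation of per-class coverage from 1 - alpha.
    gaps = [abs(covered[y_true == k].mean() - (1.0 - alpha))
            for k in np.unique(y_true)]
    return avg_size, marginal, float(np.mean(gaps))
```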
Key findings:
- CaCT+ALM consistently yields the smallest average set size at the target coverage level, while maintaining marginal coverage at or above $1 - \alpha$.
- Minimal coverage gap: In long-tailed settings (e.g., CIFAR100-LT, ImageNet-LT), CaCT balances marginal and class-conditional coverage where prior methods over- or under-cover tail classes.
- Robustness to hyperparameters: Performance is stable across a range of target sizes and ALM settings, with sharp smoothing (a small temperature) enabling greater efficiency.
- Ablation findings: The PHR penalty gives the best efficiency and convergence. CaCT-trained models generalize across miscoverage levels $\alpha$ at test time.
Representative results (CIFAR100-LT) with the THR score:
| Method | Set Size | Coverage Gap (CG) |
|---|---|---|
| ConfTr | ~7.8 | ~6.3% |
| CUT | ~4.8 | ~5.1% |
| InfoCTr-Fano | ~4.8 | ~4.96% |
| DPSM | ~4.2 | ~4.91% |
| CaCT+ALM | ~3.0 | ~4.40% |
6. Advantages, Limitations, and Computational Aspects
Advantages:
- Class-conditional adaptivity: Per-class penalties allow explicit allocation of the efficiency budget according to class difficulty and frequency.
- Scalability: No hand-tuning of penalty weights required; applicable to large numbers of classes $K$ (e.g., ImageNet).
- Long-tail robustness: Graceful adaptation for minority classes, avoiding under- or over-coverage.
- Score-agnostic: Compatible with any non-conformity score (THR, APS, RAPS).
Limitations:
- Implementation complexity: Requires ALM dual/penalty updates and careful penalty scheduling.
- Training-only smoothing: Additional hyperparameters (e.g., the smoothing temperature), and a possible need for monitoring convergence, as ALM’s properties in non-convex deep networks remain empirical.
Hyperparameters:
- Duals: initial values of $\lambda_k$ and $\rho_k$ per class.
- Update rate: penalty growth factor $\gamma$; update duals every 5–10 epochs.
- Target size: $\kappa$.
- Smoothing: temperature on the order of $0.1$ or below.
- Learning rate, batch size as per deep learning norms.
Computational cost:
- Inner step (SGD through model and set-size loss): comparable to ConfTr/CUT.
- Outer dual update: $O(K)$ per update (negligible for typical class counts).
- Total wall-time increase: 10–20% (Marani et al., 14 Jan 2026).
7. Relationship to Prior Conformal Training Methods
CaCT generalizes and subsumes earlier frameworks such as ConfTr (Stutz et al., 2021), which already differentiates through the conformalization process in batch and permits shaping the inefficiency via re-weighting schemes or classification-on-sets losses. However, in ConfTr, adaptation is limited to either manual weighting or fixed matrix designs, whereas CaCT automates the discovery of optimal per-class penalties.
In empirical comparisons, CaCT+ALM achieves lower coverage gaps and smaller set sizes than ConfTr, CUT, DPSM, and Fano-bound-based InfoCTr, especially for long-tailed or imbalanced regimes. The augmented Lagrangian machinery ensures classwise constraint satisfaction efficiently, which is unattainable with scalar penalty-based objectives. The framework operates under the same minimal exchangeability assumption required for validity of conformal predictors at test time.
For a comprehensive technical exposition, refer to (Marani et al., 14 Jan 2026) and (Stutz et al., 2021).