
Conformal Prediction Sets

Updated 11 February 2026
  • Conformal prediction sets are defined as set-valued predictors that guarantee the true label is included with a specified risk level, ensuring finite-sample coverage without assuming a specific data distribution.
  • They employ nonconformity scores derived from calibration data, using methods like split conformal and RAPS to balance prediction set size and coverage efficiency.
  • Extensions of these methods address structured outputs and challenging regimes, with ongoing research focused on computational scalability, conditional coverage, and robust adaptation in varied applications.

Conformal prediction sets are set-valued predictors constructed to provide reliable uncertainty quantification in supervised learning. For any feature-label pair (X, Y) drawn from an unknown distribution and a set-valued function C(·), the method guarantees that the true label falls in the predicted set C(X) with prescribed probability. This marginal finite-sample coverage property holds without assuming any specific form of the data distribution, making conformal prediction a robust and widely applicable tool for validation and decision support in machine learning.

1. Formal Foundations and Construction

Conformal prediction operates in general input and label spaces: commonly, X ⊂ ℝ^D and Y = {1, …, M} for classification. The fundamental property is the marginal coverage guarantee: for a user-specified risk level α, the conformal set C(X) satisfies

P{Y ∈ C(X)} ≥ 1 − α,

where the probability is over all random draws from the joint data-generating distribution (Cresswell et al., 2024).

Conformal prediction sets are typically constructed by defining a nonconformity score function s(x, y), with the interpretation that higher scores indicate greater disagreement between y and the label expected at x. Given a pre-trained model (e.g., a softmax classifier f), a popular choice in classification is

s(x, y) = 1 − f(x)_y,

where f(x)_y is the model's predicted probability for class y.

The split-conformal method for classification proceeds as follows:

  1. Calibration: Compute the scores s_i = s(x_i, y_i) for calibration data {(x_i, y_i)}_{i=1}^n.
  2. Quantile thresholding: Set the empirical quantile

q̂ = the ⌈(1 − α)(n + 1)⌉/n quantile of {s_1, …, s_n}.

  3. Prediction: For a new x, define the conformal set

C_q̂(x) = {y ∈ Y : s(x, y) ≤ q̂}.

This construction achieves the target marginal coverage under exchangeability of calibration and test samples (Cresswell et al., 2024). Extensions such as RAPS (Regularized Adaptive Prediction Sets) adjust the nonconformity scores to further reduce average set size by incorporating rank-based regularization and tie-breaking (Cresswell et al., 2024).
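The three-step procedure above can be sketched in a few lines of NumPy, using the score s(x, y) = 1 − f(x)_y. This is a minimal illustration, not the reference implementation of any cited paper; the function and variable names are ours:

```python
import numpy as np

def split_conformal_sets(probs_cal, y_cal, probs_test, alpha=0.1):
    """Split-conformal prediction sets for classification.

    probs_cal : (n, M) predicted class probabilities on calibration data
    y_cal     : (n,) true calibration labels
    probs_test: (m, M) predicted class probabilities on test inputs
    Returns one array of candidate labels per test input.
    """
    n = len(y_cal)
    # 1. Calibration: nonconformity score s(x, y) = 1 - f(x)_y
    scores = 1.0 - probs_cal[np.arange(n), y_cal]
    # 2. Quantile thresholding at the ceil((1 - alpha)(n + 1)) / n level
    level = min(np.ceil((1 - alpha) * (n + 1)) / n, 1.0)
    q_hat = np.quantile(scores, level, method="higher")
    # 3. Prediction: keep every label whose score is within the threshold
    test_scores = 1.0 - probs_test
    return [np.flatnonzero(row <= q_hat) for row in test_scores]
```

Under exchangeability of the calibration and test draws, the fraction of test points whose set contains the true label concentrates near 1 − α.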

2. Efficiency, Score Design, and Set Size

A conformal prediction method is judged not only by its coverage but also by the informativeness (smallness) of the resulting sets. Efficiency is typically measured via set cardinality or interval width (regression), and there is a trade-off between tightness and coverage.

Expected set size under split-conformal prediction has been precisely characterized by

E[|C_α(X_{n+1})|] = ∫_R P_{Bin(n, P̃_R(r))}(n_α) R(r) dr,

where R(r) measures the pre-image measure density of the nonconformity scores and P̃_R(r) denotes the score CDF (Dhillon et al., 2023). Empirical point and interval estimates for the average set size can be straightforwardly computed using held-out data, enabling practitioners to plan and choose models non-asymptotically.
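The held-out point and interval estimate mentioned above can be computed with a simple normal approximation. This is an illustrative sketch with a hypothetical function name, not the exact finite-sample characterization of Dhillon et al.:

```python
import numpy as np
from statistics import NormalDist

def set_size_estimate(sizes, delta=0.05):
    """Point estimate and normal-approximation (1 - delta) confidence
    interval for the expected prediction-set size, computed from the
    set sizes observed on held-out data.
    """
    sizes = np.asarray(sizes, dtype=float)
    mean = sizes.mean()
    se = sizes.std(ddof=1) / np.sqrt(len(sizes))
    z = NormalDist().inv_cdf(1 - delta / 2)  # ~1.96 for delta = 0.05
    return mean, (mean - z * se, mean + z * se)
```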

Key levers for practical set reduction (efficiency) include:

  • Careful selection of nonconformity scores (e.g., APS, RAPS, SAPS for classification (Cresswell et al., 2024, Huang et al., 2023)).
  • Regularization or adaptivity, such as in RAPS, to penalize unnecessary inclusion of labels in ambiguous inputs (Cresswell et al., 2024).
  • Optimization-based or constrained ERM approaches, where prediction set parameters are learned to minimize set size under empirical coverage constraints (Bai et al., 2022).
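To make the regularization lever concrete, here is a sketch of a RAPS-style score: the cumulative mass of higher-ranked classes, plus a randomized share of the label's own mass, plus a rank penalty λ · max(0, rank − k_reg). Function names and details such as tie-breaking are illustrative and may differ from specific published implementations:

```python
import numpy as np

def raps_scores(probs, labels, lam=0.01, k_reg=5, rng=None):
    """RAPS-style nonconformity scores (illustrative sketch).

    probs : (n, M) predicted class probabilities
    labels: (n,) candidate labels to score
    """
    rng = np.random.default_rng(rng)
    n = len(labels)
    order = np.argsort(-probs, axis=1)  # order[i, j] = class with j-th largest prob
    ranks = np.argsort(order, axis=1)   # ranks[i, c] = 0-based rank of class c
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1) - sorted_p  # mass strictly above each rank
    r = ranks[np.arange(n), labels]               # rank of the scored label
    u = rng.uniform(size=n)                       # tie-breaking randomization
    base = cum[np.arange(n), r] + u * probs[np.arange(n), labels]
    penalty = lam * np.maximum(0, (r + 1) - k_reg)  # rank regularization
    return base + penalty
```

The penalty inflates scores of low-ranked labels, discouraging their inclusion and shrinking average set size on ambiguous inputs.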

3. Extensions: Structured Outputs and Challenging Regimes

Conformal prediction sets generalize to a variety of problem structures, including hierarchical and ordinal classification, open-set and imbalanced regimes, and multi-label tasks.

  • Hierarchical classification: Prediction sets can be restricted to internal nodes of the hierarchy or constrained to have bounded representation complexity (at most r) via dynamic programming, attaining valid coverage with interpretable, structure-aware outputs (Mortier et al., 31 Jan 2025).
  • Open-set/imbalanced classification: Good–Turing-conformal p-values allow prediction with coverage guarantees even when novel classes appear at test time or labels are severely imbalanced. Selective sample splitting and reweighting restore validity when exchangeability is broken by rare classes (Xie et al., 14 Oct 2025).
  • Ordinal output: Multiple testing procedures on conformal p-values (forward/backward sequential or Bonferroni) provide contiguous and non-contiguous prediction sets with class-conditional or marginal guarantees (Chakraborty et al., 2024).
  • Multi-label and false positive control: By enforcing set-wise constraints (e.g., limiting expected or high-probability false positives), conformal prediction can respect application-specific precision requirements rather than pure coverage (Fisch et al., 2022).
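Several of these extensions operate on conformal p-values rather than a single quantile threshold. A minimal sketch of the split-conformal p-value and its equivalent thresholding set (function names are illustrative; the cited methods add multiple-testing and reweighting machinery on top of this primitive):

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    """Split-conformal p-value: (1 + #{i : s_i >= s}) / (n + 1),
    where larger scores mean 'more nonconforming'."""
    cal_scores = np.asarray(cal_scores)
    return (1 + np.sum(cal_scores >= test_score)) / (len(cal_scores) + 1)

def pvalue_prediction_set(cal_scores, label_scores, alpha=0.1):
    """Keep every candidate label whose p-value exceeds alpha."""
    return [y for y, s in enumerate(label_scores)
            if conformal_pvalue(cal_scores, s) > alpha]
```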

4. Aggregation, Selection, and Robustness

Efficiency can be significantly improved by aggregating multiple conformal predictors or by robustifying conformal sets against adversarial manipulation:

  • Aggregation strategies: α-allocation (COLA) selects risk allocations across multiple nonconformity scores to minimize the intersection set size while guaranteeing marginal (or, asymptotically, conditional) coverage. Sample splitting or full conformalization maintains finite-sample validity (Xu et al., 15 Nov 2025).
  • Model selection: Naive selection of the smallest set from a family of conformal sets risks coverage violation. Stability-based selection (MinSE) or post-selection recalibration permits valid set choice with controlled efficiency loss (Hegazy et al., 25 Jun 2025, Yang et al., 2021).
  • Robust sets: CAS (CDF-Aware Smoothed prediction Sets) defend against evasion and calibration poisoning by certifying coverage under worst-case input and label perturbations (Zargarbashi et al., 2024). These methods provide tighter and more efficient robust sets compared to earlier randomized smoothing-based conformal predictors, with formal guarantees for both continuous and discrete data.

5. Practical Deployment and Human-AI Interaction

Conformal prediction sets have substantial value in operational systems:

  • Communicating uncertainty: Variable-size and singleton prediction sets guide users’ interpretation of model confidence and have been shown to improve human-in-the-loop decision quality in controlled experiments. Calibration and visibility of coverage guarantees enhance human trust and accuracy (Cresswell et al., 2024).
  • Parameter tuning: Application-specific needs dictate the choice of the marginal risk level α, target average set size, and any false positive/negative trade-offs. Hyperparameters in adaptive methods (e.g., RAPS' λ and k_reg) are tuned specifically to balance set size and coverage (Cresswell et al., 2024).
  • Guidelines: Clearly present the coverage guarantee, expose set-size fluctuation to facilitate calibrated trust, and use adaptive or robustified sets to address varying input difficulty or distributional shift (Cresswell et al., 2024).

6. Limitations, Open Problems, and Future Directions

Several limitations and ongoing research directions are evident:

  • Conditional coverage: Finite-sample, distribution-free conditional coverage is provably impossible; only marginal or approximate instance/subgroup-level guarantees can be provided using methods like trust scores or adaptive quantiles (Kaur et al., 17 Jan 2025).
  • Subpopulation reliability: Marginal guarantees may obscure systematic undercoverage in specific strata (e.g., difficult subgroups or challenging classes), often manifesting as systematically larger or inadequate sets (Cresswell et al., 2024).
  • Model dependence: Prediction sets can inherit and reinforce model errors or biases, especially when the underlying classifier poorly estimates uncertainty for certain regimes (Cresswell et al., 2024).
  • Computational scalability: Full (non-split) conformal algorithms can be computationally prohibitive, particularly in regression or under model instability, motivating root-finding or stability-based approximations for practical use (Ndiaye, 2021, Ndiaye et al., 2021).
  • Structured outputs and unbounded label spaces: Recent advances handle open-label and structured tasks but further methodological development is required for domains such as deep generative models, where the output space is combinatorial or continuous (Shahrokhi et al., 13 Mar 2025).

The field continues to evolve, with ongoing work on adaptive and optimal set selection, robustness, individualized allocation, integration with evidential deep learning, and deployment in large-scale, safety-critical systems.
