Counterfactual Explanations in XAI
- Counterfactual Explanations (CFEs) are post-hoc interpretability tools that identify the minimal, actionable feature changes required to shift a model's prediction, emphasizing proximity, sparsity, and validity.
- They employ various constrained optimization methods—such as black-box sampling, mixed-integer programming, SAT-based enumeration, and reinforcement learning—to generate diverse and robust recourse options.
- CFEs are applied in high-stakes domains like finance, healthcare, and robotics, where integrating user-specific constraints and ensuring fairness are critical for practical decision-making.
Counterfactual Explanations (CFEs) are a class of post-hoc model interpretability tools that specify minimal, concrete feature changes which—if enacted—would change a model’s prediction to a desired outcome. In contrast to feature-attribution or rule-based explanations, CFEs provide actionable, contrastive, and minimally-invasive modifications directly aligned with decision boundaries. Historically rooted in the logic of “what-if” reasoning, CFEs now play a central role in XAI for high-stakes domains including finance, healthcare, robotics, and recourse policy.
1. Formalization and Evaluation Criteria
Mathematically, a counterfactual explanation for an input $x$ and classifier $f$ is an instance $x'$ that is as close as possible to $x$ (measured via a norm or cost function) such that $f$ outputs the desired class or label $y'$. Standard formulations include:

$$x' = \arg\min_{z} \; d(x, z) \quad \text{s.t.} \quad f(z) = y',$$

where $d$ is typically an $\ell_1$, $\ell_2$, or weighted $\ell_p$ norm (Verma et al., 2021, Mohammadi et al., 2020).
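As an illustration, this constrained objective is commonly relaxed into a penalized loss and minimized by gradient descent when the model is differentiable. The sketch below does this for a toy logistic classifier; the model, constants, and names (`w`, `b`, `lam`) are invented for the example and do not come from any cited method.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def find_cfe(x, w, b, target=1.0, lam=0.1, lr=0.1, steps=500):
    """Minimize lam * ||x' - x||^2 + (f(x') - target)^2 by gradient descent."""
    xp = x.copy()
    for _ in range(steps):
        p = sigmoid(w @ xp + b)
        # gradient of the penalized loss with respect to the candidate xp
        grad = 2.0 * lam * (xp - x) + 2.0 * (p - target) * p * (1.0 - p) * w
        xp -= lr * grad
    return xp

w, b = np.array([1.5, -2.0]), -0.5   # toy logistic model f(x) = sigmoid(w.x + b)
x = np.array([0.0, 1.0])             # originally classified negative
x_cf = find_cfe(x, w, b)
print(sigmoid(w @ x + b), sigmoid(w @ x_cf + b))  # prediction crosses 0.5
```

The penalty weight `lam` trades proximity against validity, playing the role of the $d(x, z)$ term in the constrained formulation.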
Common CFE evaluation metrics include:
- Proximity: $d(x, x')$—how small the total change is.
- Sparsity: Number of features that change, i.e., $\|x' - x\|_0$ (Choudhury et al., 20 Jul 2025).
- Validity: $f(x') = y'$; required for success.
- Plausibility: $x'$ lies within high-density regions of the data or obeys domain constraints (Wielopolski et al., 2024, Verma et al., 2021).
- Actionability: Only mutable features are allowed to change; immutable/causal constraints must be respected (Mastromichalakis et al., 2024).
- Diversity: Return multiple, diverse CFEs (Mohammadi et al., 2020, Wang et al., 2023).
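The first three metrics can be sketched directly; the stand-in classifier and all values below are invented for illustration.

```python
import numpy as np

def proximity(x, x_cf, ord=1):
    """Distance between the original input and its counterfactual."""
    return float(np.linalg.norm(x_cf - x, ord=ord))

def sparsity(x, x_cf, tol=1e-8):
    """Number of features the counterfactual changes."""
    return int(np.sum(np.abs(x_cf - x) > tol))

def validity(model, x_cf, desired):
    """Does the model actually output the desired label on x_cf?"""
    return model(x_cf) == desired

model = lambda v: int(v.sum() > 3)   # stand-in black-box classifier
x    = np.array([1.0, 1.0, 0.5])     # predicted 0 (feature sum 2.5)
x_cf = np.array([1.0, 2.8, 0.5])     # a single feature raised

print(proximity(x, x_cf), sparsity(x, x_cf), validity(model, x_cf, 1))
```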
Historically, most approaches have optimized proximity and sparsity (Mohammadi et al., 2020, Verma et al., 2021). User studies reveal that these surrogate objectives often miss key aspects of human effort and feasibility: only 63.81% of user choices matched proximity-optimal CFEs, and the match rate fell further when global feature weights were assumed (Choudhury et al., 20 Jul 2025).
2. User Preferences and Adaptive Metrics
Empirical research demonstrates significant discrepancies between classical CFE objectives and actual user decision-making. In a two-phase user study, participants exposed to CFEs for loan recourse exhibited:
- Individualized effort weights: Users assigned heterogeneous, idiosyncratic costs to different features.
- Hard acceptability thresholds: Many participants imposed per-feature cutoffs $\tau_i$, rejecting CFEs requiring changes above these bounds regardless of proximity (Choudhury et al., 20 Jul 2025).
A two-stage preference model, AWP (Acceptability × Weighted Proximity), operationalizes this as:
- Filter candidates $x'$ by acceptability ($|x'_i - x_i| \le \tau_i$ for all features $i$).
- Among feasible candidates, select $\arg\min_{x'} \sum_i w_i \, |x'_i - x_i|$.
AWP achieves 84.4% predictive accuracy for user choice—an absolute 20-point gain over proximity-based models—demonstrating the necessity of adaptive, user-centric metrics (Choudhury et al., 20 Jul 2025).
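The two AWP stages can be sketched as follows; the user weights, thresholds, and candidate CFEs below are invented for illustration and do not come from the cited study.

```python
import numpy as np

def awp_choice(x, candidates, weights, thresholds):
    """Stage 1: drop candidates whose per-feature change exceeds the user's
    threshold. Stage 2: among feasible candidates, pick the one with minimal
    effort-weighted L1 distance from x."""
    feasible = [c for c in candidates
                if np.all(np.abs(c - x) <= thresholds)]
    if not feasible:
        return None
    return min(feasible, key=lambda c: float(weights @ np.abs(c - x)))

x = np.array([30_000.0, 2.0])            # e.g. income, open credit lines
thresholds = np.array([10_000.0, 1.0])   # user rejects larger changes outright
weights = np.array([1.0, 5_000.0])       # per-unit income change is "cheaper"

candidates = [
    np.array([45_000.0, 2.0]),  # over the income threshold: filtered out
    np.array([38_000.0, 2.0]),  # feasible, weighted cost 8000
    np.array([30_000.0, 3.0]),  # feasible, weighted cost 5000
]
print(awp_choice(x, candidates, weights, thresholds))  # lowest-cost feasible CFE
```

Note that the raw-proximity winner (the first candidate in income terms, or the second by total dollars moved) is not what the two-stage model selects.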
3. Algorithmic Design: Generation and Constraints
CFEs are computed via constrained optimization, with techniques depending on data type and model access:
- Black-box optimization: Bayesian Optimization or sampling when gradients are unavailable; e.g., ACE (Adaptive sampling for Counterfactual Explanations) employs Gaussian Process surrogates and expected improvement for efficient query selection (Guerrero et al., 30 Sep 2025).
- Mixed-Integer Programming: Provides globally optimal, nearest CFEs with coverage and runtime guarantees; scales to deep ReLU networks (Mohammadi et al., 2020).
- SAT-based enumeration: For minimal feature-set perturbations; CEMSP (Counterfactual Explanations with Minimal Satisfiable Perturbations) leverages SAT solvers to find all minimal, robust CFEs while integrating actionability, causality, and domain knowledge via propositional constraints (Wang et al., 2023).
- Probabilistic plausibility: PPCEF employs normalizing flows to ensure ; CFEs are optimized using composite losses that penalize distance, plausibility violation, and prediction failure, supporting batch computation (Wielopolski et al., 2024).
- Reinforcement Learning: For high-dimensional or sequential data (e.g., time series, robotics), RL agents search for feasible, sparse CFEs while respecting user-imposed constraints (Sun et al., 2024, Remman et al., 11 May 2025).
In all settings, additional penalties or constraints can address actionability, plausibility, feature sparsity, and fairness (Wielopolski et al., 2024, Verma et al., 2021).
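As a minimal example of the black-box setting, a naive sampling baseline (far simpler than ACE's Gaussian-Process-guided search) draws random perturbations around $x$ and keeps the closest valid one. The classifier and constants are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_cfe(predict, x, desired, sigma=1.0, n=2000):
    """Return the valid sampled perturbation of x closest to x, or None."""
    samples = x + rng.normal(scale=sigma, size=(n, x.size))
    mask = np.array([predict(s) == desired for s in samples])
    valid = samples[mask]
    if valid.size == 0:
        return None
    return valid[np.argmin(np.linalg.norm(valid - x, axis=1))]

predict = lambda v: int(v[0] + v[1] > 2)   # opaque stand-in classifier
x = np.array([0.0, 0.0])                   # currently class 0
x_cf = sample_cfe(predict, x, desired=1)
print(x_cf)  # a point just across the decision boundary x0 + x1 = 2
```

Methods such as ACE replace the blind Gaussian proposal with surrogate-guided query selection, which matters when each `predict` call is expensive.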
4. Robustness, Stability, and Real-World Usage
Robustness of CFEs to model changes is a critical concern in dynamic deployments. BetaRCE provides the first model-agnostic method for post-hoc robustness guarantees: for a given CFE $x'$, it estimates the probability that $x'$ remains valid over admissible model perturbations, delivering a Bayesian credible interval for the guarantee (Stępka et al., 2024). Empirical studies confirm that BetaRCE can maintain target robustness with minimal additional perturbation beyond the base solution.
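The Beta-posterior idea behind such guarantees can be sketched as follows. The perturbation scheme here (Gaussian noise on a linear model's parameters) and all constants are assumptions for illustration, not BetaRCE's actual admissible-change model.

```python
import numpy as np

rng = np.random.default_rng(1)

def robustness_interval(x_cf, desired, base_w, base_b, n_models=200,
                        noise=0.05, cred=0.90):
    """Count how often x_cf stays valid under sampled model perturbations,
    then form a Beta-posterior credible interval for P(valid)."""
    k = 0
    for _ in range(n_models):
        w = base_w + rng.normal(scale=noise, size=base_w.size)
        b = base_b + rng.normal(scale=noise)
        k += int((w @ x_cf + b > 0) == (desired == 1))
    # Beta(1,1) prior -> Beta(1+k, 1+n-k) posterior over the robustness rate
    post = rng.beta(1 + k, 1 + n_models - k, size=50_000)
    lo, hi = np.percentile(post, [(1 - cred) / 2 * 100, (1 + cred) / 2 * 100])
    return k / n_models, (lo, hi)

base_w, base_b = np.array([1.0, 1.0]), -2.0
x_cf = np.array([1.6, 1.6])   # sits comfortably inside the desired class
est, (lo, hi) = robustness_interval(x_cf, desired=1,
                                    base_w=base_w, base_b=base_b)
print(f"point estimate {est:.2f}, {cred_str}" if False else (est, lo, hi))
```

A CFE lying deep in the desired region yields an interval concentrated near 1; a CFE hugging the decision boundary would produce a visibly lower interval, flagging fragility under model updates.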
For practical actionability, current research underscores the need to:
- Personalize cost and feasibility constraints to individual users (Choudhury et al., 20 Jul 2025).
- Amortize computation of CFEs for large-scale or real-time environments, using learned policies or batched Gaussian Process surrogates (Spooner et al., 2021, Naggita et al., 2024, Sun et al., 2024).
- Ensure stability under input or model changes: minimal sets of CFEs generated via CEMSP or globally optimal search methods are both robust and diverse (Wang et al., 2023).
- Address multivariate or sequence data: RL-based CFWoT finds parsimonious, feasible CFEs for multivariate time-series with no training data (Sun et al., 2024).
5. Application Domains and Class-Specific Innovations
CFEs are applied in diverse domains, each with unique requirements:
- Financial recourse: CFEs offer actionable steps for loan approvals, with adaptive-user models outperforming traditional metrics (Choudhury et al., 20 Jul 2025).
- Clinical intervention: SenseCF uses fine-tuned LLMs (LLaMA-3.1-8B) to generate valid ($0.99$), plausible ($0.99$), sparse (avg. $1.8$ features), and semantically coherent interventions for stress prediction and sensor-based digital health (Soumma et al., 21 Jan 2026).
- Data augmentation: LLM-generated CFEs help correct class imbalance, restoring up to F1 loss in label-scarce settings (Soumma et al., 21 Jan 2026).
- Robotics: Realistic geometric CFEs for 2D LiDAR are found by searching over parameterized shape spaces with genetic algorithms, yielding physically plausible scans interpretable by roboticists and aligning with end-user queries (Remman et al., 11 May 2025).
- Video classification: BTTF produces temporally coherent, plausible video CFEs using a diffusion-model–based latent search, satisfying semantic and spatiotemporal minimality (Wang et al., 25 Nov 2025).
- Regression models: Globally convergent Bayesian Optimization with differentiable output potentials extends CFE search to regression tasks, handling sparsity and actionable constraints with theoretical complexity results (Spooner et al., 2021).
6. Cognitive, Causal, and Social Considerations
User studies reveal that the psychological plausibility of minimal, “closest” CFEs often aligns better with human reasoning than data-manifold or “computationally plausible” constraints, which may degrade comprehension and learning outcomes (Kuhl et al., 2022). Further, upward-directed CFEs promote learning and explicit knowledge formation more strongly than downward or mixed-direction CFEs; these effects tie directly to regulatory fit theory and task alignment (Kuhl et al., 2023).
A prominent risk is the tendency of lay users to attribute causal meaning to CFEs generated from purely statistical models; simple interventions (e.g., explicit “correlation ≠ causation” warnings) are effective at mitigating such misperceptions (Tesic et al., 2022).
Under iterated partial fulfillment (IPF), in which users enact recourse gradually over repeated applications, some CFE algorithms are IPF-stable, ensuring no adverse cost is incurred; others (notably non-optimal or randomized search) can lead to oscillations and unbounded cost, with negative fairness implications (Zhou, 2023).
7. Limitations, Open Challenges, and Future Directions
Challenges for real-world CFE deployment include:
- Personalization: Robust, scalable elicitation and integration of user-specific cost and feasibility models (Choudhury et al., 20 Jul 2025).
- Model and data shift: Post-hoc robustness (e.g., BetaRCE), privacy-preserving recourse, and dynamic updating under continual learning (Stępka et al., 2024, Verma et al., 2021).
- Causality: Lack of structural causal models often precludes guarantees on actionability; hybrid approaches or elicitation of partial constraints are active areas (Verma et al., 2021).
- Fairness and group disparities: Ensuring equal access to recourse, equal cost, and robustness across demographic groups (Verma et al., 2021, Zhou, 2023).
- Evaluation: Adoption of application-aligned, user-validated metrics rather than one-size-fits-all surrogates (Choudhury et al., 20 Jul 2025, Mastromichalakis et al., 2024).
Recent advances suggest that the future of CFEs lies in adaptive, multi-level frameworks—capable of flexibly addressing diverse user objectives, deploying in time-varying environments, and supporting both rigorous theoretical guarantees and psychologically valid, actionable feedback.