FESTA: Trust Assessment via Equivalent Sampling
- The paper introduces FESTA, a framework leveraging functionally equivalent and complementary sampling for uncertainty quantification in multimodal models.
- FESTA employs black-box perturbations along consistency and sensitivity axes, using KL divergence to measure prediction stability under equivalent inputs and responsiveness to complementary ones.
- Empirical results show significant AUROC improvements in misprediction detection, offering a practical tool for selective prediction in safety-critical applications.
Functionally Equivalent Sampling for Trust Assessment (FESTA) is a principled framework for evaluating the reliability and uncertainty of predictive models, particularly multimodal LLMs (MLLMs). As presented in recent literature (Bhattacharya et al., 20 Sep 2025), FESTA enables selective prediction and improved user confidence in settings where diverse input modalities pose substantial challenges for conventional uncertainty quantification. The methodology leverages post-hoc, black-box sampling along two orthogonal axes: consistency via functionally equivalent sampling and sensitivity via functionally complementary sampling. Together these yield an uncertainty score that correlates with the trustworthiness of the model's output. FESTA requires only input-output access and does not assume availability of ground-truth annotations.
1. Framework and Problem Definition
FESTA is designed for precise uncertainty estimation in MLLMs—models ingesting multimodal inputs such as images, audio, and text. The core objective is to assign a trust or uncertainty measure to model outputs without relying on ground-truth labels, thus facilitating selective prediction and abstention in low-confidence cases. The process is unsupervised and black-box: it requires only the capability to perturb the input and observe the model’s output.
For a model $f$ and input $x$, FESTA constructs two sampling sets:
- Functionally Equivalent Samples (FES): Inputs $x'$ constructed such that the task remains unchanged, i.e., $\mathcal{T}(x') = \mathcal{T}(x)$, and, for an ideal model $f^{*}$, $f^{*}(x') = f^{*}(x)$.
- Functionally Complementary Samples (FCS): Inputs $x''$ with $\mathcal{T}(x'') = \mathcal{T}(x)$ but $f^{*}(x'') \neq f^{*}(x)$, inducing a controlled semantic shift.
These constructions probe model invariance and sensitivity, respectively.
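As a concrete illustration of these definitions, the following sketch uses a hypothetical rule-based "ideal model" on a toy spatial-relation question. The function and the toy task are assumptions for illustration, not part of the FESTA toolkit:

```python
# Toy illustration of FES vs. FCS on a spatial-relation question.
# The fixed scene: the cat is to the left of the dog.

def ideal_model(question: str) -> str:
    """A stand-in 'ideal model' f* that always answers correctly."""
    if "left" in question:
        return "yes"
    if "right" in question:
        return "no"
    return "unknown"

original = "Is the cat to the left of the dog?"
fes = "Is the cat positioned left of the dog?"   # paraphrase: same task, same answer
fcs = "Is the cat to the right of the dog?"      # relation swapped: same task, flipped answer

assert ideal_model(fes) == ideal_model(original)  # FES: f*(x') == f*(x)
assert ideal_model(fcs) != ideal_model(original)  # FCS: f*(x'') != f*(x)
print(ideal_model(original), ideal_model(fes), ideal_model(fcs))  # yes yes no
```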
2. Task-Preserving Sampling Methodology
The sampling approach is central to FESTA. Equivalent samples (FES) are generated by applying transformations that do not alter the underlying task semantics. For visual input, these include operations such as grayscale conversion, mild noise, blurring, rotation, or paraphrasing for text queries. The expectation is that a robust model will produce invariant predictions under such perturbations.
Complementary samples (FCS) are constructed to perturb the input so that an ideal model would produce a different but task-appropriate output. Examples include negating phrases in a prompt or switching spatial relations within an image. While the core task is unchanged, the correct prediction is expected to differ.
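A minimal sketch of such perturbations on a raw image array, using NumPy only; the function names and the specific transforms are illustrative choices, not the paper's exact augmentation pipeline:

```python
import numpy as np

rng = np.random.default_rng(0)

def grayscale(img):
    """Task-preserving (FES): collapse RGB to luminance, keep 3 channels."""
    g = img.mean(axis=-1, keepdims=True)
    return np.repeat(g, 3, axis=-1)

def add_noise(img, sigma=5.0):
    """Task-preserving (FES): mild Gaussian pixel noise, clipped to range."""
    return np.clip(img + rng.normal(0.0, sigma, img.shape), 0, 255)

def horizontal_flip(img):
    """NOT task-preserving for spatial reasoning: left/right are swapped,
    so for such tasks this belongs to the complementary (FCS) set."""
    return img[:, ::-1, :]

img = rng.integers(0, 256, size=(8, 8, 3)).astype(float)
fes_samples = [grayscale(img), add_noise(img)]
fcs_samples = [horizontal_flip(img)]
assert all(s.shape == img.shape for s in fes_samples + fcs_samples)
```

In practice each perturbed input is fed to the black-box model and only the resulting answers are aggregated.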
By generating both sets, FESTA quantifies two distinct properties:
- Consistency: The model’s ability to maintain prediction stability across invariant, equivalent input variants.
- Sensitivity: The model’s responsiveness to input variations that should trigger alternative predictions if the model properly attends to altered semantics.
3. Uncertainty Quantification via KL Divergence
FESTA's uncertainty estimation is based on the predictive distributions $P_{\mathrm{FES}}$ and $P_{\mathrm{FCS}}$ aggregated over the two sample sets:
- For FES, $P_{\mathrm{FES}}$ is the aggregated distribution of model outputs across functionally equivalent samples, and the "ideal" distribution is a delta function $\delta_{\hat{y}}$ at the original prediction $\hat{y}$. The KL divergence quantifies the deviation:

$$U_{\mathrm{eq}} = D_{\mathrm{KL}}\left(\delta_{\hat{y}} \,\|\, P_{\mathrm{FES}}\right)$$

- For FCS, the predictive distribution is normalized to the complement (all $y \neq \hat{y}$), giving $P_{\mathrm{FCS}}^{c}$, and the divergence is taken from the ideal distribution $\pi^{c}$ supported on that complement:

$$U_{\mathrm{comp}} = D_{\mathrm{KL}}\left(\pi^{c} \,\|\, P_{\mathrm{FCS}}^{c}\right)$$

The overall FESTA uncertainty is additive:

$$U_{\mathrm{FESTA}} = U_{\mathrm{eq}} + U_{\mathrm{comp}}$$

This quantifies both prediction stability and sensitivity under task-preserving perturbations.
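The two KL terms and their sum can be sketched for a multiple-choice setting as follows. The uniform ideal over the complement and all variable names are assumptions of this sketch, not necessarily the paper's exact formulation:

```python
import math
from collections import Counter

def kl(p, q, eps=1e-9):
    """KL divergence D(p || q) over a shared key set, with eps smoothing."""
    return sum(pv * math.log((pv + eps) / (q.get(k, 0.0) + eps))
               for k, pv in p.items() if pv > 0)

def dist(answers, options):
    """Empirical answer distribution over a given option set."""
    c = Counter(answers)
    return {o: c[o] / len(answers) for o in options}

options = ["A", "B", "C", "D"]
y_hat = "A"                                   # original model prediction
fes_answers = ["A", "A", "A", "B", "A"]       # outputs on equivalent samples
fcs_answers = ["A", "C", "B", "D", "C"]       # outputs on complementary samples

# Consistency term: ideal FES distribution is a delta at y_hat.
ideal_eq = {o: 1.0 if o == y_hat else 0.0 for o in options}
u_eq = kl(ideal_eq, dist(fes_answers, options))

# Sensitivity term: renormalize FCS outputs to the complement of y_hat and
# compare against a uniform ideal over that complement (a sketch assumption).
comp = [o for o in options if o != y_hat]
fcs_kept = [a for a in fcs_answers if a != y_hat]
p_fcs_c = dist(fcs_kept, comp)
ideal_comp = {o: 1.0 / len(comp) for o in comp}
u_comp = kl(ideal_comp, p_fcs_c)

u_festa = u_eq + u_comp                       # additive overall uncertainty
```

A perfectly consistent and sensitive model drives both terms toward zero; instability under FES or insensitivity under FCS inflates the score.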
4. Experimental Evaluation and Metrics
FESTA is empirically validated on both image and audio reasoning tasks with multiple off-the-shelf vision-LLMs (Gemma-3, LLaVA-1.6, Qwen-2.5VL, Phi-4, Pixtral) and audio-LLMs (Qwen2-Audio, SALMONN). Evaluation is conducted on datasets including BLINK, VSR (for spatial reasoning), and TREA (for audio temporal reasoning).
The key metric is AUROC (Area Under the Receiver Operating Characteristic curve), which probes how well FESTA's uncertainty score discriminates between correct and incorrect predictions. FESTA achieves significant gains: a 33.3% relative improvement for vision-LLMs and 29.6% for audio-LLMs in misprediction detection, compared to baseline entropy-based or black-box uncertainty methods.
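For this binary setting, AUROC reduces to the probability that a randomly chosen misprediction receives a higher uncertainty score than a randomly chosen correct prediction, with ties counted as one half. A dependency-free sketch (the example scores are made up for illustration):

```python
def auroc(scores_pos, scores_neg):
    """AUROC = P(uncertainty of a misprediction > uncertainty of a correct
    prediction), ties counted as 0.5. scores_pos: scores on mispredictions."""
    wins = 0.0
    for sp in scores_pos:
        for sn in scores_neg:
            if sp > sn:
                wins += 1.0
            elif sp == sn:
                wins += 0.5
    return wins / (len(scores_pos) * len(scores_neg))

# A useful uncertainty score ranks mispredictions above correct predictions.
mispredicted = [0.9, 0.7, 0.8]
correct = [0.1, 0.3, 0.7, 0.2]
print(round(auroc(mispredicted, correct), 3))  # 0.958
```

A score of 0.5 corresponds to chance-level ranking, 1.0 to perfect separation.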
Ablation and comparative studies show that KL-divergence-based uncertainty (as formulated in FESTA) outperforms entropy-based approaches, particularly when models exhibit low-uncertainty hallucinations: overconfidence in erroneous outputs.
5. Implementation Details and Practical Utility
FESTA is implemented as an open-source toolkit. It operates in post-hoc fashion, requiring only black-box access (no internal scores or model internals), with input perturbations generated via standard augmentation pipelines (OpenCV, Hugging Face paraphrasers, etc.). The method introduces a minimal hyperparameter set, mainly the number of equivalent and complementary samples for aggregation. Configuration can be adapted for multi-choice QA or other classification formulations. The codebase is made available for reproducibility and further extension.
The framework supports deployment in safety-critical domains (healthcare, finance, autonomous systems). A plausible implication is that black-box uncertainty scores from FESTA can be used to abstain from prediction or trigger human review when uncertainty is high. This mitigates the risk of low-uncertainty hallucinations, an issue exacerbated in multimodal tasks.
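A minimal sketch of such an abstention rule and its coverage/risk trade-off; the threshold, data, and function names are illustrative assumptions, and in practice the threshold would be tuned on held-out data:

```python
def selective_predict(prediction, uncertainty, threshold):
    """Abstain (defer to human review) when uncertainty exceeds the threshold."""
    return ("abstain", None) if uncertainty > threshold else ("predict", prediction)

def risk_coverage(is_correct, uncertainties, threshold):
    """Coverage = fraction of inputs answered; risk = error rate among them."""
    kept = [c for c, u in zip(is_correct, uncertainties) if u <= threshold]
    if not kept:
        return 0.0, 0.0
    return len(kept) / len(is_correct), 1.0 - sum(kept) / len(kept)

# Mispredictions tend to carry high uncertainty, so thresholding trades
# coverage for lower risk.
is_correct = [1, 1, 0, 1, 0, 1]
uncertainties = [0.1, 0.2, 0.9, 0.3, 0.8, 0.4]
cov, risk = risk_coverage(is_correct, uncertainties, threshold=0.5)
```

Here the two high-uncertainty mispredictions are routed to review, so the answered subset is error-free at two-thirds coverage.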
6. Applications and Limitations
FESTA enables robust trust assessment, addressing the challenge of reliable uncertainty estimation in domains where ground truth is expensive or unavailable. By probing consistency and sensitivity, it provides an actionable metric for selective prediction. The methodology is directly applicable to QA, spatial and temporal reasoning, and diagnostic tasks using MLLMs.
However, its current formulation is tailored to multiple-choice tasks; extending it to fully generative settings may require further research. The technique’s accuracy relies on the diversity and quality of equivalent/complementary input transformations—if perturbations are not truly task-preserving or semantically complementary, the uncertainty estimate may be less informative.
7. Comparative Context and Significance
FESTA advances trust assessment beyond standard entropy-based or sampling-based methods by formalizing input perturbation along functionally defined axes and quantifying model response via KL divergence (Bhattacharya et al., 20 Sep 2025). Previous works have addressed trust and uncertainty assessment in DNNs through operational sampling (Guerriero et al., 2024) or coverage-based redundancy (Bertolino et al., 2023), but FESTA’s black-box multimodal approach is distinct in its generality and empirical improvement.
This suggests that principled input sampling, guided by formal task-preservation criteria, is necessary for reliable trust estimation in increasingly complex model settings. The improvement in AUROC establishes FESTA as a well-founded framework for quantifying trust and uncertainty without requiring access to ground-truth annotations or model internals—a key requirement for contemporary applications.
In summary, Functionally Equivalent Sampling for Trust Assessment (FESTA) provides a technically rigorous, empirically validated framework for model trust quantification, leveraging orthogonal input perturbation and KL-divergence–based uncertainty metrics to enable reliable selective prediction in multimodal LLMs (Bhattacharya et al., 20 Sep 2025).