
Vision-Amplified Semantic Entropy (VASE)

Updated 20 January 2026
  • VASE is a metric that quantifies semantic uncertainty by contrasting answer distributions from clean and perturbed visual inputs for hallucination detection.
  • It utilizes both NLI- and embedding-based clustering to group model outputs into semantic equivalence classes for robust uncertainty estimation.
  • Empirical results show that VASE outperforms traditional metrics by achieving higher AUC and robustness in general, medical, and video VQA tasks.

Vision-Amplified Semantic Entropy (VASE) is an evaluation metric for hallucination detection in visual question answering (VQA) with multimodal LLMs (MLLMs), covering both medical and general-purpose vision-LLMs as well as their video-extended counterparts. VASE quantifies semantic uncertainty in model outputs and, critically, amplifies the signal associated with visual grounding by comparing the entropy of answer distributions for clean versus perturbed visual inputs. The methodology addresses limitations of traditional uncertainty and semantic-diversity metrics, offering robustness and sensitivity to vision-specific errors such as hallucinations: cases where generated answers are not supported by the input imagery.

1. Conceptual Foundations and Mathematical Formulation

The foundation of VASE is rooted in semantic entropy (SE), adapted to measure not only the spread of model answers in semantic space but also their sensitivity to vision perturbations. For a given image–question (or video–question) pair, a set of model-generated answers is clustered into semantic equivalence classes. The central quantities can be summarized as follows:

Let

  • $A_0$: baseline, low-temperature answer to the visual input (image or video) and question,
  • $\{A_1, \ldots, A_n\}$: $n$ high-temperature samples from the clean input,
  • $\{N_1, \ldots, N_n\}$: $n$ high-temperature samples from the perturbed (noisy or distorted) input,
  • $\ell_i$: mean token log-likelihood of answer $i$.

All $2n+1$ answers are clustered into $K$ semantic clusters $\{C_j\}$ (using NLI-based or embedding-based clustering). The semantic predictive distributions $s_{\text{clean}}$ and $s_{\text{noisy}}$ are computed from the clean and noisy samples, respectively:

$$S_j = \sum_{i:\, c_i = j} \exp(\ell_i - \max_k \ell_k), \qquad s_j = \frac{S_j}{\sum_{m=1}^K S_m}$$

Semantic Entropy (SE) is:

$$\mathrm{SE} = -\sum_{j=1}^K s_{\text{clean},j} \log s_{\text{clean},j}$$

Vision-Amplified Semantic Entropy (VASE) is defined as:

$$t = \mathrm{softmax}\left( s_{\text{clean}} + \alpha \left( s_{\text{clean}} - s_{\text{noisy}} \right) \right)$$

$$\mathrm{VASE} = -\sum_{j=1}^K t_j \log t_j$$

with $\alpha$ (typically 1.0) controlling the amplification of visual sensitivity. The contrastive term amplifies shifts in semantic mass between clean and noisy inputs, yielding a highly vision-sensitive uncertainty signal. This framework extends readily to video by applying analogous perturbations across spatiotemporal dimensions (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).
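The quantities above can be sketched in plain Python; the function and variable names here (`cluster_distribution`, `vase`, etc.) are illustrative, not taken from a published implementation.

```python
import math

def cluster_distribution(log_likelihoods, cluster_ids, num_clusters):
    """Per-cluster semantic mass s_j from mean token log-likelihoods."""
    max_ll = max(log_likelihoods)
    mass = [0.0] * num_clusters
    for ll, c in zip(log_likelihoods, cluster_ids):
        mass[c] += math.exp(ll - max_ll)  # subtract max for numerical stability
    total = sum(mass)
    return [m / total for m in mass]

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(p):
    return -sum(x * math.log(x) for x in p if x > 0.0)

def vase(s_clean, s_noisy, alpha=1.0):
    """VASE over aligned clean/noisy cluster distributions."""
    t = softmax([c + alpha * (c - n) for c, n in zip(s_clean, s_noisy)])
    return entropy(t)
```

Note that when the clean and noisy distributions coincide, the contrastive term vanishes and VASE reduces to the entropy of the softmaxed clean distribution.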

2. Sampling and Clustering Methodologies

VASE requires carefully structured sampling and clustering for effective hallucination detection:

Sampling protocol:

  1. Generate $A_0$ with low temperature ($T \approx 0.1$) to approximate the modal answer.
  2. For the clean input, sample $n$ outputs at high temperature ($T = 1.0$).
  3. Apply controlled visual perturbations (random affine transforms, color jitter, Gaussian or Poisson noise) to the input and draw $n$ corresponding high-temperature outputs.
  4. Retain per-sample log-likelihoods for weighting.
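A minimal sketch of this protocol, assuming hypothetical callables `generate(visual, question, temperature)` (the MLLM decoder, returning an answer together with its mean token log-likelihood) and `perturb(visual)` (the visual-distortion pipeline):

```python
def collect_samples(generate, perturb, visual, question, n=10):
    """Sampling protocol for VASE.

    `generate` and `perturb` are stand-ins for the model and distortion
    pipeline; each `generate` call is assumed to return a tuple
    (answer_text, mean_token_log_likelihood).
    """
    a0 = generate(visual, question, temperature=0.1)  # modal baseline A_0
    clean = [generate(visual, question, temperature=1.0) for _ in range(n)]
    noisy = [generate(perturb(visual), question, temperature=1.0)
             for _ in range(n)]
    return a0, clean, noisy
```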

Semantic clustering:

  • NLI-based: pairwise bidirectional entailment (e.g., with DeBERTa-MNLI), followed by entailment-graph analysis to define mutually entailing clusters.
  • Embedding-based: sentence encoders such as SBERT produce vector representations; clusters are extracted via cosine-similarity thresholding and kNN graph connectivity.
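The embedding-based variant amounts to taking connected components of a similarity graph. A minimal sketch with toy vectors (the similarity threshold of 0.85 is an illustrative choice, not a value from the source):

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_embeddings(vectors, threshold=0.85):
    """Connected components of the graph linking answers whose embedding
    cosine similarity exceeds `threshold` (union-find)."""
    parent = list(range(len(vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            if cosine(vectors[i], vectors[j]) >= threshold:
                parent[find(i)] = find(j)

    # Relabel roots as consecutive cluster ids.
    roots = {}
    return [roots.setdefault(find(i), len(roots)) for i in range(len(vectors))]
```

In practice the vectors would come from a pretrained sentence encoder such as SBERT rather than being hand-written.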

Class alignment is performed when clustering the clean and noisy samples yields differing numbers of clusters; alignment is determined through cross-entailment (NLI case) or vector proximity (embedding case) (Gautam et al., 16 Nov 2025, Liao et al., 26 Mar 2025).

3. Practical Algorithmic Pipeline and Implementation

The canonical VASE computation pipeline involves the following steps:

  1. Sampling: Draw the required clean and perturbed samples, keeping $n$ in the empirically supported range (10–15 for image VQA, $\geq 6$ for improved sensitivity in video VQA).
  2. Sequence Assembly: Jointly assemble all answer samples and record their log-likelihoods.
  3. Semantic Clustering: Cluster answers using either NLI-based ($O(n^2)$ pairwise entailment calls) or embedding-based ($O(nd)$ for $n$ answers of embedding dimension $d$; highly scalable) methods.
  4. Metric Computation: Calculate cluster-wise semantic distributions, semantic entropy (SE), RadFlag (cluster drift rate), and the VASE statistic.
  5. Thresholding: Use VASE in conjunction with a pre-determined threshold (calibrated on validation sets) to flag hallucinations.

Distortion transformations are implemented via libraries such as Albumentations; clustering uses pretrained backbone models (DeBERTa-MNLI for NLI, SentenceTransformer for embedding) (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026).
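The final thresholding step can be calibrated on a validation set without hallucination-specific tuning machinery; a minimal sketch, assuming binary labels (1 = hallucinated) and maximizing Youden's J statistic, which is one reasonable criterion among several:

```python
def calibrate_threshold(scores, labels):
    """Pick the VASE threshold maximizing Youden's J (TPR - FPR) on a
    validation set. `labels`: 1 = hallucinated, 0 = grounded."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        j = tp / pos - fp / neg
        if j > best_j:
            best_j, best_t = j, t
    return best_t

def flag_hallucination(vase_score, threshold):
    """High VASE means high vision-sensitive uncertainty: flag the answer."""
    return vase_score >= threshold
```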

4. Empirical Evaluation and Comparative Performance

Comprehensive experiments have been conducted in multiple domains:

  • Medical VQA (e.g., CheXagent, LLaVA-Med on MedDiff-VQA and VQA-RAD): VASE achieves higher AUC and Area Under GREEN (AUG) metrics compared to all baselines, notably outperforming token-level uncertainty and even semantic entropy without vision amplification. For instance, on CheXagent+MedDiff-VQA, VASE attains AUC 87.8% and AUG 62.1% ($n = 10$), above SE (AUC 86.5%) and RadFlag (Liao et al., 26 Mar 2025).
  • General and Video-VQA (HEDGE, VideoHEDGE frameworks): VASE exhibits superior robustness, especially with larger distortion budgets. For 7B-parameter video-VLMs on SoccerChat, VASE yields ROC-AUC increases of +0.1–0.2 over SE and RadFlag, which are often near chance (AUC $\approx 0.5$) under video perturbations. Performance gains saturate with $\gtrsim 10$ clean/noisy samples (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026).
  • Prompt and Architecture Sensitivity: VASE is especially effective for unified-fusion models (e.g., Qwen2.5-VL), concise label-style prompts, and when using embedding-based clustering for short answers and NLI-based clustering for longer or more complex ones. SE and RadFlag are less reliable at low budgets ($n \leq 2$) or for weaker architectures (Gautam et al., 16 Nov 2025).
| Metric  | MedDiff-VQA AUC | VQA-RAD AUC    | KvasirVQA-x1 AUC |
|---------|-----------------|----------------|------------------|
| SE      | 86.5            | 73.0           | ≈ 0.88           |
| RadFlag | (not reported)  | (not reported) | (not reported)   |
| VASE    | 87.8            | 76.5           | ≈ 0.89           |

All scores as reported in original studies at optimal sampling (Gautam et al., 16 Nov 2025, Liao et al., 26 Mar 2025).

5. Guidelines, Limitations, and Best Practices

VASE's operational success relies on:

  • Choice and magnitude of visual perturbations: Weak transforms must preserve semantic content, especially in medical applications where strong distortions can obscure relevant diagnostics.
  • Sampling budget: $n$ = 10–15 is generally optimal; larger $n$ offers diminishing returns in detection AUC.
  • Clustering strategy: Embedding-based clustering provides scalable, comparable performance to NLI in short-form answers; use NLI for complex semantics.
  • Threshold calibration: The hallucination flag threshold for VASE should be set on a validation set to balance false positives and negatives; typically, no additional hallucination-annotated data is required (Liao et al., 26 Mar 2025).

Limitations include the dependence on pretrained NLI or embedding models (for clustering), and in medical VQA, careful tuning of perturbations is required to remain within clinically valid bounds. VASE is largely architecture-agnostic but is most effective when paired with models that are not heavily language-prior dominated.

6. Extension to Video and Broader Modalities

The VideoHEDGE framework generalizes VASE to spatiotemporal inputs, combining photometric and temporal perturbations with the same entropy amplification logic. Empirical results affirm that vision-based amplification is crucial for hallucination detection in the video domain, as SE and RadFlag rapidly lose discriminative power with increasing task complexity and input ambiguity (Gautam et al., 13 Jan 2026).

This generalization supports the broader claim that VASE formalizes multimodal hallucination detection as a contrastive, perturbation-amplified entropy analysis in semantic space. The method is deployable across image, video, and potentially future sensor modalities, provided appropriate clustering and perturbation operators are defined.

7. Summary and Impact

VASE has established itself as a preferred hallucination detection metric within recent multimodal reliability benchmarks. By leveraging contrastive entropy between clean and perturbed semantic prediction distributions, it detects vision-unreliable answers with superior robustness compared to both token-centric and pure semantic-diversity baselines. The method has been successfully adopted in general-purpose, medical, and video VQA settings, with community-standard implementations available via the hedge-bench benchmarking library. Its principled, black-box nature requires only query access to the underlying model, eliminating the need for model modifications or handcrafted hallucination data (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).
