SoccerChat Benchmark for VLM Evaluation

Updated 20 January 2026

SoccerChat Benchmark is a framework that assesses hallucination detection by quantifying semantic uncertainty under controlled visual perturbations in vision-language models.
It employs a systematic pipeline combining controlled sampling, semantic clustering (using embedding and NLI-based methods), and contrastive entropy computation for robust evaluation.
The benchmark demonstrates improved ROC-AUC in video and VQA settings, offering a compute-efficient, architecture-agnostic tool for soccer event classification and related applications.

Vision-Amplified Semantic Entropy (VASE) is a metric and framework for hallucination detection in vision-LLMs (VLMs), notably in the domains of visual question answering (VQA), medical imaging, and video understanding. By quantifying the effect of controlled visual perturbations on the semantic uncertainty of model-generated responses, VASE robustly distinguishes between grounded answers and hallucinations—responses inconsistent with the visual evidence. VASE generalizes earlier use of semantic entropy by explicitly contrasting model behavior under clean and perturbed visual inputs, yielding a compute-efficient and architecture-agnostic signal for reliability assessment in both images and videos (Gautam et al., 16 Nov 2025, &&&1&&&, Liao et al., 26 Mar 2025).

1. Conceptual Overview and Rationale

Semantic Entropy (SE) measures the spread of model responses in semantic space, using clustering of sampled outputs to estimate epistemic uncertainty. VASE extends SE by introducing visual amplification: it measures the increase in semantic dispersion after applying slight distortions to the input images or videos. The underlying hypothesis is that faithful, vision-grounded models are resilient to moderate perturbations, while hallucinating models display increased epistemic uncertainty or cluster drift. Therefore, a substantial jump in semantic dispersion, as measured by VASE, is indicative of hallucination (Gautam et al., 16 Nov 2025, Liao et al., 26 Mar 2025).

2. Formal Definitions and Core Mathematical Formulation

Let $\{A_1, \dots, A_n\}$ denote $n$ high-temperature generations on a clean visual input, and $\{N_1, \dots, N_n\}$ be corresponding samples under controlled perturbations. Each output $A_i$ and $N_i$ is assigned semantic cluster $c_i$ via one of two methods: NLI-based (Mutual NLI inference) or embedding-based (vector similarity).

For $K$ total clusters:

Semantic Distribution:

$s^{\rm clean}_j = \frac{\exp\left(\sum_{i: c_i = j} \exp(\ell_i - \max_k \ell_k)\right)}{\sum_{m=1}^{K} \exp\left(\sum_{i: c_i = m} \exp(\ell_i - \max_k \ell_k)\right)}$ where $\ell_i$ is the mean token log-likelihood and $j = 1, \ldots, K$ . $s^{\rm noisy}_j$ is computed identically on noise-perturbed samples.

Semantic Entropy (SE):

$\mathrm{SE} = -\sum_{j=1}^K s^{\rm clean}_j \log s^{\rm clean}_j$

RadFlag:

$\mathrm{RadFlag} = 1 - \frac{1}{n} \sum_{i=1}^{n} \mathbf{1}[c_i = c_0]$ where $c_0$ is the cluster of the baseline (low-temperature) answer.

Vision-Amplified Semantic Entropy (VASE):

$t_j = \mathrm{softmax}_j\left(s^{\rm clean}_j + \alpha (s^{\rm clean}_j - s^{\rm noisy}_j)\right)$ $\mathrm{VASE} = -\sum_{j=1}^K t_j \log t_j$ Here, $\alpha > 0$ controls the amplification; $\alpha=1.0$ is standard (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026, Liao et al., 26 Mar 2025).

In medical MLLM settings, weak perturbations (crop, slight rotation, intensity adjustments) are used to preserve clinical validity; VASE can also contrast response distributions under weak and strong perturbations for enhanced sensitivity (Liao et al., 26 Mar 2025).

3. Algorithmic Pipeline

The VASE computation consists of four essential stages:

Controlled Sampling:
- Draw one low-temperature baseline answer.
- Sample $n$ answers at $T=1.0$ on unmodified input.
- Generate $n$ visually perturbed versions; sample responses per distortion.
- Recommended $n \approx 10–15$ for performance–cost trade-off (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026).
Semantic Clustering:
- NLI-based: Pairwise bidirectional entailment using large MNLI models, yielding transitive closure clusters (contradictions filtered).
- Embedding-based: Sentence encoder transforms responses to vectors; (cosine similarity $\geq\tau$ or kNN criteria) defines clusters. $\tau$ is tuned by maximizing ROC-AUC.
Probability Mass Assignment:
- Aggregate normalized exponential of mean log-likelihood per cluster for both clean and noisy sets.
Contrastive Entropy Calculation:
- Compute SE and VASE as above.
- Assign hallucination label if VASE exceeds threshold $\tau$ calibrated on validation data.

The entire pipeline is reproducibly implemented within the hedge-bench toolkit, with tunable distortion policies and clustering strategies (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026).

4. Practical Implementations and Benchmark Evaluations

VASE has been validated in multiple domains:

Medical VQA:

On MIMIC-Diff-VQA and VQA-RAD, VASE outperforms AvgEnt, MaxEnt, SE, and RadFlag, with AUC increases on both CheXagent and LLaVA-Med backbones (e.g., AUC 87.8 vs. 86.5 for SE on MedDiff-VQA) (Liao et al., 26 Mar 2025).

General VQA (HEDGE):

On VQA-RAD and KvasirVQA-x1, VASE—especially with embedding-based clustering and $n\approx 10–15$ —achieves highest ROC-AUC, outperforming SE and RadFlag across tested VLMs (Qwen2.5-VL, LLaVA-Med, Med-Gemma), with AUCs up to 0.89 for concise prompts (Gautam et al., 16 Nov 2025).

Video VQA (VideoHEDGE):

In SoccerChat EventClassification and VideoQA, VASE offers consistent ROC-AUC gains (+0.1–0.2) versus SE and RadFlag, which often operate near chance levels in video settings. Embedding clustering reaches NLI-based detection accuracy at much lower computation (Gautam et al., 13 Jan 2026).

A summary of empirical results:

Domain	Best Performer	VASE AUC Improvement	Clustering Type / Notes
Medical (VQA)	VASE	+1.3–3.5% AUC	NLI-cluster; weak noise
General (VQA)	VASE	+0.02–0.15 AUC	Embedding/entailment
Video (VQA)	VASE	+0.1–0.2 AUC	Embedding, more distortions

5. Methodological Considerations and Sensitivity

Several design factors impact VASE detection accuracy:

Sampling Budget ( $n$ ):

Detection performance saturates for $n\geq 10–15$ ; very low $n$ reduces reliability except for RadFlag, which remains moderately robust at $n=1$ (Gautam et al., 16 Nov 2025).

Clustering Method:

Embedding-based clustering is preferable for short, label-style answers. NLI clustering offers superior separation for longer, sentence-level outputs. Computational complexity is significantly lower for embedding-based approaches (Gautam et al., 16 Nov 2025, Gautam et al., 13 Jan 2026).

Perturbation Policy:

In medical domains, only weak, clinically acceptable transforms are used for clean samples, with strong noise reserved for contrastive amplification. Excessive perturbations may compromise semantic alignment or clinical fidelity (Liao et al., 26 Mar 2025).

Prompt Sensitivity:

Minimal-label prompts yield stronger hallucination signals for powerful models; verbose or rigid prompts reduce separability but VASE maintains a significant lead over alternatives (Gautam et al., 16 Nov 2025).

6. Comparative Analysis and Empirical Insights

VASE is empirically shown to surpass both token-level uncertainty metrics (AvgEnt, MaxEnt) and prior sample-level consistency checks (RadFlag, vanilla SE):

On VQA-RAD and KvasirVQA-x1, VASE leads to a ROC-AUC of up to 0.89 (embedding clustering, default prompt) (Gautam et al., 16 Nov 2025).
In video-VQA, VASE scales with the number of distinct distortions, saturating above 6; it achieves up to 0.67 ROC-AUC vs. flat 0.50–0.55 for SE/RadFlag (Gautam et al., 13 Jan 2026).
In medical VQA, ablation demonstrates that the combination of weak transformations and the contrastive step yields maximal AUC/AUG, exceeding any one component in isolation (Liao et al., 26 Mar 2025).

Key observations:

Clean-only semantic uncertainty is frequently insufficient in the presence of strong language priors or domain shifts.
Vision amplification via contrastive entropy accentuates the contribution of visual evidence, penalizing hallucinatory, language-driven outputs.
Embedding clustering is nearly as effective as the more costly NLI-based clustering, especially in non-clinical domains.

7. Guidelines for Application and Extension

VASE should be prioritized as the primary hallucination detection metric in VQA models wherever controlled visual perturbations are feasible and supported.
Use $n=10–15$ paired clean/noisy samples for optimal trade-off between compute and detection accuracy; with severe resource constraints or unavailable perturbations, fall back to RadFlag (Gautam et al., 16 Nov 2025).
For short answers, prefer embedding clustering with tuned similarity threshold; for sentence-like outputs, employ NLI-based semantic clustering (Gautam et al., 16 Nov 2025, Liao et al., 26 Mar 2025).
In medical/clinical applications, closely regulate perturbation type and amplitude to preserve input validity; select transforms congruent with domain constraints (Liao et al., 26 Mar 2025).
Thresholds for VASE should be set using validation data in the target domain to balance sensitivity and specificity.

This treatment is grounded in the original formalism and empirics in "HEDGE" (Gautam et al., 16 Nov 2025), "VideoHEDGE" (Gautam et al., 13 Jan 2026), and the foundational VASE medical VQA study (Liao et al., 26 Mar 2025).