Spatiotemporal-Semantic Contrastive Decoding
- Spatiotemporal-Semantic Contrastive Decoding (SSCD) mitigates hallucination in VideoLLMs by constructing counterfactual negative features that target spatial, temporal, and semantic inconsistencies.
- It employs a lightweight disruptor network to inject perturbations into video feature encodings, ensuring evidence-grounded and accurate token selection.
- Empirical results across multiple benchmarks demonstrate that SSCD significantly improves accuracy and reduces false outputs compared to conventional heuristic methods.
Spatiotemporal-Semantic Contrastive Decoding (SSCD) is a decoding paradigm for video LLMs (VideoLLMs) that directly mitigates hallucination—generation of outputs that are inconsistent with explicit video content or factual evidence—by leveraging structured negative features targeting the underlying spatiotemporal and semantic dependencies of video data. Conventional decoding and heuristic perturbation methods inadequately address the root causes of hallucination and fine-grained temporal-semantic correlations, reducing their effectiveness and generalization. SSCD offers a systematic approach for constructing and leveraging counterfactual (negative) features, enabling robust, evidence-grounded generation without compromising task performance (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
1. Hallucination in VideoLLMs: Roots and Challenges
Hallucination in VideoLLMs encompasses several patterns distinctive to the video modality:
- Spatial hallucination: The model invents objects/attributes not visible in any frame.
- Temporal hallucination: The model asserts event order or causality contrary to the actual sequence.
- Semantic hallucination: The model misattributes objects, actions, or attributes inconsistent with the video.
Prior contrastive decoding methods—such as randomly perturbing spatial or temporal information—mostly rely on heuristics and fail to precisely capture the causal patterns leading to hallucination. This inadequacy is especially apparent in complex, multi-event videos or queries demanding nuanced temporal and semantic reasoning (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
2. Core Methodology: Construction and Use of Negative Features
Spatiotemporal-Semantic Contrastive Decoding systematically constructs negative features that deliberately disrupt the spatial, temporal, and semantic consistency of the video encoding, and at inference penalizes tokens favored by these corrupted features.
a. Spatiotemporal Disruption:
A lightweight "disruptor" network is trained to inject residual perturbations into the frozen VideoLLM's encoded features so as to break long-range coherence. For each video, the framewise encoded tokens $F$ are perturbed residually, $F^{-} = F + D_\theta(F)$, where $D_\theta$ denotes the disruptor.
Cycle-consistency in a sparse video graph is minimized, encouraging the disruptor to break temporal chains and inter-frame dependencies (Eqs. 6–11 in (Gao et al., 30 Jan 2026)).
b. Semantic Disruption:
The disruptor is also trained to minimize the mutual information between the disrupted visual features and the ground-truth answer, yielding negative features that are maximally "hallucination-prone" (Eqs. 12–15).
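As a toy illustration of the residual disruption above, the following sketch perturbs frozen frame features with a small MLP; all shapes, names, and the two-layer MLP architecture are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def disruptor(features, W1, W2):
    # Hypothetical lightweight disruptor: a two-layer MLP whose output is
    # added as a residual perturbation to the frozen video features.
    hidden = np.maximum(features @ W1, 0.0)   # ReLU bottleneck
    return features + hidden @ W2             # residual injection

# Toy dimensions: 8 frames, 16-dim frame tokens, 4-dim bottleneck.
F = rng.standard_normal((8, 16))
W1 = 0.1 * rng.standard_normal((16, 4))
W2 = 0.1 * rng.standard_normal((4, 16))

F_neg = disruptor(F, W1, W2)
print(F_neg.shape)   # same shape as F; content is perturbed
```

In the full method, the disruptor's parameters would be trained against the cycle-consistency and mutual-information objectives while the backbone stays frozen; here the weights are random only to show the data flow.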
c. Model-aware Negatives:
Other frameworks, such as MACD, use model-guided counterfactual generation. Here, object/region masks are constructed and optimized—by gradient ascent on the model's NLL—so as to identify and "knock out" the most influential regions, generating targeted spatial, temporal, and semantic negatives (Xiao et al., 2 Feb 2026).
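The mask-optimization idea behind such model-aware negatives can be sketched with a toy stand-in for the model's NLL; the linear loss, region count, and finite-difference gradients below are illustrative assumptions rather than MACD's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 8))    # 4 regions x 8-dim features (toy)
w = rng.standard_normal(8)         # stand-in for the model's readout weights

def nll(mask):
    # Toy stand-in for the model's negative log-likelihood on the correct
    # answer; knocking out influential regions (mask -> 0) should raise it.
    return -(F * mask[:, None]).sum(axis=0) @ w

mask, lr, eps = np.ones(4), 0.5, 1e-4
for _ in range(50):
    # Finite-difference gradient of the NLL w.r.t. each mask strength.
    grad = np.array([(nll(mask + eps * np.eye(4)[i]) -
                      nll(mask - eps * np.eye(4)[i])) / (2 * eps)
                     for i in range(4)])
    mask = np.clip(mask + lr * grad, 0.0, 1.0)   # gradient *ascent* on NLL

# Regions driven toward 0 are the ones the model leans on most.
print(mask)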
3. Contrastive Decoding Framework: Inference Objectives
The contrastive decoding mechanism uses two parallel branches—
- Positive: real video features
- Negative: disrupted/counterfactual features
At each decoding timestep, the model computes the standard logits on both streams and forms a contrastive logit

$\ell_t^{\mathrm{CD}} = (1+\alpha)\,\ell_t^{+} - \alpha\,\ell_t^{-}$

where $\alpha > 0$ scales the negative penalty. A plausibility filter restricts candidate tokens to those whose positive-branch probability exceeds a relative threshold, mitigating over-penalization of plausible continuations. The final token is selected from this filtered set, either greedily or by sampling (Gao et al., 30 Jan 2026, Xiao et al., 2 Feb 2026).
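A minimal greedy decoding step with a contrastive logit and plausibility filter might look as follows; the $(1+\alpha)$ scaling follows the common contrastive-decoding convention, and the specific alpha/beta values and toy logits are assumptions:

```python
import numpy as np

def contrastive_decode(logits_pos, logits_neg, alpha=1.0, beta=0.1):
    # Contrastive token scoring with an adaptive plausibility filter:
    # alpha scales the negative penalty, beta the relative threshold.
    p_pos = np.exp(logits_pos - logits_pos.max())
    p_pos /= p_pos.sum()
    # Keep only tokens whose positive-branch probability is at least
    # beta times the maximum probability.
    keep = p_pos >= beta * p_pos.max()
    scores = (1 + alpha) * logits_pos - alpha * logits_neg
    scores = np.where(keep, scores, -np.inf)
    return int(np.argmax(scores))        # greedy selection

logits_pos = np.array([2.0, 1.8, 0.1, -1.0])
logits_neg = np.array([0.5, 1.9, 0.0, -1.0])
token = contrastive_decode(logits_pos, logits_neg)
print(token)  # -> 0
```

Token 1 is nearly as probable as token 0 under the positive branch, but it is also strongly favored by the hallucination-prone negative branch, so the contrast pushes selection to token 0.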
In adaptive approaches such as SEASON, multiple negatives (spatial, temporal, and semantic) are fused. Per-token "self-diagnostic" divergence weights, derived from attention distribution differences, dynamically calibrate each negative branch's impact (Wu et al., 4 Dec 2025).
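A per-token divergence weight of this kind can be derived from the attention distributions of the two branches; a Jensen–Shannon divergence is one plausible instantiation (the choice of divergence and the toy attention vectors below are assumptions, not SEASON's exact formula):

```python
import numpy as np

def js_divergence(p, q):
    # Symmetric, bounded divergence between two attention distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Attention over 4 video tokens under the positive vs. a negative branch.
attn_pos = np.array([0.4, 0.3, 0.2, 0.1])
attn_neg = np.array([0.1, 0.2, 0.3, 0.4])
w = js_divergence(attn_pos, attn_neg)   # larger divergence -> stronger penalty
print(round(w, 4))
```

The weight would then scale that negative branch's contribution in the fused contrastive logit: branches whose attention diverges most from the positive view are the riskiest and get penalized hardest.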
4. Unified Inference Algorithms
A generic pseudocode pattern for SSCD-style methods is as follows:
- Extract video features: Encode video frames to obtain framewise feature tokens.
- Construct negatives:
- SSCD: Apply the trained disruptor to the encoded features to obtain disrupted negatives.
- MACD: Optimize object/frame mask strengths by gradient ascent on model loss; generate targeted counterfactual input.
- SEASON: Add Gaussian noise for spatial negatives, homogenize features across frames for temporal negatives, or generate more complex semantic negatives.
- Project features into model space: For both positive and negative streams.
- Contrastive scoring: Compute contrastive logits, possibly with dynamic weighting, and select candidate tokens by plausibility.
- Token selection: Greedy or sampled from filtered, contrastively-calibrated distribution (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
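The steps above can be strung together in a toy end-to-end sketch; every component here (the random "encoder", the noise-based spatial negative, the mean-pooled language head, the dimensions) is an illustrative stand-in, not any paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB, DIM, FRAMES = 6, 16, 4
W_lm = 0.5 * rng.standard_normal((DIM, VOCAB))   # stand-in language head

def encode_video():
    return rng.standard_normal((FRAMES, DIM))    # 1. extract video features

def make_negative(F):
    # 2. construct a SEASON-style spatial negative by adding Gaussian noise.
    return F + 0.5 * rng.standard_normal(F.shape)

def logits(F):
    return F.mean(axis=0) @ W_lm                 # 3. project pooled features

F = encode_video()
l_pos, l_neg = logits(F), logits(make_negative(F))

alpha, beta = 1.0, 0.1
p = np.exp(l_pos - l_pos.max()); p /= p.sum()
keep = p >= beta * p.max()                       # 4. plausibility filter
scores = np.where(keep, (1 + alpha) * l_pos - alpha * l_neg, -np.inf)
token = int(np.argmax(scores))                   # 5. greedy token selection
print(token)
```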
5. Theoretical Rationale: Why Contrastive Negatives Suppress Hallucination
SSCD directly targets the core patterns responsible for hallucination by constructing negatives that decouple only spatiotemporal and semantic grounding, rather than globally perturbing features. Subtractive contrast at the logits level discourages the selection of tokens that are more probable under the hallucination-prone (negative) views, while softly retaining the original model’s token distribution. This approach maintains general video understanding and reasoning performance, as only a lightweight component is trained or adaptively optimized, with the VideoLLM backbone frozen (Gao et al., 30 Jan 2026).
Model-aware strategies (MACD) additionally align the negative features with model-specific vulnerabilities, focusing the contrastive penalty directly on the evidence most responsible for hallucination (Xiao et al., 2 Feb 2026).
6. Broader Frameworks and Extensions
SEASON demonstrates that this method is extensible to a unified, inference-time, training-free implementation using self-diagnostic token-wise risk scores, blending spatial, temporal, and semantic negatives with dynamic attention-based weights. The framework allows further refinement (e.g., introducing separate contrastive strengths for each of the spatial, temporal, and semantic negative families, or normalization constraints across them). The negative families can be realized by object masking, action swaps, or attribute perturbation, isolating specific sources of hallucination (Wu et al., 4 Dec 2025).
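Fusing the three negative families with separate strengths can be sketched as follows; the per-family alphas, their values, and the $(1+\sum_k \alpha_k)$ positive scaling are assumptions chosen to stay consistent with the single-negative contrastive form:

```python
import numpy as np

def fused_contrastive(l_pos, negs, alphas):
    # Fuse several negative branches (spatial/temporal/semantic) with
    # per-family strengths; their sum acts as the total penalty budget.
    total = sum(alphas)
    out = (1 + total) * l_pos
    for l_neg, a in zip(negs, alphas):
        out -= a * l_neg
    return out

l_pos = np.array([1.0, 0.5, -0.2])
negs = [np.array([0.9, 0.1, 0.0]),   # spatial negative logits
        np.array([0.2, 0.6, 0.0]),   # temporal negative logits
        np.array([0.1, 0.1, 0.3])]   # semantic negative logits
scores = fused_contrastive(l_pos, negs, alphas=[0.5, 0.3, 0.2])
print(scores)  # -> [ 1.47  0.75 -0.46]
```

With alphas normalized to sum to 1, the fused rule reduces to the single-negative rule applied to a convex mixture of the negative logits, which keeps the penalty scale comparable across configurations.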
7. Empirical Evaluation and Comparative Performance
SSCD has been evaluated across major VideoLLM backbones, including Video-LLaVA, LLaVA-NeXT-Video, Qwen-VL, and InternVL models (Gao et al., 30 Jan 2026, Xiao et al., 2 Feb 2026). Key benchmarks include VideoHallucer, EventHallusion, VideoHallu, ActivityNet-QA, MMVU, MVBench, and Perception-test. Experimental outcomes demonstrate:
| Model | Benchmark | Base Acc. | SSCD/MACD Acc. | Notable Gains |
|---|---|---|---|---|
| Video-LLaVA | VideoHallucer | 14.3 | 22.9 | ORH ↑35.5→46.5; TH ↑9.7→29.5 |
| LLaVA-NeXT | VideoHallucer | 32.3 | 35.4 | ORH ↑58.0→60.5; TH ↑21.6→32.4 |
| Qwen2.5-VL-7B | EventHallusion | 0.44 | 0.61 | F1 ↑0.44→0.67 |
SSCD reduces hallucination rates across all subtypes without adversely impacting standard QA and reasoning metrics. MACD provides further improvements in scenarios involving small, occluded, or co-occurring objects, as well as strong reduction in false "yes" rates on absent-object queries (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
8. Outlook and Significance
Spatiotemporal-Semantic Contrastive Decoding establishes a paradigm in which evidence grounding for video-language generation is enforced via structured, interpretable negatives throughout the decoding process. This removes reliance on heuristic or random perturbations and explicitly addresses the structural dependencies intrinsic to video data. A plausible implication is the modular extensibility of the framework: explicit negatives for other modalities (audio, sensor) or more fine-grained semantic cues can be directly incorporated, and self-diagnostic, adaptive weighting schemes enable generalized, inference-time, backbone-agnostic deployment (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).