Spatiotemporal-Semantic Contrastive Decoding
- Spatiotemporal-Semantic Contrastive Decoding (SSCD) mitigates hallucination in VideoLLMs by constructing counterfactual negative features that target spatial, temporal, and semantic inconsistencies.
- It employs a lightweight disruptor network to inject perturbations into video feature encodings, ensuring evidence-grounded and accurate token selection.
- Empirical results across multiple benchmarks demonstrate that SSCD significantly improves accuracy and reduces false outputs compared to conventional heuristic methods.
Spatiotemporal-Semantic Contrastive Decoding (SSCD) is a decoding paradigm for video LLMs (VideoLLMs) that directly mitigates hallucination—generation of outputs that are inconsistent with explicit video content or factual evidence—by leveraging structured negative features targeting the underlying spatiotemporal and semantic dependencies of video data. Conventional decoding and heuristic perturbation methods inadequately address the root causes of hallucination and fine-grained temporal-semantic correlations, reducing their effectiveness and generalization. SSCD offers a systematic approach for constructing and leveraging counterfactual (negative) features, enabling robust, evidence-grounded generation without compromising task performance (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
1. Hallucination in VideoLLMs: Roots and Challenges
Hallucination in VideoLLMs encompasses several patterns distinctive to the video modality:
- Spatial hallucination: The model invents objects/attributes not visible in any frame.
- Temporal hallucination: The model asserts event order or causality contrary to the actual sequence.
- Semantic hallucination: The model misattributes objects, actions, or attributes inconsistent with the video.
Prior contrastive decoding methods—such as randomly perturbing spatial or temporal information—mostly rely on heuristics and fail to precisely capture the causal patterns leading to hallucination. This inadequacy is especially apparent in complex, multi-event videos or queries demanding nuanced temporal and semantic reasoning (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
2. Core Methodology: Construction and Use of Negative Features
Spatiotemporal-Semantic Contrastive Decoding systematically constructs negative features that deliberately disrupt the spatial, temporal, and semantic consistency of the video encoding, and at inference penalizes tokens favored by these corrupted features.
a. Spatiotemporal Disruption:
A lightweight "disruptor" network is trained to inject residual perturbations into the frozen VideoLLM's encoded features so as to break long-range coherence. For each video, the framewise encoded tokens $F$ are perturbed residually, $F^{-} = F + D_\theta(F)$, where $D_\theta$ denotes the disruptor.
Cycle-consistency in a sparse video graph is minimized, encouraging the disruptor to break temporal chains and inter-frame dependencies (Eqs. 6–11 in (Gao et al., 30 Jan 2026)).
b. Semantic Disruption:
The disruptor is also trained to minimize the mutual information between the disrupted visual features and the ground-truth answer, yielding negative features that are maximally "hallucination-prone" (Eqs. 12–15).
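As a toy illustration of the residual disruption above, the following sketch perturbs frozen frame features with a small MLP; all shapes, names, and the two-layer MLP architecture are illustrative assumptions, not the paper's actual design:

```python
import numpy as np

rng = np.random.default_rng(0)

def disruptor(features, W1, W2):
    # Hypothetical lightweight disruptor: a two-layer MLP whose output is
    # added as a residual perturbation to the frozen video features.
    hidden = np.maximum(features @ W1, 0.0)   # ReLU bottleneck
    return features + hidden @ W2             # residual injection

# Toy dimensions: 8 frames, 16-dim frame tokens, 4-dim bottleneck.
F = rng.standard_normal((8, 16))
W1 = 0.1 * rng.standard_normal((16, 4))
W2 = 0.1 * rng.standard_normal((4, 16))

F_neg = disruptor(F, W1, W2)
print(F_neg.shape)   # same shape as F; content is perturbed
```

In the full method, the disruptor's parameters would be trained against the cycle-consistency and mutual-information objectives while the backbone stays frozen; here the weights are random only to show the data flow.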
c. Model-aware Negatives:
Other frameworks, such as MACD, use model-guided counterfactual generation. Here, object/region masks are constructed and optimized—by gradient ascent on the model's NLL—so as to identify and "knock out" the most influential regions, generating targeted spatial, temporal, and semantic negatives (Xiao et al., 2 Feb 2026).
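The mask-optimization idea behind such model-aware negatives can be sketched with a toy stand-in for the model's NLL; the linear loss, region count, and finite-difference gradients below are illustrative assumptions rather than MACD's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(1)
F = rng.standard_normal((4, 8))    # 4 regions x 8-dim features (toy)
w = rng.standard_normal(8)         # stand-in for the model's readout weights

def nll(mask):
    # Toy stand-in for the model's negative log-likelihood on the correct
    # answer; knocking out influential regions (mask -> 0) should raise it.
    return -(F * mask[:, None]).sum(axis=0) @ w

mask, lr, eps = np.ones(4), 0.5, 1e-4
for _ in range(50):
    # Finite-difference gradient of the NLL w.r.t. each mask strength.
    grad = np.array([(nll(mask + eps * np.eye(4)[i]) -
                      nll(mask - eps * np.eye(4)[i])) / (2 * eps)
                     for i in range(4)])
    mask = np.clip(mask + lr * grad, 0.0, 1.0)   # gradient *ascent* on NLL

# Regions driven toward 0 are the ones the model leans on most.
print(mask)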
3. Contrastive Decoding Framework: Inference Objectives
The contrastive decoding mechanism uses two parallel branches—
- Positive: real video features
- Negative: disrupted/counterfactual features
At each decoding timestep, the model computes the standard logits on both streams and forms a contrastive logit

$\ell_t^{\mathrm{CD}} = (1+\alpha)\,\ell_t^{+} - \alpha\,\ell_t^{-}$

where $\alpha > 0$ scales the negative penalty. A plausibility filter restricts candidate tokens to those whose positive-branch probability exceeds a relative threshold, mitigating over-penalization of plausible continuations. The final token is selected from this filtered set, either greedily or by sampling (Gao et al., 30 Jan 2026, Xiao et al., 2 Feb 2026).
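A minimal greedy decoding step with a contrastive logit and plausibility filter might look as follows; the $(1+\alpha)$ scaling follows the common contrastive-decoding convention, and the specific alpha/beta values and toy logits are assumptions:

```python
import numpy as np

def contrastive_decode(logits_pos, logits_neg, alpha=1.0, beta=0.1):
    # Contrastive token scoring with an adaptive plausibility filter:
    # alpha scales the negative penalty, beta the relative threshold.
    p_pos = np.exp(logits_pos - logits_pos.max())
    p_pos /= p_pos.sum()
    # Keep only tokens whose positive-branch probability is at least
    # beta times the maximum probability.
    keep = p_pos >= beta * p_pos.max()
    scores = (1 + alpha) * logits_pos - alpha * logits_neg
    scores = np.where(keep, scores, -np.inf)
    return int(np.argmax(scores))        # greedy selection

logits_pos = np.array([2.0, 1.8, 0.1, -1.0])
logits_neg = np.array([0.5, 1.9, 0.0, -1.0])
token = contrastive_decode(logits_pos, logits_neg)
print(token)  # -> 0
```

Token 1 is nearly as probable as token 0 under the positive branch, but it is also strongly favored by the hallucination-prone negative branch, so the contrast pushes selection to token 0.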
In adaptive approaches such as SEASON, multiple negatives (spatial, temporal, and semantic) are fused. Per-token "self-diagnostic" divergence weights, derived from attention distribution differences, dynamically calibrate each negative branch's impact (Wu et al., 4 Dec 2025).
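A per-token divergence weight of this kind can be derived from the attention distributions of the two branches; a Jensen–Shannon divergence is one plausible instantiation (the choice of divergence and the toy attention vectors below are assumptions, not SEASON's exact formula):

```python
import numpy as np

def js_divergence(p, q):
    # Symmetric, bounded divergence between two attention distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: np.sum(a * np.log(a / b))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Attention over 4 video tokens under the positive vs. a negative branch.
attn_pos = np.array([0.4, 0.3, 0.2, 0.1])
attn_neg = np.array([0.1, 0.2, 0.3, 0.4])
w = js_divergence(attn_pos, attn_neg)   # larger divergence -> stronger penalty
print(round(w, 4))
```

The weight would then scale that negative branch's contribution in the fused contrastive logit: branches whose attention diverges most from the positive view are the riskiest and get penalized hardest.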
4. Unified Inference Algorithms
A generic pseudocode pattern for SSCD-style methods is as follows:
- Extract video features: Encode video frames to obtain framewise feature tokens.
- Construct negatives:
- SSCD: Apply the trained disruptor to the encoded features to obtain disrupted negatives.
- MACD: Optimize object/frame mask strengths by gradient ascent on model loss; generate targeted counterfactual input.
- SEASON: Add Gaussian noise for spatial negatives, homogenize features across frames for temporal negatives, or generate more complex semantic negatives.
- Project features into model space: For both positive and negative streams.
- Contrastive scoring: Compute contrastive logits, possibly with dynamic weighting, and select candidate tokens by plausibility.
- Token selection: Greedy or sampled from filtered, contrastively-calibrated distribution (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
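The steps above can be strung together in a toy end-to-end sketch; every component here (the random "encoder", the noise-based spatial negative, the mean-pooled language head, the dimensions) is an illustrative stand-in, not any paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
VOCAB, DIM, FRAMES = 6, 16, 4
W_lm = 0.5 * rng.standard_normal((DIM, VOCAB))   # stand-in language head

def encode_video():
    return rng.standard_normal((FRAMES, DIM))    # 1. extract video features

def make_negative(F):
    # 2. construct a SEASON-style spatial negative by adding Gaussian noise.
    return F + 0.5 * rng.standard_normal(F.shape)

def logits(F):
    return F.mean(axis=0) @ W_lm                 # 3. project pooled features

F = encode_video()
l_pos, l_neg = logits(F), logits(make_negative(F))

alpha, beta = 1.0, 0.1
p = np.exp(l_pos - l_pos.max()); p /= p.sum()
keep = p >= beta * p.max()                       # 4. plausibility filter
scores = np.where(keep, (1 + alpha) * l_pos - alpha * l_neg, -np.inf)
token = int(np.argmax(scores))                   # 5. greedy token selection
print(token)
```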
5. Theoretical Rationale: Why Contrastive Negatives Suppress Hallucination
SSCD directly targets the core patterns responsible for hallucination by constructing negatives that decouple only spatiotemporal and semantic grounding, rather than globally perturbing features. Subtractive contrast at the logits level discourages the selection of tokens that are more probable under the hallucination-prone (negative) views, while softly retaining the original model’s token distribution. This approach maintains general video understanding and reasoning performance, as only a lightweight component is trained or adaptively optimized, with the VideoLLM backbone frozen (Gao et al., 30 Jan 2026).
Model-aware strategies (MACD) additionally align the negative features with model-specific vulnerabilities, focusing the contrastive penalty directly on the evidence most responsible for hallucination (Xiao et al., 2 Feb 2026).
6. Broader Frameworks and Extensions
SEASON demonstrates that this method is extensible to a unified, inference-time, training-free implementation using self-diagnostic token-wise risk scores, blending spatial, temporal, and semantic negatives with dynamic attention-based weights. The framework allows further refinement (e.g., introducing separate contrastive strengths for each of the spatial, temporal, and semantic negative families, or normalization constraints across them). The negative families can be realized by object masking, action swaps, or attribute perturbation, isolating specific sources of hallucination (Wu et al., 4 Dec 2025).
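Fusing the three negative families with separate strengths can be sketched as follows; the per-family alphas, their values, and the $(1+\sum_k \alpha_k)$ positive scaling are assumptions chosen to stay consistent with the single-negative contrastive form:

```python
import numpy as np

def fused_contrastive(l_pos, negs, alphas):
    # Fuse several negative branches (spatial/temporal/semantic) with
    # per-family strengths; their sum acts as the total penalty budget.
    total = sum(alphas)
    out = (1 + total) * l_pos
    for l_neg, a in zip(negs, alphas):
        out -= a * l_neg
    return out

l_pos = np.array([1.0, 0.5, -0.2])
negs = [np.array([0.9, 0.1, 0.0]),   # spatial negative logits
        np.array([0.2, 0.6, 0.0]),   # temporal negative logits
        np.array([0.1, 0.1, 0.3])]   # semantic negative logits
scores = fused_contrastive(l_pos, negs, alphas=[0.5, 0.3, 0.2])
print(scores)  # -> [ 1.47  0.75 -0.46]
```

With alphas normalized to sum to 1, the fused rule reduces to the single-negative rule applied to a convex mixture of the negative logits, which keeps the penalty scale comparable across configurations.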
7. Empirical Evaluation and Comparative Performance
SSCD has been evaluated across major VideoLLM backbones, including Video-LLaVA, LLaVA-NeXT-Video, Qwen-VL, and InternVL models (Gao et al., 30 Jan 2026, Xiao et al., 2 Feb 2026). Key benchmarks include VideoHallucer, EventHallusion, VideoHallu, ActivityNet-QA, MMVU, MVBench, and Perception-test. Experimental outcomes demonstrate:
| Model | Benchmark | Base Acc. | SSCD/MACD Acc. | Notable Gains |
|---|---|---|---|---|
| Video-LLaVA | VideoHallucer | 14.3 | 22.9 | ORH ↑35.5→46.5; TH ↑9.7→29.5 |
| LLaVA-NeXT | VideoHallucer | 32.3 | 35.4 | ORH ↑58.0→60.5; TH ↑21.6→32.4 |
| Qwen2.5-VL-7B | EventHallusion | 0.44 | 0.61 | F1 ↑0.44→0.67 |
SSCD reduces hallucination rates across all subtypes without adversely impacting standard QA and reasoning metrics. MACD provides further improvements in scenarios involving small, occluded, or co-occurring objects, as well as strong reduction in false "yes" rates on absent-object queries (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).
8. Outlook and Significance
Spatiotemporal-Semantic Contrastive Decoding establishes a paradigm in which evidence grounding for video-language generation is enforced via structured, interpretable negatives throughout the decoding process. This removes reliance on heuristic or random perturbations and explicitly addresses the structural dependencies intrinsic to video data. A plausible implication is the modular extensibility of the framework: explicit negatives for other modalities (audio, sensor) or more fine-grained semantic cues can be directly incorporated, and self-diagnostic, adaptive weighting schemes enable generalized, inference-time, backbone-agnostic deployment (Gao et al., 30 Jan 2026, Wu et al., 4 Dec 2025, Xiao et al., 2 Feb 2026).