Structure-Disrupted Contrastive Decoding (SDCD)
- The paper introduces SDCD, a training-free algorithm that mitigates object hallucination by penalizing tokens with high confidence under shuffled visual inputs.
- SDCD performs contrastive calibration by comparing logits from original and structure-disrupted (shuffled) patch views to enhance alignment with true object structures.
- SDCD demonstrates significant improvements across benchmarks, reducing hallucination by up to 63% and boosting performance on metrics like POPE and CHAIR.
Structure-Disrupted Contrastive Decoding (SDCD) is a training-free, inference-time algorithm designed to mitigate object hallucination in Large Vision-Language Models (LVLMs) by addressing visual statistical bias introduced by modern vision encoders. SDCD operates by performing contrastive calibration of autoregressive token generation, penalizing generation candidates that remain high-confidence under a shuffled, structure-disrupted view of the input image. Through this mechanism, SDCD suppresses texture-driven spurious confidence, reinforcing alignment between generated language and genuine structural cues in the visual input (Xia et al., 7 Jan 2026).
1. Visual Statistical Bias and the Bag-of-Patches Phenomenon
Object hallucination in LVLMs is conventionally attributed to overreliance on language priors and high-level statistical biases. However, a significant contributing factor is a visual statistical bias, rooted in the behavior of contemporary vision encoders, such as those derived from Vision Transformers (ViT/CLIP). These encoders partition an image into patches and aggregate patch information via self-attention. Empirical observations indicate that ViT/CLIP-based encoders retain local texture statistics even when patch positions are permuted randomly, manifesting a "Bag-of-Patches" property. This leads to insensitivity to global spatial arrangement (holistic geometry) and results in overreliance on local textures.
The consequence is that LVLMs may assign high confidence to the presence of objects whenever local texture cues coincide with object priors—for example, interpreting fur-like textures as evidence of a cat—regardless of the true global structure. This effect is formalized by "Structure Sensitivity Divergence": the logits associated with actual object tokens drop sharply when presented with a structure-disrupted image, while hallucination-prone tokens (those based on local textures) may persist or increase (Xia et al., 7 Jan 2026).
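The structure-disrupted view underlying this probe can be constructed by cutting the image into a patch grid and permuting the patches. A minimal NumPy sketch (the function name and grid convention are illustrative, not the paper's code):

```python
import numpy as np

def shuffle_patches(image: np.ndarray, grid: int, rng=None) -> np.ndarray:
    """Randomly permute the positions of non-overlapping patches.

    image: (H, W, C) array; H and W must be divisible by `grid`.
    grid:  number of patches per side (grid x grid patches in total).
    """
    rng = np.random.default_rng() if rng is None else rng
    H, W, C = image.shape
    ph, pw = H // grid, W // grid
    # Split into a stack of (grid*grid) patches of shape (ph, pw, C).
    patches = (image.reshape(grid, ph, grid, pw, C)
                    .transpose(0, 2, 1, 3, 4)
                    .reshape(grid * grid, ph, pw, C))
    # Permute patch positions; each patch's internal texture is untouched.
    patches = patches[rng.permutation(grid * grid)]
    # Reassemble into an image of the original shape.
    return (patches.reshape(grid, grid, ph, pw, C)
                   .transpose(0, 2, 1, 3, 4)
                   .reshape(H, W, C))
```

Because each patch survives intact, local texture statistics are preserved while global geometry is destroyed, which is exactly the property the "Bag-of-Patches" probe exploits.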
2. SDCD Objective and Contrastive Calibration
SDCD introduces a structure-disrupted view by randomly permuting the spatial positions of image patches. Let $x$ denote the textual prompt, $v$ the original sequence of visual patches, and $\tilde{v} = \pi(v)$ the structure-disrupted sequence induced by a random permutation $\pi$.
For each token generation step $t$, the base model computes logits for candidate token $y$ from the original view, $\ell_t(y) = f_\theta(y \mid x, v, y_{<t})$, and, for the shuffled view, $\tilde{\ell}_t(y) = f_\theta(y \mid x, \tilde{v}, y_{<t})$. SDCD defines the contrastively calibrated logit as $\ell_t^{\mathrm{SDCD}}(y) = (1+\alpha)\,\ell_t(y) - \alpha\,\tilde{\ell}_t(y)$, where $\alpha \geq 0$ is the contrastive weight hyperparameter.
The corresponding token distribution is $p_t^{\mathrm{SDCD}}(y) = \operatorname{softmax}\big(\ell_t^{\mathrm{SDCD}}(y)\big)$.
Tokens with high confidence under the shuffled view (indicating texture-driven bias) are penalized, while those whose confidence drops with shuffling (indicating structural dependence) are maintained or amplified.
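The calibration step itself is a one-liner. A sketch, assuming the standard contrastive-decoding parameterization $(1+\alpha)\,\ell - \alpha\,\tilde{\ell}$; the toy logits and vocabulary are illustrative:

```python
import numpy as np

def sdcd_calibrate(logits, shuffled_logits, alpha=1.0):
    """Contrastive calibration: amplify confidence that depends on structure,
    penalize confidence that survives patch shuffling."""
    logits = np.asarray(logits, dtype=float)
    shuffled_logits = np.asarray(shuffled_logits, dtype=float)
    return (1.0 + alpha) * logits - alpha * shuffled_logits

def softmax(z):
    z = np.asarray(z, dtype=float)
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Toy vocabulary: ["cat", "pillow", "fur"].
orig     = np.array([3.0, 2.5, 1.0])    # "cat" slightly ahead of "pillow"
shuffled = np.array([3.0, 0.5, 1.0])    # "cat" confidence survives shuffling
cal = sdcd_calibrate(orig, shuffled, alpha=1.0)
# Texture-driven "cat" is penalized; structure-dependent "pillow" now leads.
```

Note that at $\alpha = 0$ the calibration is the identity, recovering regular decoding, which matches the ablation baseline below.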
3. Inference Algorithm and Hyperparameterization
SDCD operates in a purely inference-time, training-free manner. The core procedure is as follows:
- Split the input image into an $n \times n$ grid of non-overlapping patches to form $v$.
- Randomly permute the patches to generate $\tilde{v} = \pi(v)$.
- Initialize the generated sequence with a beginning-of-sequence token.
- For each timestep $t$:
- Compute logits $\ell_t(y) = f_\theta(y \mid x, v, y_{<t})$
- Compute structure-disrupted logits $\tilde{\ell}_t(y) = f_\theta(y \mid x, \tilde{v}, y_{<t})$
- Form SDCD logits: $\ell_t^{\mathrm{SDCD}}(y) = (1+\alpha)\,\ell_t(y) - \alpha\,\tilde{\ell}_t(y)$
- Compute SDCD token probabilities via softmax.
- Sample $y_t$ from $p_t^{\mathrm{SDCD}}$ using a selected sampling strategy (greedy, nucleus sampling, etc.).
- Append $y_t$ to the generated sequence.
- Output the generated text.
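The steps above can be sketched end-to-end with a stub scoring function standing in for the LVLM (greedy sampling for brevity; `model_logits`, `eos_id`, and the token ids are placeholders, not the paper's API):

```python
import numpy as np

def sdcd_decode(model_logits, prompt, image, shuffled_image,
                alpha=1.0, eos_id=0, max_steps=20):
    """Greedy SDCD decoding loop.

    model_logits(prompt, img, tokens) -> 1-D array of next-token logits.
    """
    tokens = []                                      # starts after BOS
    for _ in range(max_steps):
        l      = model_logits(prompt, image, tokens)
        l_shuf = model_logits(prompt, shuffled_image, tokens)
        calibrated = (1.0 + alpha) * l - alpha * l_shuf
        y = int(np.argmax(calibrated))               # greedy; nucleus also works
        tokens.append(y)
        if y == eos_id:
            break
    return tokens
```

Both forward passes share the same prompt and generated prefix; only the visual input differs, so the loop costs roughly two forward passes per token, as in other contrastive decoding schemes.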
Hyperparameters and defaults:
- Shuffle grid size $n$: patch-level shuffling is preferred; coarser grids can be used for coarser models.
- Contrastive weight $\alpha$.
- (Optional) Plausibility threshold $\beta$ to filter candidates that are implausible under the original view.
- Sampling: nucleus sampling (top-$p$, temperature $1.0$).
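One common realization of the optional plausibility threshold, borrowed from the adaptive plausibility constraint used in contrastive decoding (whether SDCD uses exactly this form is an assumption), masks out tokens whose original-view probability falls below $\beta$ times that of the most likely token:

```python
import numpy as np

def plausibility_mask(orig_logits, calibrated_logits, beta=0.1):
    """Restrict sampling to tokens plausible under the original view.

    Tokens with original-view probability below beta * max probability
    get -inf in the calibrated logits, excluding them from sampling.
    """
    z = np.asarray(orig_logits, dtype=float)
    z = z - z.max()
    p = np.exp(z) / np.exp(z).sum()
    keep = p >= beta * p.max()
    return np.where(keep, np.asarray(calibrated_logits, dtype=float), -np.inf)
```

This guards against the contrastive term promoting tokens the base model itself considers implausible.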
4. Empirical Evaluation across Hallucination and Reasoning Benchmarks
SDCD has been systematically evaluated on discriminative object-existence tasks (POPE, MSCOCO/Random), multimodal reasoning (MME suite with LLaVA-1.5), and open-ended image captioning with hallucination metrics (CHAIR). The following experimental results are reported (Xia et al., 7 Jan 2026):
| Metric / Task | Regular | VCD | SDCD |
|---|---|---|---|
| POPE Accuracy (%) | 82.93 | 84.87 | 85.90 |
| POPE F1 (%) | 80.87 | 83.37 | 84.56 |
| MME Perception (LLaVA-1.5 total) | 1229.9 | 1292.0 | 1348.4 |
| MME Cognition (LLaVA-1.5 total) | 307.1 | 286.4 | 338.9 |
| CHAIR_S (sentence-level hallucination, lower is better) | 55.6 | 57.0 | 18.6 |
| CHAIR_I (instance-level hallucination, lower is better) | 17.3 | 16.9 | 6.4 |
| CHAIR F1 (%) | 72.2 | 74.4 | 70.7 |
SDCD achieves a 63% reduction in hallucinated objects compared to regular decoding on CHAIR metrics, and consistent improvements in object-existence F1 and accuracy on POPE benchmarks. Performance gains extend to “Popular” and “Adversarial” POPE splits, further test datasets (A-OKVQA, GQA), and across LVLM architectures (including Qwen2.5-VL).
SDCD also enhances reasoning performance on MME structural tasks (existence, position) and code reasoning, which suggests broader benefits in multimodal understanding beyond hallucination suppression.
5. Ablation Studies: Shuffle Granularity and Contrastive Weight
Ablations examine the dependence of SDCD performance on the shuffle granularity $n$ and the contrastive calibration weight $\alpha$.
Shuffle Granularity ($n$):
- Patch-level shuffle: F1 = 84.56%
- Coarser shuffle: F1 = 81.85%
- Very coarse shuffle: F1 = 80.44%
Performance is optimal when shuffling is performed at the patch level, indicating the sensitivity of the texture-vs-structure signal to shuffle scale.
Contrastive Weight ($\alpha$):
- $\alpha = 0$ (no contrast): F1 = 82.63%
- Successively larger $\alpha$: F1 = 83.77%, 84.25%, 84.51%
Increasing $\alpha$ boosts recall at a modest cost in precision. A plausible implication is that tuning $\alpha$ lets users trade off strict hallucination suppression against broader language generation recall.
6. Qualitative Impact and Mechanism
A representative example demonstrates SDCD's suppression of texture-driven hallucination. For an image of a pillow with fur-like texture (and no cat present), regular decoding produces: “A cat is sitting on a pillow with soft white fur.” (hallucinated “cat”). In contrast, SDCD yields: “A pillow with a soft fur-patterned cover lies on a surface.” (correct; no cat).
SDCD penalizes candidate tokens (such as “cat”) that remain high-confidence under the shuffled view, filtering out hallucinations attributable to local texture statistics rather than global semantic structure.
7. Position within the Multimodal Model Landscape
SDCD provides a principled, training-free, inference-time procedure for mitigating LVLM hallucinations, distinct from prior approaches focusing on language priors or high-level statistical correction. By explicitly leveraging the Bag-of-Patches property and targeting the locus of visual statistical bias, SDCD directly addresses internal complexities of visual encoding pipelines in transformer-based multimodal architectures. Its efficacy across discriminative probing, open-ended captioning, and general multimodal benchmarks establishes it as a robust fit-for-purpose decoding intervention in current LVLM deployment scenarios (Xia et al., 7 Jan 2026).