Anatomical Region-Guided Contrastive Decoding
- ARCD is a region-guided inference strategy for MedVLMs that uses explicit anatomical masks to steer output toward user-defined regions.
- It employs a three-tiered contrastive decoding architecture—at token, attention, and logits levels—to enhance factual alignment without requiring retraining.
- Empirical results demonstrate improved diagnostic accuracy across modalities, with gains of up to +8.66% in fine-tuned settings and reduced hallucination rates.
Anatomical Region-Guided Contrastive Decoding (ARCD) is a plug-and-play inference-time intervention for medical vision-language models (MedVLMs), designed to mitigate hallucinations by explicitly steering the model’s attention and response generation toward user-specified anatomical regions. By integrating an anatomical mask into the decoding process, ARCD implements a three-tiered contrastive reweighting architecture (at the token, attention, and logits levels) without requiring retraining or costly expert annotations. The approach demonstrably improves factual alignment of MedVLM outputs with visual evidence across diverse imaging modalities, robustly reducing hallucination rates and enhancing diagnostic accuracy (Liang et al., 19 Dec 2025).
1. Anatomical Mask Generation
ARCD relies on explicit specification of a region of interest (ROI) within the input medical image $I \in \mathbb{R}^{H \times W}$. The ROI is defined by a binary mask $M \in \{0,1\}^{H \times W}$, with $M(i,j) = 1$ denoting pixels inside the anatomical region. Mask generation can be accomplished by expert annotation, a pre-trained segmentation network (e.g., PSPNet, MedSAM), or via conversion from a bounding-box annotation.
To integrate this spatial information with the vision transformer (ViT)-based encoder, the mask is downsampled into:
- A global mask $M_g$ (size $h_p \times w_p$) aligned with the ViT patch-token grid.
- A local, finer-grid mask $M_l$ (size $k_h h_p \times k_w w_p$), where $k_h$ and $k_w$ are user-defined grid factors.
Both masks are serialized row-wise, with row separators, and concatenated to form a unified mask vector $m = [\operatorname{vec}(M_g);\, \operatorname{vec}(M_l)]$.
This vector is injected into the MedVLM self-attention modules to direct attention during downstream processing.
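The mask-preparation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the max-pooling rule for grid membership, and the separator token id are all assumptions made for the example.

```python
import numpy as np

def downsample_mask(mask: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Pool a binary pixel mask onto a grid_h x grid_w token grid.

    A grid cell counts as inside the ROI if any pixel it covers is inside
    (an assumed max-pooling rule).
    """
    H, W = mask.shape
    out = np.zeros((grid_h, grid_w), dtype=np.int64)
    for i in range(grid_h):
        for j in range(grid_w):
            block = mask[i * H // grid_h:(i + 1) * H // grid_h,
                         j * W // grid_w:(j + 1) * W // grid_w]
            out[i, j] = int(block.any())
    return out

def serialize_with_separators(grid: np.ndarray, sep: int = -1) -> list:
    """Flatten row-wise, appending a (hypothetical) separator id after each row."""
    tokens = []
    for row in grid:
        tokens.extend(int(v) for v in row)
        tokens.append(sep)
    return tokens

# Toy 8x8 image mask with a 4x4 ROI in the top-left corner.
mask = np.zeros((8, 8), dtype=np.int64)
mask[:4, :4] = 1
global_grid = downsample_mask(mask, 2, 2)   # coarse grid aligned with patch tokens
local_grid = downsample_mask(mask, 4, 4)    # finer user-defined grid
unified = serialize_with_separators(global_grid) + serialize_with_separators(local_grid)
```

The unified vector interleaves both resolutions so the self-attention modules can consume coarse and fine ROI structure in a single sequence.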
2. Three-Tiered Contrastive Decoding Architecture
At each autoregressive decoding step $t$, ARCD maintains two parallel branches:
- The unguided branch (standard decoding, no ROI guidance).
- The guided branch (conditioned on the region mask $M$).
These branches are fused contrastively at three core levels:
(a) Token-Level Reweighting
Token embeddings are reweighted depending on mask membership. In the standard instantiation, the unguided branch uses $\tilde{e}_i = (\epsilon\, m_i + 1 - m_i)\, e_i$, where $m_i \in \{0,1\}$ denotes ROI membership for token $i$ and $\epsilon \ll 1$ (typically $0.01$). This suppresses ROI information in the unguided branch, enhancing the contrastive signal.
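A minimal sketch of this token-level suppression in the unguided branch, assuming $\epsilon = 0.01$; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def reweight_tokens(embeddings: np.ndarray, roi_mask: np.ndarray,
                    eps: float = 0.01) -> np.ndarray:
    """Scale ROI token embeddings by eps in the *unguided* branch.

    embeddings: (num_tokens, dim) visual token embeddings e_i
    roi_mask:   (num_tokens,) binary ROI membership m_i
    """
    # eps for tokens inside the ROI, 1.0 (unchanged) outside
    scale = np.where(roi_mask.astype(bool), eps, 1.0)
    return embeddings * scale[:, None]

emb = np.ones((4, 3))            # four toy tokens, dim 3
m = np.array([1, 0, 1, 0])       # tokens 0 and 2 lie inside the ROI
out = reweight_tokens(emb, m)
```

Because only the unguided branch sees the suppressed embeddings, the divergence between the two branches concentrates on the ROI content.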
(b) Attention-Level Reweighting
Self-attention heads are biased to up-weight region tokens via a multiplicative factor $\gamma > 1$ applied to ROI-token attention weights. Attention softmax probabilities are thereby adaptively increased for ROI tokens.
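One way to realize this multiplicative up-weighting is shown below: multiplying a key's unnormalized softmax weight by $\gamma$ is equivalent to adding $\log \gamma$ to its pre-softmax score before renormalizing. The additive-bias formulation and the value $\gamma = 2$ are assumptions for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def roi_biased_attention(scores: np.ndarray, roi_mask: np.ndarray,
                         gamma: float = 2.0) -> np.ndarray:
    """Up-weight ROI keys by a factor gamma inside the attention softmax.

    Adding log(gamma) to a key's pre-softmax score multiplies its
    unnormalized weight by gamma; renormalization happens in softmax.
    """
    bias = np.log(gamma) * roi_mask
    return softmax(scores + bias, axis=-1)

scores = np.zeros(4)                 # uniform pre-softmax scores over 4 keys
roi = np.array([1.0, 0.0, 0.0, 0.0]) # only key 0 is inside the ROI
probs = roi_biased_attention(scores, roi, gamma=2.0)
```

With uniform scores and $\gamma = 2$, the ROI key ends up with twice the attention probability of each non-ROI key.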
(c) Logits-Level Reweighting
Given candidate next-token log-probability distributions $\log p_g$ (guided) and $\log p_u$ (unguided), ARCD fuses these as a convex combination: $\log p_t = \lambda \log p_g + (1 - \lambda) \log p_u$ for $\lambda \in [0, 1]$. This drives final token selection toward region-grounded generations.
3. Contrastive Objective and Decoding Procedure
ARCD operates purely at inference, with no retraining required. For each candidate sequence $y$, the decoding objective combines guided likelihood with a region-contrastive term: $\mathcal{J}(y) = \log p_g(y \mid x) + \beta\, \Delta(y)$, where $\Delta(y) = \log p_g(y \mid x) - \log p_u(y \mid x)$ serves as a region-contrastive loss. The hyperparameter $\beta$ (typically $1$) mediates the tradeoff between fidelity and region grounding.
At each decoding step:
- Both guided and unguided logits are computed.
- Token, attention, and logits reweighting are applied as above.
- Next-token selection proceeds via greedy search or beam search over the fused logits.
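The per-step procedure above can be tied together in a toy greedy loop. The two stub "branches" below are fixed logit tables standing in for guided and unguided forward passes; the stub behavior, $\lambda = 0.7$, and the EOS convention are all assumptions for the example.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def greedy_decode(step_guided, step_unguided, max_steps: int = 5,
                  lam: float = 0.7, eos: int = 0) -> list:
    """Greedy search over fused guided/unguided log-probabilities.

    step_guided / step_unguided map a partial sequence to next-token logits,
    standing in for the two branch forward passes.
    """
    seq = []
    for _ in range(max_steps):
        lg = log_softmax(step_guided(seq))    # guided branch
        lu = log_softmax(step_unguided(seq))  # unguided branch
        tok = int(np.argmax(lam * lg + (1 - lam) * lu))
        if tok == eos:
            break
        seq.append(tok)
    return seq

# Stub branches over a 3-token vocabulary (token 0 = EOS):
# guided favors token 2 first, unguided favors token 1; both then emit EOS.
def guided(seq):
    return np.array([0.0, 1.0, 3.0]) if not seq else np.array([5.0, 0.0, 0.0])

def unguided(seq):
    return np.array([0.0, 3.0, 1.0]) if not seq else np.array([5.0, 0.0, 0.0])

decoded = greedy_decode(guided, unguided)
```

With $\lambda = 0.7$ the fused scores side with the guided branch, so decoding emits token 2 and then stops; beam search would replace the argmax with a top-$k$ frontier over the same fused scores.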
4. Empirical Performance and Ablation
ARCD was evaluated across chest X-ray (MIMIC-Ext-VQA), abdominal CT and brain MRI (SLAKE), and ocular ultrasound (OBScan) datasets, in both zero-shot (PubMedVision-only) and task-specific fine-tuning settings. Principal metrics include closed-question accuracy and open-question token-recall.
| Dataset / Setting | Baseline | VCD | DoLa | OPERA | ARCD (ours) | Δ vs. Baseline |
|---|---|---|---|---|---|---|
| MIMIC-Ext-VQA (ZS) | 47.24 | 48.43 | 47.24 | 50.00 | 50.79 | +3.55 |
| SLAKE (ZS) | 50.00 | 50.79 | 53.54 | 50.79 | 55.11 | +5.11 |
| OBScan (ZS) | 61.81 | 62.99 | 61.02 | 62.06 | 65.75 | +3.94 |
| MIMIC (FT) | 69.29 | 69.69 | 72.05 | 71.65 | 77.95 | +8.66 |
| SLAKE (FT) | 82.28 | 81.10 | 81.89 | 82.28 | 83.07 | +0.79 |
| OBScan (FT) | 89.76 | 89.76 | 89.37 | 87.40 | 92.13 | +2.37 |
GPT-4o-based hallucination assessment for Med-Phi3.5V:
- Zero-shot: “Correct” ↑ from 42%→56%, Hallucination/Factual Contradiction ↓ from 28%→14%.
- Fine-tuned: “Correct” ↑ from 68%→78%, Hallucination ↓ from 16%→7%.
Ablation studies on SLAKE revealed that each reweighting tier (token-only, attention-only, and logits-only) contributes incrementally, with all three tiers combined yielding the largest zero-shot accuracy gain relative to baseline.
5. Practical Considerations and Limitations
Computational Overhead
ARCD introduces additional inference latency relative to standard greedy decoding, since each step requires two forward passes (one per branch) plus the three reweighting operations.
Reliance on Mask Quality
The method is contingent upon the availability of accurate ROI masks. When employing noisy PSPNet-generated lung masks (segmentation quality of only ≈37% at a 0.5 overlap threshold) on MIMIC, ARCD still delivered a VQA accuracy gain, indicating some robustness to noisy masks. However, severely incorrect masks may misdirect the model, reducing grounding efficacy.
Potential Extensions
Possible modifications include:
- Learnable token scoring functions for adaptive token weighting.
- Soft masks reflecting uncertainty or probabilistic region definitions.
- Multi-region steering for simultaneous focus on multiple anatomical areas.
- Integration with retrieval-augmented or preference optimization frameworks to further suppress hallucination.
6. Significance and Broader Implications
ARCD constitutes a modular, training-free, and region-specific solution to the problem of hallucination in MedVLMs. Its architecture leverages the natural alignment between anatomical segmentation and medical reporting tasks to reinforce factuality and region-groundedness. By complementing training-free global interventions with precise, spatially localized guidance, ARCD achieves marked improvements (up to +8.66% accuracy in fine-tuned settings) in both diagnostic accuracy and factual correctness relative to prior baselines. This approach opens avenues for robust clinical deployment of MedVLMs where reliable visual grounding is paramount (Liang et al., 19 Dec 2025).