Anatomical Region-Guided Contrastive Decoding
- ARCD is a region-guided inference strategy for MedVLMs that uses explicit anatomical masks to steer output toward user-defined regions.
- It employs a three-tiered contrastive decoding architecture—at token, attention, and logits levels—to enhance factual alignment without requiring retraining.
- Empirical results demonstrate improved diagnostic accuracy across modalities, with gains of up to +8.66% in fine-tuned settings and reduced hallucination rates.
Anatomical Region-Guided Contrastive Decoding (ARCD) is a plug-and-play inference-time intervention for medical vision-language models (MedVLMs), designed to mitigate hallucinations by explicitly steering the model’s attention and response generation toward user-specified anatomical regions. By integrating an anatomical mask into the decoding process, ARCD implements a three-tiered contrastive reweighting architecture (at the token, attention, and logits levels) without requiring retraining or costly expert annotations. The approach demonstrably improves factual alignment of MedVLM outputs with visual evidence across diverse imaging modalities, robustly reducing hallucination rates and enhancing diagnostic accuracy (Liang et al., 19 Dec 2025).
1. Anatomical Mask Generation
ARCD relies on explicit specification of a region of interest (ROI) within the input medical image $I \in \mathbb{R}^{H \times W}$. The ROI is defined by a binary mask $M \in \{0,1\}^{H \times W}$, with $M(i,j) = 1$ denoting pixels inside the anatomical region. Mask generation can be accomplished by expert annotation, a pre-trained segmentation network (e.g., PSPNet, MedSAM), or via conversion from a bounding-box annotation.
To integrate this spatial information with the vision transformer (ViT)-based encoder, the mask is downsampled into:
- A global mask $M_g$ (size $h_p \times w_p$) aligned with the ViT patch-token grid.
- A local, finer-grid mask $M_l$ (size $k_h h_p \times k_w w_p$), where $k_h$ and $k_w$ are user-defined grid factors.
Both masks are serialized row-wise, with row separators, and concatenated to form a unified mask vector $m = [\operatorname{vec}(M_g);\, \operatorname{vec}(M_l)]$.
This vector is injected into the MedVLM self-attention modules to direct attention during downstream processing.
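The mask-preparation steps above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the max-pooling rule for grid membership, and the separator token id are all assumptions made for the example.

```python
import numpy as np

def downsample_mask(mask: np.ndarray, grid_h: int, grid_w: int) -> np.ndarray:
    """Pool a binary pixel mask onto a grid_h x grid_w token grid.

    A grid cell counts as inside the ROI if any pixel it covers is inside
    (an assumed max-pooling rule).
    """
    H, W = mask.shape
    out = np.zeros((grid_h, grid_w), dtype=np.int64)
    for i in range(grid_h):
        for j in range(grid_w):
            block = mask[i * H // grid_h:(i + 1) * H // grid_h,
                         j * W // grid_w:(j + 1) * W // grid_w]
            out[i, j] = int(block.any())
    return out

def serialize_with_separators(grid: np.ndarray, sep: int = -1) -> list:
    """Flatten row-wise, appending a (hypothetical) separator id after each row."""
    tokens = []
    for row in grid:
        tokens.extend(int(v) for v in row)
        tokens.append(sep)
    return tokens

# Toy 8x8 image mask with a 4x4 ROI in the top-left corner.
mask = np.zeros((8, 8), dtype=np.int64)
mask[:4, :4] = 1
global_grid = downsample_mask(mask, 2, 2)   # coarse grid aligned with patch tokens
local_grid = downsample_mask(mask, 4, 4)    # finer user-defined grid
unified = serialize_with_separators(global_grid) + serialize_with_separators(local_grid)
```

The unified vector interleaves both resolutions so the self-attention modules can consume coarse and fine ROI structure in a single sequence.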
2. Three-Tiered Contrastive Decoding Architecture
At each autoregressive decoding step $t$, ARCD maintains two parallel branches:
- The unguided branch (standard decoding, no ROI guidance).
- The guided branch (conditioned on the region mask $M$).
These branches are fused contrastively at three core levels:
(a) Token-Level Reweighting
Token embeddings are reweighted depending on mask membership. In the standard instantiation, the unguided branch uses $\tilde{e}_i = (\epsilon\, m_i + 1 - m_i)\, e_i$, where $m_i \in \{0,1\}$ denotes ROI membership for token $i$ and $\epsilon \ll 1$ (typically $0.01$). This suppresses ROI information in the unguided branch, enhancing the contrastive signal.
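A minimal sketch of this token-level suppression in the unguided branch, assuming $\epsilon = 0.01$; the function name and array shapes are illustrative, not from the paper:

```python
import numpy as np

def reweight_tokens(embeddings: np.ndarray, roi_mask: np.ndarray,
                    eps: float = 0.01) -> np.ndarray:
    """Scale ROI token embeddings by eps in the *unguided* branch.

    embeddings: (num_tokens, dim) visual token embeddings e_i
    roi_mask:   (num_tokens,) binary ROI membership m_i
    """
    # eps for tokens inside the ROI, 1.0 (unchanged) outside
    scale = np.where(roi_mask.astype(bool), eps, 1.0)
    return embeddings * scale[:, None]

emb = np.ones((4, 3))            # four toy tokens, dim 3
m = np.array([1, 0, 1, 0])       # tokens 0 and 2 lie inside the ROI
out = reweight_tokens(emb, m)
```

Because only the unguided branch sees the suppressed embeddings, the divergence between the two branches concentrates on the ROI content.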
(b) Attention-Level Reweighting
Self-attention heads are biased to up-weight region tokens via a multiplicative factor $\gamma > 1$ applied to ROI-token attention weights. Attention softmax probabilities are thereby adaptively increased for ROI tokens.
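One way to realize this multiplicative up-weighting is shown below: multiplying a key's unnormalized softmax weight by $\gamma$ is equivalent to adding $\log \gamma$ to its pre-softmax score before renormalizing. The additive-bias formulation and the value $\gamma = 2$ are assumptions for illustration.

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)   # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def roi_biased_attention(scores: np.ndarray, roi_mask: np.ndarray,
                         gamma: float = 2.0) -> np.ndarray:
    """Up-weight ROI keys by a factor gamma inside the attention softmax.

    Adding log(gamma) to a key's pre-softmax score multiplies its
    unnormalized weight by gamma; renormalization happens in softmax.
    """
    bias = np.log(gamma) * roi_mask
    return softmax(scores + bias, axis=-1)

scores = np.zeros(4)                 # uniform pre-softmax scores over 4 keys
roi = np.array([1.0, 0.0, 0.0, 0.0]) # only key 0 is inside the ROI
probs = roi_biased_attention(scores, roi, gamma=2.0)
```

With uniform scores and $\gamma = 2$, the ROI key ends up with twice the attention probability of each non-ROI key.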
(c) Logits-Level Reweighting
Given candidate next-token log-probability distributions $\log p_g$ (guided) and $\log p_u$ (unguided), ARCD fuses these as a convex combination: $\log p_t = \lambda \log p_g + (1 - \lambda) \log p_u$ for $\lambda \in [0, 1]$. This drives final token selection toward region-grounded generations.
3. Contrastive Objective and Decoding Procedure
ARCD operates purely at inference, with no retraining required. For each candidate sequence $y$, the decoding objective combines guided likelihood with a region-contrastive term: $\mathcal{J}(y) = \log p_g(y \mid x) + \beta\, \Delta(y)$, where $\Delta(y) = \log p_g(y \mid x) - \log p_u(y \mid x)$ serves as a region-contrastive loss. The hyperparameter $\beta$ (typically $1$) mediates the tradeoff between fidelity and region grounding.
At each decoding step:
- Both guided and unguided logits are computed.
- Token, attention, and logits reweighting are applied as above.
- Next-token selection proceeds via greedy search or beam search over the fused logits.
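The per-step procedure above can be tied together in a toy greedy loop. The two stub "branches" below are fixed logit tables standing in for guided and unguided forward passes; the stub behavior, $\lambda = 0.7$, and the EOS convention are all assumptions for the example.

```python
import numpy as np

def log_softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def greedy_decode(step_guided, step_unguided, max_steps: int = 5,
                  lam: float = 0.7, eos: int = 0) -> list:
    """Greedy search over fused guided/unguided log-probabilities.

    step_guided / step_unguided map a partial sequence to next-token logits,
    standing in for the two branch forward passes.
    """
    seq = []
    for _ in range(max_steps):
        lg = log_softmax(step_guided(seq))    # guided branch
        lu = log_softmax(step_unguided(seq))  # unguided branch
        tok = int(np.argmax(lam * lg + (1 - lam) * lu))
        if tok == eos:
            break
        seq.append(tok)
    return seq

# Stub branches over a 3-token vocabulary (token 0 = EOS):
# guided favors token 2 first, unguided favors token 1; both then emit EOS.
def guided(seq):
    return np.array([0.0, 1.0, 3.0]) if not seq else np.array([5.0, 0.0, 0.0])

def unguided(seq):
    return np.array([0.0, 3.0, 1.0]) if not seq else np.array([5.0, 0.0, 0.0])

decoded = greedy_decode(guided, unguided)
```

With $\lambda = 0.7$ the fused scores side with the guided branch, so decoding emits token 2 and then stops; beam search would replace the argmax with a top-$k$ frontier over the same fused scores.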
4. Empirical Performance and Ablation
ARCD was evaluated across chest X-ray (MIMIC-Ext-VQA), abdominal CT and brain MRI (SLAKE), and ocular ultrasound (OBScan) datasets, in both zero-shot (PubMedVision-only) and task-specific fine-tuning settings. Principal metrics include closed-question accuracy and open-question token-recall.
| Dataset / Setting | Baseline | VCD | DoLa | OPERA | ARCD (ours) | Δ vs. Baseline |
|---|---|---|---|---|---|---|
| MIMIC-Ext-VQA (ZS) | 47.24 | 48.43 | 47.24 | 50.00 | 50.79 | +3.55 |
| SLAKE (ZS) | 50.00 | 50.79 | 53.54 | 50.79 | 55.11 | +5.11 |
| OBScan (ZS) | 61.81 | 62.99 | 61.02 | 62.06 | 65.75 | +3.94 |
| MIMIC (FT) | 69.29 | 69.69 | 72.05 | 71.65 | 77.95 | +8.66 |
| SLAKE (FT) | 82.28 | 81.10 | 81.89 | 82.28 | 83.07 | +0.79 |
| OBScan (FT) | 89.76 | 89.76 | 89.37 | 87.40 | 92.13 | +2.37 |
GPT-4o-based hallucination assessment for Med-Phi3.5V:
- Zero-shot: “Correct” ↑ from 42%→56%, Hallucination/Factual Contradiction ↓ from 28%→14%.
- Fine-tuned: “Correct” ↑ from 68%→78%, Hallucination ↓ from 16%→7%.
Ablation studies on SLAKE revealed that each reweighting tier (token-only, attention-only, and logits-only) contributes incrementally, with all three tiers combined yielding the largest zero-shot accuracy gain relative to baseline.
5. Practical Considerations and Limitations
Computational Overhead
ARCD introduces additional inference latency relative to standard greedy decoding, since each step requires two forward passes (one per branch) plus the three reweighting operations.
Reliance on Mask Quality
The method is contingent upon the availability of accurate ROI masks. When employing noisy PSPNet-generated lung masks (segmentation quality of only ≈37% at a 0.5 overlap threshold) on MIMIC, ARCD still delivered a VQA accuracy gain, indicating some robustness to noisy masks. However, severely incorrect masks may misdirect the model, reducing grounding efficacy.
Potential Extensions
Possible modifications include:
- Learnable token scoring functions for adaptive token weighting.
- Soft masks reflecting uncertainty or probabilistic region definitions.
- Multi-region steering for simultaneous focus on multiple anatomical areas.
- Integration with retrieval-augmented or preference optimization frameworks to further suppress hallucination.
6. Significance and Broader Implications
ARCD constitutes a modular, training-free, and region-specific solution to the problem of hallucination in MedVLMs. Its architecture leverages the natural alignment between anatomical segmentation and medical reporting tasks to reinforce factuality and region-groundedness. By complementing training-free global interventions with precise, spatially localized guidance, ARCD achieves marked improvements (up to +8.66% accuracy in fine-tuned settings) in both diagnostic accuracy and factual correctness relative to prior baselines. This approach opens avenues for robust clinical deployment of MedVLMs where reliable visual grounding is paramount (Liang et al., 19 Dec 2025).