SGRS: Saliency-Guided Rejection Sampling
- The paper introduces SGRS, a novel method that leverages per-token gradient-attention saliency to dynamically reject ungrounded tokens in LVLMs.
- It computes a saliency score by fusing attention weights with input gradients and applies adaptive thresholding to filter out hallucinations.
- Empirical results demonstrate reduced hallucination rates and improved factual accuracy on benchmarks when SGRS is integrated in the decoding process.
Saliency-Guided Rejection Sampling (SGRS) is an inference-time filtering framework developed for large vision-language models (LVLMs) to mitigate hallucinations during autoregressive generation. SGRS leverages a per-token gradient–attention saliency metric to dynamically reject candidate tokens that are weakly grounded in the model’s recent context, thereby filtering out predictions with a heightened risk of factual incoherence or hallucination. The method is formulated within the LVLMs-Saliency framework, which quantifies the visual grounding strength of each output token by fusing self-attention weights with their input gradients (Zhang et al., 28 Jan 2026).
1. Saliency Score Definition and Computation
At each autoregressive decoding step $t$ in a pretrained LVLM, SGRS computes a scalar saliency score $s_t(v)$ for every candidate next token $v$. The process involves:
- Extraction of self-attention weight matrices $A^{(l,h)}$ for each layer $l$ and attention head $h$.
- Computation of the cross-entropy loss $\mathcal{L}(v)$ for a one-hot label vector corresponding to token $v$ over the softmax logits.
- Backpropagation of this loss to obtain gradients $\nabla_{A^{(l,h)}} \mathcal{L}(v)$.
- Construction of saliency matrices as $S^{(l,h)} = M \odot \big(A^{(l,h)} \odot \nabla_{A^{(l,h)}} \mathcal{L}(v)\big)$, where $\odot$ denotes the Hadamard product and $M$ applies causal masking, so only previous positions can contribute.
- Aggregation and normalization across heads to yield $\bar{S}^{(l)}$ for each layer.
- Pooling over a set of target layers $\mathcal{L}_T$ (typically the middle-to-deep layers) and all previous output positions to yield a scalar saliency score per candidate:

$$s_t(v) = \frac{1}{|\mathcal{L}_T|\,(t-1)} \sum_{l \in \mathcal{L}_T} \sum_{j < t} \bar{S}^{(l)}_{t,j}$$
This saliency score quantifies the extent to which the next token is grounded in preceding outputs via the current model gradients.
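As a concrete sketch, the attention–gradient fusion described above can be expressed in a few lines of NumPy. The array shapes, the mean-over-heads aggregation, and the function name are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def saliency_score(attn, grads, target_layers, t):
    """Fuse attention weights with their gradients into a scalar saliency
    score for the token at position t.

    attn, grads: arrays of shape (layers, heads, seq, seq), where grads holds
    the backpropagated loss gradients w.r.t. the attention weights.
    """
    seq = attn.shape[-1]
    # Strictly-causal mask: only positions before t may contribute.
    causal = np.tril(np.ones((seq, seq)), k=-1)
    fused = attn * grads * causal          # Hadamard product, then masking
    per_layer = fused.mean(axis=1)         # aggregate over attention heads
    # Pool over the chosen layers and all previous output positions.
    return float(per_layer[target_layers, t, :t].mean())
```

A candidate whose attention mass and gradient signal align on prior positions receives a high score; a candidate whose saliency map collapses scores near zero.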
2. Context-Adaptive Thresholding
Instead of a static threshold, SGRS determines an adaptive acceptance criterion for token grounding at each step. For decoding position $t$, the acceptance threshold is given by:

$$\tau_t = \lambda \cdot \frac{1}{|W_t|} \sum_{i \in W_t} s_i$$

where $W_t$ is a window covering the $w$ most recent output tokens and $\lambda$ is a scaling factor. This context-adaptive threshold reflects recent model behavior and requires the saliency of any new candidate to exceed a fraction of the local average saliency history.
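A minimal sketch of this windowed threshold, folding in the minimal acceptance floor mentioned later as a stabilization strategy; the parameter names (`lam`, `floor`) and default values are assumptions taken from the hyperparameter table:

```python
from collections import deque

def adaptive_threshold(history, lam=0.6, floor=0.05):
    """Scale the mean saliency over the recent window by lam,
    never dropping below the minimal acceptance floor."""
    if not history:
        return floor                      # no history yet at the first steps
    return max(lam * sum(history) / len(history), floor)

# Usage: a bounded deque naturally implements the sliding window.
recent = deque(maxlen=5)                  # saliency history window w = 5
```

Each accepted token's saliency is appended to `recent`, so the threshold tracks the model's own recent grounding behavior rather than a fixed global constant.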
3. Rejection Sampling Procedure
The rejection sampling procedure of SGRS operates as follows:
- Compute logits and select the top-$K$ most probable candidate tokens $\mathcal{C}_t$.
- For up to $R$ trials, sample a candidate $v$ from $\mathcal{C}_t$ proportionally to its softmax probability. Compute the candidate’s saliency score $s_t(v)$ and compare it to the context threshold $\tau_t$.
- If $s_t(v) \geq \tau_t$, accept $v$ as the next output token $y_t$. Otherwise, remove $v$ from $\mathcal{C}_t$ and repeat.
- If no token is accepted after $R$ trials, a fallback token is selected (fallback mode).
- Emit $y_t$ and increment $t$.
The method includes stabilization strategies such as exponential moving average smoothing of the threshold $\tau_t$, a minimal acceptance floor to prevent over-rejection, and a bound on rejection trials to avoid inference stagnation (Zhang et al., 28 Jan 2026).
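The loop above can be sketched as follows. The saliency function is passed in as a callable, and the fallback rule (returning the most salient rejected candidate) is an assumption, since the paper's exact fallback choice is not reproduced here:

```python
import numpy as np

def sgrs_step(probs, saliency_fn, threshold, k=20, max_trials=5, rng=None):
    """One SGRS decoding step: sample among the top-k candidates, rejecting
    any whose saliency falls below the context threshold.

    probs: next-token distribution (1-D array).
    saliency_fn: token_id -> scalar saliency score.
    """
    rng = rng or np.random.default_rng()
    candidates = list(np.argsort(probs)[::-1][:k])   # top-k by probability
    fallback, best_sal = candidates[0], -np.inf
    for _ in range(max_trials):
        if not candidates:
            break
        p = np.array([probs[c] for c in candidates])
        v = candidates[rng.choice(len(candidates), p=p / p.sum())]
        s = saliency_fn(v)
        if s >= threshold:
            return v                                  # grounded: accept
        if s > best_sal:                              # track best rejected token
            fallback, best_sal = v, s
        candidates.remove(v)                          # rejected: resample
    return fallback                                   # bounded-trial fallback
```

Because trials are bounded and rejected candidates are removed from the pool, the step always terminates, which is the "inference stagnation" safeguard described above.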
| Hyperparameter | Default Value | Purpose |
|---|---|---|
| $K$ | 20 | Top-K sampling size |
| $R$ | 5 | Max rejection trials |
| $w$ | 5 | Saliency history window |
| $\lambda$ | 0.6 | Threshold scaling factor |
| $\mathcal{L}_T$ | Middle/deep layers | Target layers for saliency pooling |
| $\tau_{\min}$ | 0.05 | Minimal acceptance floor |
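For reference, the defaults above can be gathered into a single configuration mapping; the key names here are illustrative, not the paper's:

```python
# Default SGRS hyperparameters from the table above (key names are assumed).
SGRS_DEFAULTS = {
    "top_k": 20,        # candidate pool size per decoding step
    "max_trials": 5,    # bound on rejection attempts
    "window": 5,        # saliency history window for the adaptive threshold
    "lambda": 0.6,      # threshold scaling factor
    "tau_min": 0.05,    # minimal acceptance floor
}
```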
4. Theoretical and Empirical Motivation
Empirical analysis demonstrates a pronounced relationship between saliency scores and output factuality. Specifically:
- Mean saliency of correct tokens is approximately $0.66$, versus approximately $0.35$ for hallucinated tokens, across LLaVA-1.5-7B, Qwen2-VL-7B, and Intern-VL-7B.
- Hallucination probability decreases monotonically with increasing token saliency.
- Artificially suppressing the saliency signal elevates the hallucination rate (Zhang et al., 28 Jan 2026).
While no formal theorem asserts that low saliency implies hallucination with probability 1, the negative correlation between token saliency and hallucination probability provides an actionable operational filter.
5. Practical Implementation and Considerations
SGRS requires one backward pass per candidate per decoding step, increasing per-token latency by 30–40% compared to greedy decoding. For efficiency, SGRS can be applied selectively—restricted to tokens with high factuality risk—or combined with Local Coherence Reinforcement (LocoRE) to limit computational overhead (Zhang et al., 28 Jan 2026). The softmax sampling temperature is kept at 1.0 throughout.
Variants include:
- Smoothing via exponential moving average,
- Setting a floor for acceptance,
- Bounded rejection trials,
- Fallback mode on rejection exhaustion.
Practical hyperparameter values were tuned using held-out splits of the CHAIR and POPE hallucination detection benchmarks.
6. Experimental Results and Comparative Effectiveness
SGRS, combined with LocoRE, was assessed on standard benchmarks against standard top-K and nucleus sampling and eight additional plug-and-play hallucination mitigation techniques. For LLaVA-1.5-7B:
- CHAIR hallucination rate: substantially lower for SGRS+LocoRE than for the baseline.
- POPE-F1 score: higher for SGRS+LocoRE than for the baseline.
Comparable improvements were obtained across Qwen2-VL and Intern-VL families. On the MME general benchmark, SGRS+LocoRE improved the “Existence” and “Position” subtasks by 5–7 points compared to greedy sampling. Qualitative examples show SGRS’s ability to reject visually ungrounded (hallucinatory) tokens, such as “blue,” when saliency maps collapse, retaining candidates (“gray,” “watch”) exhibiting strong context dependency (Zhang et al., 28 Jan 2026).
7. Significance and Broader Implications
SGRS offers an interpretable, gradient-linked mechanism for online filtering of weakly grounded tokens in LVLMs, reducing hallucination while preserving fluency and downstream task performance. Its granularity and context sensitivity derive from leveraging intrinsic model dynamics—attention and gradient propagation—rather than relying solely on heuristics or external post-hoc filters. Implementation remains computationally more intensive than greedy decoding, but the method remains practical given its substantial factuality gains, particularly under controlled token selection or in conjunction with LocoRE. This suggests avenues for future research in combining gradient-aware and structural saliency signals for robust LVLM decoding (Zhang et al., 28 Jan 2026).