
Differential Visual Reasoning Policy

Updated 18 January 2026
  • DVRP is a reinforcement learning framework for multimodal models that couples visual perception with reasoning through intrinsic supervision without external annotations.
  • It employs three visual views—original, masked, and noisy—to compute KL divergences, enforcing sensitivity to semantic visual changes and robustness to non-semantic noise.
  • Empirical evaluations show DVRP significantly improves visual reasoning accuracy in mathematical and medical tasks compared to traditional RL with verifiable rewards.

The Differential Visual Reasoning Policy (DVRP) is a reinforcement learning (RL) framework for multimodal LLMs (MLLMs) that enforces visual grounding by incentivizing the model's reasoning to diverge when, and only when, the visual evidence changes semantically. DVRP addresses a key limitation of existing RL-with-verifiable-rewards (RLVR) paradigms in multimodal settings: the decoupling of perception and reasoning, which often yields models that perform well while ignoring visual content and relying solely on linguistic priors ("blind reasoners"). DVRP introduces intrinsic supervision through visual triplet transformations and a composite loss designed to maximize visual sensitivity and minimize non-semantic sensitivity, leading to more faithful visual reasoning without requiring external annotations or auxiliary tools (Gao et al., 11 Jan 2026).

1. Formal Structure and Mathematical Foundations

Let each data instance consist of a visual input $V$ (an image) and a textual query $q$. The model's reasoning policy $\pi_\theta$, parameterized by $\theta$, emits an output sequence $o$ (chain-of-thought plus final answer) conditioned on both modalities. DVRP constructs three distinct "views" of each input image:

  • Original (Invariant) View: $V = I$, with policy $\pi_\theta^{\rm Ori}(\cdot \mid I, q)$.
  • Decremental (Masked) View: $V_{\rm mask} = I_{\rm mask}$, obtained via random patch occlusion, with policy $\pi_\theta^{\rm Mask}(\cdot \mid I_{\rm mask}, q)$.
  • Incremental (Noisy) View: $V_{\rm noise} = I_{\rm noise}$, created by injecting diffusion-based noise, with policy $\pi_\theta^{\rm Noise}(\cdot \mid I_{\rm noise}, q)$.

The autoregressive policy factorizes as:

$$\pi_\theta(o \mid V, q) = \prod_{t=1}^{T} \pi_\theta(o_t \mid o_{<t}, V, q)$$

The crux of DVRP is to modulate RLVR training by measuring and regularizing KL divergences across these visual views at each autoregressive step.
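The triplet construction can be sketched as follows. This is a minimal NumPy illustration: the paper does not spell out the masking granularity or the exact noise schedule, so the `patch` size, the linear beta schedule, and the function names are assumptions.

```python
import numpy as np

def random_patch_mask(image, p_mask=0.6, patch=14, rng=None):
    """Decremental view: zero out random non-overlapping patches of a (C, H, W) image."""
    rng = rng if rng is not None else np.random.default_rng()
    c, h, w = image.shape
    masked = image.copy()
    for y in range(0, h - h % patch, patch):
        for x in range(0, w - w % patch, patch):
            if rng.random() < p_mask:
                masked[:, y:y + patch, x:x + patch] = 0.0
    return masked

def diffusion_noise(image, t_init=500, t_max=1000, rng=None):
    """Incremental view: forward-diffusion-style Gaussian corruption at step t_init."""
    rng = rng if rng is not None else np.random.default_rng()
    betas = np.linspace(1e-4, 2e-2, t_max)           # linear beta schedule (assumed)
    alpha_bar = np.cumprod(1.0 - betas)[t_init - 1]  # cumulative signal-retention coefficient
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * rng.standard_normal(image.shape)
```

Masking deletes semantic content while leaving the rest of the image intact; the diffusion-style noise perturbs every pixel but preserves the overall semantics, which is exactly the distinction the two regularizers exploit.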

2. Objective Formulation and Loss Components

DVRP introduces two regularization terms into any base RLVR/PPO-style objective:

  • Visual Sensitivity ($\mathcal{L}_{\rm sens}$): Maximizes divergence between outputs on the original and masked views, i.e.,

$$\mathcal{L}_{\rm sens} = \mathbb{D}\left( \pi_\theta^{\rm Ori} \,\|\, \pi_\theta^{\rm Mask} \right)$$

This penalizes models for failing to alter their reasoning in response to salient visual deletions.

  • Visual Robustness ($\mathcal{L}_{\rm rob}$): Minimizes divergence between outputs on the original and noisy views, i.e.,

$$\mathcal{L}_{\rm rob} = \mathbb{D}\left( \pi_\theta^{\rm Ori} \,\|\, \pi_\theta^{\rm Noise} \right)$$

This enforces stability to non-semantic, distribution-preserving perturbations.

KL divergence is computed at every generation step and summed over output tokens and rollouts:

$$D_{\mathrm{KL}}\left(\pi_\theta^{\rm Ori} \,\|\, \pi_\theta^{\rm Mask}\right) = \sum_{t=1}^{T} \sum_{o_t} p_t(o_t) \log \frac{p_t(o_t)}{q_t(o_t)}$$

where $p_t$ and $q_t$ denote the step-$t$ token distributions under the original and masked views, respectively.
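The divergence above is a sum of per-step categorical KLs over the same rollout decoded under two views. A minimal NumPy sketch (shapes and the function name are illustrative):

```python
import numpy as np

def stepwise_kl(p_logits, q_logits):
    """Sum over generation steps of KL(p_t || q_t).

    p_logits, q_logits: (T, V) arrays of next-token logits for the same
    rollout, decoded under the original and perturbed views respectively.
    """
    def log_softmax(z):
        z = z - z.max(axis=-1, keepdims=True)
        return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

    log_p = log_softmax(p_logits)
    log_q = log_softmax(q_logits)
    p = np.exp(log_p)
    # sum_t sum_{o_t} p_t(o_t) [log p_t(o_t) - log q_t(o_t)]
    return float((p * (log_p - log_q)).sum())
```

Because both views score the identical token sequence, only one set of rollouts is needed; the perturbed views are evaluated by teacher-forcing the original trajectory.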

The composite DVRP objective (to be maximized under RL) is:

$$\mathcal{J}_{\rm DVRP}(\theta) = \mathcal{J}_{\rm RLVR}(\theta) + \lambda_{\rm sens}\, D_{\rm KL}\!\left(\pi_\theta^{\rm Ori} \,\|\, \pi_\theta^{\rm Mask}\right) - \lambda_{\rm rob}\, D_{\rm KL}\!\left(\pi_\theta^{\rm Ori} \,\|\, \pi_\theta^{\rm Noise}\right) - \lambda_{\rm ent}\, \mathbb{E}\big[\mathcal{H}(\pi_\theta^{\rm Mask}) + \mathcal{H}(\pi_\theta^{\rm Noise})\big]$$

where the entropy regularization term (weight $\lambda_{\rm ent}$) prevents distribution collapse.
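In practice one minimizes the negated objective, which flips every sign. A pure-Python sketch of that convention (the function name and the treatment of the base term as a surrogate loss are assumptions):

```python
def dvrp_loss(l_rlvr, kl_mask, kl_noise, ent_mask, ent_noise,
              lam_sens=0.01, lam_rob=0.01, lam_ent=0.05):
    """Scalar loss to minimize: the negative of J_DVRP.

    l_rlvr is the base RLVR surrogate loss (i.e. the negative of J_RLVR),
    so every term's sign is flipped relative to the objective.
    """
    return (l_rlvr
            - lam_sens * kl_mask    # reward divergence under semantic masking
            + lam_rob * kl_noise    # penalize divergence under non-semantic noise
            + lam_ent * (ent_mask + ent_noise))  # entropy term, sign per the objective
```

The default weights mirror those reported in Section 3.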

3. Training Algorithm and Implementation Details

DVRP extends GRPO or DAPO-style PPO frameworks via intrinsic “delta” regularization, using three images per example and corresponding rollout families. The end-to-end update is efficiently computed using shared trajectories for KL computations. The high-level algorithm is:

Initialize θ ← θ₀ (pretrained MLLM)
for epoch = 1 … N do
  for batch in DataLoader(D) do
    {(I, q, a)} ← batch
    I_mask  ← random_patch_mask(I; p_mask)
    I_noise ← diffusion_noise(I; T_init, schedule)

    {o_i}, logπ_old ← Rollout(πθ, I, q, G)
    R_i ← reward(o_i, a)                  -- e.g. accuracy
    Â_i ← normalize(R_i)
    L_RL ← ClipSurrogateLoss({o_i, Â_i}, logπ_old, πθ)

    L_sens ← KL(πθ(o_i | I, q) || πθ(o_i | I_mask, q))
    L_rob  ← KL(πθ(o_i | I, q) || πθ(o_i | I_noise, q))
    H_mask  ← Entropy(πθ(· | I_mask, q))
    H_noise ← Entropy(πθ(· | I_noise, q))

    -- minimize the negated objective; L_RL plays the role of -J_RLVR
    L_total ← L_RL
             - λ_sens · L_sens
             + λ_rob  · L_rob
             + λ_ent  · (H_mask + H_noise)
    θ ← θ - η ∇_θ L_total
  end
end

Network Architecture:

  • Vision encoder: CLIP-ViT (patch size 14, output dim 1024)
  • Text encoder/decoder: Qwen2.5-VL (3B or 7B), with LoRA adapters on cross-attention
  • Fusion: Visual tokens prepended to the text token sequence; the Transformer decodes causally over the combined sequence

Hyperparameters (as reported):

| Hyperparameter | Math | Medical |
| --- | --- | --- |
| Learning rate ($\eta$) | 1e-6 | 1e-6 |
| Batch size | 128 | 128 |
| Rollouts per example ($G$) | 5 | 5 |
| Sensitivity weight ($\lambda_{\rm sens}$) | 0.01 | 0.01 |
| Robustness weight ($\lambda_{\rm rob}$) | 0.01 | 0.01 |
| Entropy weight ($\lambda_{\rm ent}$) | 0.05 | 0.05 |
| Patch mask probability ($p_{\rm mask}$) | 0.6 | 0.2 |
| Noise steps ($T_{\rm init}$) | 500 | 100 |
| Noise schedule ($\gamma$) | 10 | 10 |
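For reference, the reported values can be collected into a per-domain configuration; the dictionary layout and key names are illustrative, only the values come from the table above.

```python
# Reported DVRP hyperparameters, keyed by domain (layout is illustrative).
DVRP_CONFIGS = {
    "math": {
        "lr": 1e-6, "batch_size": 128, "rollouts_G": 5,
        "lam_sens": 0.01, "lam_rob": 0.01, "lam_ent": 0.05,
        "p_mask": 0.6, "t_init": 500, "noise_gamma": 10,
    },
    "medical": {
        "lr": 1e-6, "batch_size": 128, "rollouts_G": 5,
        "lam_sens": 0.01, "lam_rob": 0.01, "lam_ent": 0.05,
        "p_mask": 0.2, "t_init": 100, "noise_gamma": 10,
    },
}
```

Note that only the perturbation strengths differ across domains: medical images tolerate far less occlusion and noise than math diagrams before semantics are destroyed.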

4. Empirical Performance and Benchmarks

DVRP was assessed against RLVR baselines on six mathematical reasoning datasets (Geo3k, MathVista, WeMath, MVerse, MVerse-V, MMKI2) and four medical VQA datasets (Slake, PathVQA, RadVQA, PMC-VQA), using both 3B and 7B parameter Qwen2.5-VL backbones.

| Model | Math+General Acc. | Δ vs. Base | Medical Acc. | Δ vs. Base |
| --- | --- | --- | --- | --- |
| Base (7B) | 43.0 | | 59.5 | |
| GRPO (7B) | 62.0 | +44.2% | 70.4 | +18.3% |
| DVRP$_G$ (7B) | 65.2 | +51.6% | 76.4 | +28.4% |
| Base (3B) | 29.8 | | 48.8 | |
| DAPO (3B) | 54.7 | +83.6% | 67.8 | +39.2% |
| DVRP$_D$ (3B) | 55.7 | +86.9% | 73.3 | +50.5% |
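The Δ columns are relative gains over the corresponding base model, e.g. for DVRP_G (7B) on math+general: (65.2 − 43.0) / 43.0 ≈ +51.6%. As a quick check:

```python
def relative_gain(acc, base):
    """Percentage improvement over the base model (the table's Δ columns)."""
    return 100.0 * (acc - base) / base

print(round(relative_gain(65.2, 43.0), 1))  # 51.6 (DVRP_G 7B, math+general)
```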

Ablation studies (7B, GRPO variant):

| Setting | Overall Acc. |
| --- | --- |
| GRPO | 65.4 |
| + Sensitivity only | 68.1 |
| + Robustness only | 67.0 |
| Full DVRP | 69.7 |

DVRP produces substantial absolute and relative gains across both domains and architectures. Blind-input experiments demonstrate that existing RLVR models often retain or even exceed baseline accuracy when given blanked images (up to ~94% retention), whereas DVRP's accuracy collapses under masking yet remains robust to noise perturbations, directly evidencing enforced visual dependency.
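The blind-input comparison can be summarized by a retention ratio; the metric name and the example numbers below are illustrative, not taken from the paper.

```python
def blind_retention(acc_blank, acc_full):
    """Fraction of full-image accuracy kept when the image is blanked.

    Values near 1.0 indicate blind reasoning; a visually grounded model
    should collapse under blanking yet stay stable under noise.
    """
    return acc_blank / acc_full

# Hypothetical RLVR baseline retaining ~94% of its accuracy without images:
print(round(blind_retention(58.3, 62.0), 2))  # 0.94
```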

5. Mechanisms Preventing Blind Reasoning

Blind reasoning is driven by the model's exploitation of linguistic cues in the query, bypassing the visual channel when rewards can be achieved by default priors. DVRP counteracts this by structurally penalizing output invariance under semantic deletion (masking) and penalizing excessive variance under non-semantic perturbation (diffusion noise). The sensitivity penalty ensures that any attempt by the policy to ignore visual features—generating identical or plausible responses absent visual content—incurs consistent learning costs. Consequently, the model is compelled to attend to actual image evidence for successful optimization.

This methodology generates self-supervised “Δ” signals: the explicit delta between reasonings under controlled visual changes, recoupling perception with reasoning and thereby enforcing authentic multimodal alignment.

6. Qualitative Assessment and Example Cases

In the baseline RLVR models (e.g., GRPO, DAPO), accuracy in visual reasoning tasks remains stable or may even increase under blank-image scenarios, indicating an overreliance on linguistic heuristics. In contrast, DVRP-trained models display a significant accuracy drop when critical visual content is masked but remain robust under visual noise; this dichotomy is interpreted in the work as evidence that DVRP forces reliance on identifiable visual features.

Example rollout analyses illuminate failures in baseline models, such as “hallucinating numbers” or exploiting common question templates when provided with no image, versus DVRP models that reference precise visual cues—such as color counts, geometric object properties, or graph topologies—even in complex mathematical or medical contexts (Gao et al., 11 Jan 2026).

7. Implications and Considerations

DVRP is a lightweight, end-to-end approach requiring no additional ground-truth labels, external annotation, or explicit tool integration. By introducing self-supervised, per-instance visual delta supervision, DVRP represents a principled strategy for aligning multimodal RL agents with true perceptual reasoning rather than spurious language-based shortcuts. A plausible implication is that DVRP-like regularization may generalize to other modalities (e.g., audio, structured signals) or more complex perception–reasoning couplings.

Recent evidence also indicates that eliminating blind reasoning is nontrivial; typical RLVR approaches may reliably generate plausible outputs in the absence of informative perception, raising concerns about evaluation protocols in multimodal reasoning. Approaches that explicitly enforce perception-reasoning coupling, such as DVRP, offer a framework for more faithful and reliable multimodal intelligence (Gao et al., 11 Jan 2026).
