Differential Visual Reasoning Policy
- DVRP is a reinforcement learning framework for multimodal models that couples visual perception with reasoning through intrinsic supervision without external annotations.
- It employs three visual views—original, masked, and noisy—to compute KL divergences, enforcing sensitivity to semantic visual changes and robustness to non-semantic noise.
- Empirical evaluations show DVRP significantly improves visual reasoning accuracy in mathematical and medical tasks compared to traditional RL with verifiable rewards.
The Differential Visual Reasoning Policy (DVRP) is a reinforcement learning (RL) framework for multimodal LLMs (MLLMs) that enforces visual grounding by incentivizing the model's reasoning to diverge in response to semantic changes in visual evidence. DVRP addresses a key limitation of existing RL-with-verifiable-rewards (RLVR) paradigms in multimodal settings, namely the decoupling of perception and reasoning, which often yields models that perform well while ignoring visual content and relying solely on linguistic priors ("blind reasoners"). DVRP introduces intrinsic supervision through visual triplet transformations and a composite loss designed to maximize visual sensitivity and minimize non-semantic sensitivity, leading to more faithful visual reasoning without requiring external annotations or auxiliary tools (Gao et al., 11 Jan 2026).
1. Formal Structure and Mathematical Foundations
Let each data instance consist of a visual input (image) $I$ and a textual query $q$. The model's reasoning policy $\pi_\theta$, parameterized by $\theta$, emits an output sequence $o$ (chain-of-thought plus final answer) conditioned on both modalities. DVRP constructs three distinct "views" of each input image:
- Original (Invariant) View: $I$, with policy $\pi_\theta(\cdot \mid I, q)$.
- Decremental (Masked) View: $I_{\text{mask}}$, obtained via random patch occlusion, with policy $\pi_\theta(\cdot \mid I_{\text{mask}}, q)$.
- Incremental (Noisy) View: $I_{\text{noise}}$, created by injecting diffusion-based noise, with policy $\pi_\theta(\cdot \mid I_{\text{noise}}, q)$.
The autoregressive policy factorizes as:

$$\pi_\theta(o \mid I, q) = \prod_{t=1}^{|o|} \pi_\theta(o_t \mid o_{<t}, I, q).$$
The crux of DVRP is to modulate RLVR training by measuring and regularizing KL divergences across these visual views at each autoregressive step.
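The triplet construction can be sketched concretely. The following is a minimal, non-authoritative NumPy illustration, assuming grid-wise patch occlusion and a DDPM-style forward-noising step with a linear β schedule; the exact masking granularity and noise schedule are implementation details not specified here.

```python
import numpy as np

def random_patch_mask(image, p_mask=0.6, patch=14, rng=None):
    """Occlude each non-overlapping patch with probability p_mask (set to zero)."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() < p_mask:
                out[y:y + patch, x:x + patch] = 0.0
    return out

def diffusion_noise(image, t_init=500, total_steps=1000, rng=None):
    """DDPM-style forward noising q(x_t | x_0) with a linear beta schedule."""
    if rng is None:
        rng = np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, total_steps)
    alpha_bar = np.cumprod(1.0 - betas)[t_init - 1]
    eps = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps

img = np.random.default_rng(0).random((224, 224, 3))   # toy image in [0, 1)
views = {
    "orig": img,                                   # invariant view
    "masked": random_patch_mask(img, p_mask=0.6),  # decremental view
    "noisy": diffusion_noise(img, t_init=500),     # incremental view
}
```

The decremental view deletes semantic content (occluded patches carry no information), while the incremental view perturbs appearance without deleting content, which is exactly the asymmetry the two KL terms exploit.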
2. Objective Formulation and Loss Components
DVRP introduces two regularization terms into any base RLVR/PPO-style objective:
- Visual Sensitivity ($\mathcal{L}_{\text{sens}}$): Maximizes divergence between outputs on the original and masked views, i.e.,

$$\mathcal{L}_{\text{sens}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I_{\text{mask}}, q)\big).$$

This penalizes models for failing to alter their reasoning in response to salient visual deletions.
- Visual Robustness ($\mathcal{L}_{\text{rob}}$): Minimizes divergence between outputs on the original and noisy views, i.e.,

$$\mathcal{L}_{\text{rob}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I_{\text{noise}}, q)\big).$$

This enforces stability to non-semantic, distribution-preserving perturbations.
KL divergence is computed at every generation step, summed over output tokens and averaged over the $G$ rollouts:

$$\mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I', q)\big) = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \mathrm{KL}\big(\pi_\theta(\cdot \mid o_{i,<t}, I, q) \,\|\, \pi_\theta(\cdot \mid o_{i,<t}, I', q)\big), \quad I' \in \{I_{\text{mask}}, I_{\text{noise}}\}.$$

The composite DVRP objective (to maximize under RL) is:

$$\mathcal{J}_{\text{DVRP}} = \mathcal{L}_{\text{RL}} + \lambda_{\text{sens}} \mathcal{L}_{\text{sens}} - \lambda_{\text{rob}} \mathcal{L}_{\text{rob}} + \lambda_{\text{ent}} \big(\mathcal{H}_{\text{mask}} + \mathcal{H}_{\text{noise}}\big),$$

where the entropy regularization ($\mathcal{H}_{\text{mask}}$, $\mathcal{H}_{\text{noise}}$, computed on the masked- and noisy-view policies) prevents distribution collapse.
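The per-token KL terms and the composite objective can be sketched numerically. This is a toy NumPy illustration with random logits: the shapes, vocabulary size, and default weights are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

def token_kl(logits_p, logits_q):
    """Per-token KL(p || q) from logits over the vocabulary; returns shape [..., T]."""
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(log_p) * (log_p - log_q)).sum(-1)

def entropy(logits):
    """Mean per-rollout entropy, summed over tokens."""
    log_p = log_softmax(logits)
    return -(np.exp(log_p) * log_p).sum(-1).sum(-1).mean()

def dvrp_objective(lp_orig, lp_mask, lp_noise, l_rl,
                   lam_sens=0.01, lam_rob=0.01, lam_ent=0.05):
    """Composite objective (to maximize): RL term, plus sensitivity KL,
    minus robustness KL, plus entropy on the perturbed-view policies.
    lp_* are [G, T, V] logits for the same G rollouts scored under each view."""
    l_sens = token_kl(lp_orig, lp_mask).sum(-1).mean()   # encourage divergence
    l_rob = token_kl(lp_orig, lp_noise).sum(-1).mean()   # discourage divergence
    h = entropy(lp_mask) + entropy(lp_noise)
    return l_rl + lam_sens * l_sens - lam_rob * l_rob + lam_ent * h

rng = np.random.default_rng(0)
G, T, V = 5, 8, 32  # rollouts, tokens, vocab (toy sizes)
lp = rng.standard_normal((G, T, V))
J_same = dvrp_objective(lp, lp, lp, l_rl=1.0)  # identical views: both KL terms vanish
J_div = dvrp_objective(lp, rng.standard_normal((G, T, V)), lp, l_rl=1.0)
```

Note that the same rollouts are scored under each view, so the KL terms compare next-token distributions at identical prefixes, matching the shared-trajectory computation described in Section 3.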
3. Training Algorithm and Implementation Details
DVRP extends GRPO or DAPO-style PPO frameworks via intrinsic “delta” regularization, using three images per example and corresponding rollout families. The end-to-end update is efficiently computed using shared trajectories for KL computations. The high-level algorithm is:
```
Initialize θ ← θ₀ (pretrained MLLM)
for epoch = 1 … N do
  for batch in DataLoader(D) do
    {(I, q, a)} ← batch
    I_mask  ← random_patch_mask(I; p_mask)
    I_noise ← diffusion_noise(I; T_init, schedule)
    {o_i}, logπ_old ← Rollout(πθ, I, q, G)
    R_i ← reward(o_i, a)                -- e.g. accuracy
    Â_i ← normalize(R_i)
    L_RL   ← ClipSurrogateLoss({o_i, Â_i}, logπ_old, πθ)
    L_sens ← KL(πθ(o_i | I, q) || πθ(o_i | I_mask, q))
    L_rob  ← KL(πθ(o_i | I, q) || πθ(o_i | I_noise, q))
    H_mask  ← Entropy(πθ(· | I_mask, q))
    H_noise ← Entropy(πθ(· | I_noise, q))
    -- descent loss: negate the terms to be maximized (L_RL, L_sens, entropy)
    L_total ← -L_RL - λ_sens · L_sens + λ_rob · L_rob - λ_ent · (H_mask + H_noise)
    θ ← θ - η ∇_θ L_total
  end
end
```
Network Architecture:
- Vision encoder: CLIP-ViT (patch size 14, output dim 1024)
- Text encoder/decoder: Qwen2.5-VL (3B or 7B), with LoRA adapters on cross-attention
- Fusion: Visual tokens prepended to Transformer layers with causal decoding
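The fusion step can be sketched shape-wise. The following NumPy snippet only illustrates prepending projected visual tokens to the text sequence before causal decoding; the LM width, projection, and query length are assumed values, not the actual Qwen2.5-VL configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches = (224 // 14) ** 2                  # 256 visual tokens from a 224x224 image
d_vis, d_model = 1024, 2048                   # d_model is an assumed LM width
visual_tokens = rng.standard_normal((n_patches, d_vis))
W_proj = rng.standard_normal((d_vis, d_model)) * 0.01  # stand-in projection
text_tokens = rng.standard_normal((12, d_model))       # embedded query tokens
# Visual tokens are prepended, then the fused sequence is decoded causally.
sequence = np.concatenate([visual_tokens @ W_proj, text_tokens], axis=0)
L = sequence.shape[0]
causal_mask = np.tril(np.ones((L, L), dtype=bool))     # token t attends to ≤ t
```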
Hyperparameters (as reported):
| Hyperparameter | Math | Medical |
|---|---|---|
| Learning rate (η) | 1e-6 | 1e-6 |
| Batch size | 128 | 128 |
| Rollouts per example (G) | 5 | 5 |
| Sensitivity weight (λ_sens) | 0.01 | 0.01 |
| Robustness weight (λ_rob) | 0.01 | 0.01 |
| Entropy weight (λ_ent) | 0.05 | 0.05 |
| Patch mask probability (p_mask) | 0.6 | 0.2 |
| Noise steps (T_init) | 500 | 100 |
| Noise schedule (schedule) | 10 | 10 |
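The reported settings can be captured in a small configuration mapping (key names are illustrative; values are those in the table above):

```python
DVRP_CONFIG = {
    "math": dict(lr=1e-6, batch_size=128, rollouts_G=5,
                 lam_sens=0.01, lam_rob=0.01, lam_ent=0.05,
                 p_mask=0.6, t_init=500, noise_schedule=10),
    "medical": dict(lr=1e-6, batch_size=128, rollouts_G=5,
                    lam_sens=0.01, lam_rob=0.01, lam_ent=0.05,
                    p_mask=0.2, t_init=100, noise_schedule=10),
}
# Only the perturbation strengths differ across domains: medical images
# receive lighter masking and fewer noise steps.
```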
4. Empirical Performance and Benchmarks
DVRP was assessed against RLVR baselines on six mathematical reasoning datasets (Geo3k, MathVista, WeMath, MVerse, MVerse-V, MMKI2) and four medical VQA datasets (Slake, PathVQA, RadVQA, PMC-VQA), using both 3B and 7B parameter Qwen2.5-VL backbones.
| Model | Math+General Acc. (%) | Rel. Δ vs. Base | Medical Acc. (%) | Rel. Δ vs. Base |
|---|---|---|---|---|
| Base (7B) | 43.0 | — | 59.5 | — |
| GRPO (7B) | 62.0 | +44.2% | 70.4 | +18.3% |
| DVRP(7B) | 65.2 | +51.6% | 76.4 | +28.4% |
| Base (3B) | 29.8 | — | 48.8 | — |
| DAPO (3B) | 54.7 | +83.6% | 67.8 | +39.2% |
| DVRP(3B) | 55.7 | +86.9% | 73.3 | +50.5% |
Ablation studies (7B, GRPO variant):
| Setting | Overall Acc. (%) |
|---|---|
| GRPO | 65.4 |
| + Sensitivity only | 68.1 |
| + Robustness only | 67.0 |
| Full DVRP | 69.7 |
DVRP produces substantial absolute and relative gains across both domains and both backbone sizes. Blind-input experiments demonstrate that existing RLVR models often retain or even exceed baseline accuracy when given blanked images (up to ~94% retention), whereas DVRP-trained models collapse under masking yet remain robust to noise perturbations, directly evidencing enforced visual dependency.
5. Mechanisms Preventing Blind Reasoning
Blind reasoning is driven by the model's exploitation of linguistic cues in the query, bypassing the visual channel when rewards can be achieved by default priors. DVRP counteracts this by structurally penalizing output invariance under semantic deletion (masking) and penalizing excessive variance under non-semantic perturbation (diffusion noise). The sensitivity penalty ensures that any attempt by the policy to ignore visual features—generating identical or plausible responses absent visual content—incurs consistent learning costs. Consequently, the model is compelled to attend to actual image evidence for successful optimization.
This methodology generates self-supervised "Δ" signals: the explicit difference between reasoning traces under controlled visual changes, recoupling perception with reasoning and thereby enforcing authentic multimodal alignment.
6. Qualitative Assessment and Example Cases
In the baseline RLVR models (e.g., GRPO, DAPO), accuracy in visual reasoning tasks remains stable or may even increase under blank-image scenarios, indicating an overreliance on linguistic heuristics. In contrast, DVRP-trained models display a significant accuracy drop when critical visual content is masked but remain robust under visual noise; this dichotomy is interpreted in the work as evidence that DVRP forces reliance on identifiable visual features.
Example rollout analyses illuminate failures in baseline models, such as “hallucinating numbers” or exploiting common question templates when provided with no image, versus DVRP models that reference precise visual cues—such as color counts, geometric object properties, or graph topologies—even in complex mathematical or medical contexts (Gao et al., 11 Jan 2026).
7. Implications and Considerations
DVRP is a lightweight, end-to-end approach requiring no additional ground-truth labels, external annotation, or explicit tool integration. By introducing self-supervised, per-instance visual delta supervision, DVRP represents a principled strategy for aligning multimodal RL agents with true perceptual reasoning rather than spurious language-based shortcuts. A plausible implication is that DVRP-like regularization may generalize to other modalities (e.g., audio, structured signals) or more complex perception–reasoning couplings.
Recent evidence also indicates that eliminating blind reasoning is nontrivial; typical RLVR approaches may reliably generate plausible outputs in the absence of informative perception, raising concerns about evaluation protocols in multimodal reasoning. Approaches that explicitly enforce perception-reasoning coupling, such as DVRP, offer a framework for more faithful and reliable multimodal intelligence (Gao et al., 11 Jan 2026).