Differential Visual Reasoning Policy
- DVRP is a reinforcement learning framework for multimodal models that couples visual perception with reasoning through intrinsic supervision without external annotations.
- It employs three visual views—original, masked, and noisy—to compute KL divergences, enforcing sensitivity to semantic visual changes and robustness to non-semantic noise.
- Empirical evaluations show DVRP significantly improves visual reasoning accuracy in mathematical and medical tasks compared to traditional RL with verifiable rewards.
The Differential Visual Reasoning Policy (DVRP) is a reinforcement learning (RL) framework for multimodal LLMs (MLLMs) that enforces visual grounding by incentivizing the model's reasoning to diverge in response to semantic changes in visual evidence. DVRP addresses a key limitation of existing RL-with-verifiable-rewards (RLVR) paradigms in multimodal settings, namely the decoupling of perception and reasoning, which often yields models that perform well while ignoring visual content and relying solely on linguistic priors ("blind reasoners"). DVRP introduces intrinsic supervision through visual triplet transformations and a composite loss designed to maximize visual sensitivity and minimize non-semantic sensitivity, leading to more faithful visual reasoning without requiring external annotations or auxiliary tools (Gao et al., 11 Jan 2026).
1. Formal Structure and Mathematical Foundations
Let each data instance consist of a visual input (image) $I$ and a textual query $q$. The model's reasoning policy $\pi_\theta$, parameterized by $\theta$, emits an output sequence $o$ (chain-of-thought plus final answer) conditioned on both modalities. DVRP constructs three distinct "views" of each input image:
- Original (Invariant) View: $I$, with policy $\pi_\theta(\cdot \mid I, q)$.
- Decremental (Masked) View: $I_{\text{mask}}$, obtained via random patch occlusion, with policy $\pi_\theta(\cdot \mid I_{\text{mask}}, q)$.
- Incremental (Noisy) View: $I_{\text{noise}}$, created by injecting diffusion-based noise, with policy $\pi_\theta(\cdot \mid I_{\text{noise}}, q)$.
The autoregressive policy factorizes as:

$$\pi_\theta(o \mid I, q) = \prod_{t=1}^{|o|} \pi_\theta(o_t \mid o_{<t}, I, q).$$
The crux of DVRP is to modulate RLVR training by measuring and regularizing KL divergences across these visual views at each autoregressive step.
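The triplet construction can be sketched concretely. The following is a minimal, non-authoritative NumPy illustration, assuming grid-wise patch occlusion and a DDPM-style forward-noising step with a linear β schedule; the exact masking granularity and noise schedule are implementation details not specified here.

```python
import numpy as np

def random_patch_mask(image, p_mask=0.6, patch=14, rng=None):
    """Occlude each non-overlapping patch with probability p_mask (set to zero)."""
    if rng is None:
        rng = np.random.default_rng()
    out = image.copy()
    h, w = image.shape[:2]
    for y in range(0, h, patch):
        for x in range(0, w, patch):
            if rng.random() < p_mask:
                out[y:y + patch, x:x + patch] = 0.0
    return out

def diffusion_noise(image, t_init=500, total_steps=1000, rng=None):
    """DDPM-style forward noising q(x_t | x_0) with a linear beta schedule."""
    if rng is None:
        rng = np.random.default_rng()
    betas = np.linspace(1e-4, 0.02, total_steps)
    alpha_bar = np.cumprod(1.0 - betas)[t_init - 1]
    eps = rng.standard_normal(image.shape)
    return np.sqrt(alpha_bar) * image + np.sqrt(1.0 - alpha_bar) * eps

img = np.random.default_rng(0).random((224, 224, 3))   # toy image in [0, 1)
views = {
    "orig": img,                                   # invariant view
    "masked": random_patch_mask(img, p_mask=0.6),  # decremental view
    "noisy": diffusion_noise(img, t_init=500),     # incremental view
}
```

The decremental view deletes semantic content (occluded patches carry no information), while the incremental view perturbs appearance without deleting content, which is exactly the asymmetry the two KL terms exploit.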
2. Objective Formulation and Loss Components
DVRP introduces two regularization terms into any base RLVR/PPO-style objective:
- Visual Sensitivity ($\mathcal{L}_{\text{sens}}$): Maximizes divergence between outputs on the original and masked views, i.e.,

$$\mathcal{L}_{\text{sens}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I_{\text{mask}}, q)\big).$$

This penalizes models for failing to alter their reasoning in response to salient visual deletions.
- Visual Robustness ($\mathcal{L}_{\text{rob}}$): Minimizes divergence between outputs on the original and noisy views, i.e.,

$$\mathcal{L}_{\text{rob}} = \mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I_{\text{noise}}, q)\big).$$

This enforces stability to non-semantic, distribution-preserving perturbations.
KL divergence is computed at every generation step, summed over output tokens and averaged over the $G$ rollouts:

$$\mathrm{KL}\big(\pi_\theta(\cdot \mid I, q) \,\|\, \pi_\theta(\cdot \mid I', q)\big) = \frac{1}{G} \sum_{i=1}^{G} \sum_{t=1}^{|o_i|} \mathrm{KL}\big(\pi_\theta(\cdot \mid o_{i,<t}, I, q) \,\|\, \pi_\theta(\cdot \mid o_{i,<t}, I', q)\big), \quad I' \in \{I_{\text{mask}}, I_{\text{noise}}\}.$$

The composite DVRP objective (to maximize under RL) is:

$$\mathcal{J}_{\text{DVRP}} = \mathcal{L}_{\text{RL}} + \lambda_{\text{sens}} \mathcal{L}_{\text{sens}} - \lambda_{\text{rob}} \mathcal{L}_{\text{rob}} + \lambda_{\text{ent}} \big(\mathcal{H}_{\text{mask}} + \mathcal{H}_{\text{noise}}\big),$$

where the entropy regularization ($\mathcal{H}_{\text{mask}}$, $\mathcal{H}_{\text{noise}}$, computed on the masked- and noisy-view policies) prevents distribution collapse.
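The per-token KL terms and the composite objective can be sketched numerically. This is a toy NumPy illustration with random logits: the shapes, vocabulary size, and default weights are illustrative stand-ins, not the paper's implementation.

```python
import numpy as np

def log_softmax(logits):
    z = logits - logits.max(-1, keepdims=True)
    return z - np.log(np.exp(z).sum(-1, keepdims=True))

def token_kl(logits_p, logits_q):
    """Per-token KL(p || q) from logits over the vocabulary; returns shape [..., T]."""
    log_p, log_q = log_softmax(logits_p), log_softmax(logits_q)
    return (np.exp(log_p) * (log_p - log_q)).sum(-1)

def entropy(logits):
    """Mean per-rollout entropy, summed over tokens."""
    log_p = log_softmax(logits)
    return -(np.exp(log_p) * log_p).sum(-1).sum(-1).mean()

def dvrp_objective(lp_orig, lp_mask, lp_noise, l_rl,
                   lam_sens=0.01, lam_rob=0.01, lam_ent=0.05):
    """Composite objective (to maximize): RL term, plus sensitivity KL,
    minus robustness KL, plus entropy on the perturbed-view policies.
    lp_* are [G, T, V] logits for the same G rollouts scored under each view."""
    l_sens = token_kl(lp_orig, lp_mask).sum(-1).mean()   # encourage divergence
    l_rob = token_kl(lp_orig, lp_noise).sum(-1).mean()   # discourage divergence
    h = entropy(lp_mask) + entropy(lp_noise)
    return l_rl + lam_sens * l_sens - lam_rob * l_rob + lam_ent * h

rng = np.random.default_rng(0)
G, T, V = 5, 8, 32  # rollouts, tokens, vocab (toy sizes)
lp = rng.standard_normal((G, T, V))
J_same = dvrp_objective(lp, lp, lp, l_rl=1.0)  # identical views: both KL terms vanish
J_div = dvrp_objective(lp, rng.standard_normal((G, T, V)), lp, l_rl=1.0)
```

Note that the same rollouts are scored under each view, so the KL terms compare next-token distributions at identical prefixes, matching the shared-trajectory computation described in Section 3.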
3. Training Algorithm and Implementation Details
DVRP extends GRPO or DAPO-style PPO frameworks via intrinsic “delta” regularization, using three images per example and corresponding rollout families. The end-to-end update is efficiently computed using shared trajectories for KL computations. The high-level algorithm is:
```
Initialize θ ← θ₀ (pretrained MLLM)
for epoch = 1 … N do
  for batch in DataLoader(D) do
    {(I, q, a)} ← batch
    I_mask  ← random_patch_mask(I; p_mask)
    I_noise ← diffusion_noise(I; T_init, schedule)
    {o_i}, logπ_old ← Rollout(πθ, I, q, G)
    R_i ← reward(o_i, a)                -- e.g. accuracy
    Â_i ← normalize(R_i)
    L_RL   ← ClipSurrogateLoss({o_i, Â_i}, logπ_old, πθ)
    L_sens ← KL(πθ(o_i | I, q) || πθ(o_i | I_mask, q))
    L_rob  ← KL(πθ(o_i | I, q) || πθ(o_i | I_noise, q))
    H_mask  ← Entropy(πθ(· | I_mask, q))
    H_noise ← Entropy(πθ(· | I_noise, q))
    -- descent loss: negate the terms to be maximized (L_RL, L_sens, entropy)
    L_total ← -L_RL - λ_sens · L_sens + λ_rob · L_rob - λ_ent · (H_mask + H_noise)
    θ ← θ - η ∇_θ L_total
  end
end
```
Network Architecture:
- Vision encoder: CLIP-ViT (patch size 14, output dim 1024)
- Text encoder/decoder: Qwen2.5-VL (3B or 7B), with LoRA adapters on cross-attention
- Fusion: Visual tokens prepended to Transformer layers with causal decoding
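The fusion step can be sketched shape-wise. The following NumPy snippet only illustrates prepending projected visual tokens to the text sequence before causal decoding; the LM width, projection, and query length are assumed values, not the actual Qwen2.5-VL configuration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patches = (224 // 14) ** 2                  # 256 visual tokens from a 224x224 image
d_vis, d_model = 1024, 2048                   # d_model is an assumed LM width
visual_tokens = rng.standard_normal((n_patches, d_vis))
W_proj = rng.standard_normal((d_vis, d_model)) * 0.01  # stand-in projection
text_tokens = rng.standard_normal((12, d_model))       # embedded query tokens
# Visual tokens are prepended, then the fused sequence is decoded causally.
sequence = np.concatenate([visual_tokens @ W_proj, text_tokens], axis=0)
L = sequence.shape[0]
causal_mask = np.tril(np.ones((L, L), dtype=bool))     # token t attends to ≤ t
```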
Hyperparameters (as reported):
| Hyperparameter | Math | Medical |
|---|---|---|
| Learning rate (η) | 1e-6 | 1e-6 |
| Batch size | 128 | 128 |
| Rollouts per example (G) | 5 | 5 |
| Sensitivity weight (λ_sens) | 0.01 | 0.01 |
| Robustness weight (λ_rob) | 0.01 | 0.01 |
| Entropy weight (λ_ent) | 0.05 | 0.05 |
| Patch mask probability (p_mask) | 0.6 | 0.2 |
| Noise steps (T_init) | 500 | 100 |
| Noise schedule (schedule) | 10 | 10 |
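The reported settings can be captured in a small configuration mapping (key names are illustrative; values are those in the table above):

```python
DVRP_CONFIG = {
    "math": dict(lr=1e-6, batch_size=128, rollouts_G=5,
                 lam_sens=0.01, lam_rob=0.01, lam_ent=0.05,
                 p_mask=0.6, t_init=500, noise_schedule=10),
    "medical": dict(lr=1e-6, batch_size=128, rollouts_G=5,
                    lam_sens=0.01, lam_rob=0.01, lam_ent=0.05,
                    p_mask=0.2, t_init=100, noise_schedule=10),
}
# Only the perturbation strengths differ across domains: medical images
# receive lighter masking and fewer noise steps.
```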
4. Empirical Performance and Benchmarks
DVRP was assessed against RLVR baselines on six mathematical reasoning datasets (Geo3k, MathVista, WeMath, MVerse, MVerse-V, MMKI2) and four medical VQA datasets (Slake, PathVQA, RadVQA, PMC-VQA), using both 3B and 7B parameter Qwen2.5-VL backbones.
| Model | Math+General Acc. (%) | Rel. Δ vs. Base | Medical Acc. (%) | Rel. Δ vs. Base |
|---|---|---|---|---|
| Base (7B) | 43.0 | — | 59.5 | — |
| GRPO (7B) | 62.0 | +44.2% | 70.4 | +18.3% |
| DVRP(7B) | 65.2 | +51.6% | 76.4 | +28.4% |
| Base (3B) | 29.8 | — | 48.8 | — |
| DAPO (3B) | 54.7 | +83.6% | 67.8 | +39.2% |
| DVRP(3B) | 55.7 | +86.9% | 73.3 | +50.5% |
Ablation studies (7B, GRPO variant):
| Setting | Overall Acc. (%) |
|---|---|
| GRPO | 65.4 |
| + Sensitivity only | 68.1 |
| + Robustness only | 67.0 |
| Full DVRP | 69.7 |
DVRP produces substantial absolute and relative gains across both domains and both backbone sizes. Blind-input experiments demonstrate that existing RLVR models often retain or even exceed baseline accuracy when given blanked images (up to ~94% retention), whereas DVRP-trained models collapse under masking yet remain robust to noise perturbations, directly evidencing enforced visual dependency.
5. Mechanisms Preventing Blind Reasoning
Blind reasoning is driven by the model's exploitation of linguistic cues in the query, bypassing the visual channel when rewards can be achieved by default priors. DVRP counteracts this by structurally penalizing output invariance under semantic deletion (masking) and penalizing excessive variance under non-semantic perturbation (diffusion noise). The sensitivity penalty ensures that any attempt by the policy to ignore visual features—generating identical or plausible responses absent visual content—incurs consistent learning costs. Consequently, the model is compelled to attend to actual image evidence for successful optimization.
This methodology generates self-supervised "Δ" signals: the explicit difference between reasoning traces under controlled visual changes, recoupling perception with reasoning and thereby enforcing authentic multimodal alignment.
6. Qualitative Assessment and Example Cases
In the baseline RLVR models (e.g., GRPO, DAPO), accuracy in visual reasoning tasks remains stable or may even increase under blank-image scenarios, indicating an overreliance on linguistic heuristics. In contrast, DVRP-trained models display a significant accuracy drop when critical visual content is masked but remain robust under visual noise; this dichotomy is interpreted in the work as evidence that DVRP forces reliance on identifiable visual features.
Example rollout analyses illuminate failures in baseline models, such as “hallucinating numbers” or exploiting common question templates when provided with no image, versus DVRP models that reference precise visual cues—such as color counts, geometric object properties, or graph topologies—even in complex mathematical or medical contexts (Gao et al., 11 Jan 2026).
7. Implications and Considerations
DVRP is a lightweight, end-to-end approach requiring no additional ground-truth labels, external annotation, or explicit tool integration. By introducing self-supervised, per-instance visual delta supervision, DVRP represents a principled strategy for aligning multimodal RL agents with true perceptual reasoning rather than spurious language-based shortcuts. A plausible implication is that DVRP-like regularization may generalize to other modalities (e.g., audio, structured signals) or more complex perception–reasoning couplings.
Recent evidence also indicates that eliminating blind reasoning is nontrivial; typical RLVR approaches may reliably generate plausible outputs in the absence of informative perception, raising concerns about evaluation protocols in multimodal reasoning. Approaches that explicitly enforce perception-reasoning coupling, such as DVRP, offer a framework for more faithful and reliable multimodal intelligence (Gao et al., 11 Jan 2026).