BiPS: Bidirectional Perceptual Shaping

Updated 29 December 2025
  • BiPS is a two-stage training method for vision–language models that employs dual image views—one retaining critical evidence and one ablating it—for enhanced visual reasoning.
  • It leverages attractive and repulsive KL divergence constraints to align model predictions with evidence-preserving views while discouraging shortcut learning.
  • The approach achieves significant accuracy gains on multiple VQA and chart understanding benchmarks by directly supervising the model with complementary positive and negative signals.

Bi-directional Perceptual Shaping (BiPS) is a two-stage training-time methodology for vision–language models (VLMs) that enforces reliance on question-relevant visual evidence. BiPS programmatically constructs two "views" of an image given a question–image pair: one that preserves only the evidence needed to answer the question and one that ablates the critical evidence so that the answer becomes incorrect. Using a combination of attractive and repulsive Kullback–Leibler (KL) divergence constraints, BiPS trains a VLM such as Qwen2.5-VL-7B to align its answer distribution with the evidence-preserving view and diverge from the ablated view, thereby mitigating shortcut learning and improving fine-grained visual reasoning. The approach demonstrates state-of-the-art results across multiple VQA and chart understanding benchmarks and generalizes well beyond synthetic charts to natural image VQA contexts (Zhang et al., 26 Dec 2025).

1. Conceptual Foundations and Rationale

BiPS addresses the limitations of prior VLM training protocols that utilize intermediate visual cues, such as latent visual tokens or externally injected signals. These earlier methods often fail to encourage the model to faithfully attend to fine-grained, question-critical elements of an image and may generalize poorly across domains or incur high inference costs. BiPS introduces bidirectional "where-to-look" constraints at training time by systematically generating two alternate masked views for each question–image pair:

  • Evidence-preserving view ($V_{\mathrm{pres}}$): Retains only the pixels needed to answer the question, masking out irrelevant content while preserving layout elements.
  • Evidence-ablated view ($V_{\mathrm{abl}}$): Removes or masks the decisive pixels, rendering the original answer unanswerable or incorrect while preserving the remainder of the figure.

These masked views enable complementary KL-based losses to directly supervise the model’s output distribution, providing both positive and negative shaping signals and mitigating text-only or shortcut strategies.

2. View Generation Pipeline

The generation of $V_{\mathrm{pres}}$ and $V_{\mathrm{abl}}$ is conducted through a structured pipeline in the context of chart data rendered via Matplotlib, Seaborn, Altair, or similar tools:

  1. Reformulation and Validation: Each open-ended chart question is rewritten as a multiple-choice question and validated by an auxiliary LLM (GPT5-mini), which also generates the correct answer.
  2. Difficulty Filtering: The base Qwen2.5-VL-7B-Instruct model is run for 8 rollouts; questions answered correctly every time are filtered out to ensure non-triviality.
  3. Programmatic Editing:
    • To produce $V_{\mathrm{pres}}$, non-essential code components (distractor series, subplot panels, annotations) are deleted by an LLM arbitrator while retaining chart structure (axes, legends, color assignments).
    • To generate $V_{\mathrm{abl}}$, the code that yields the decisive evidence (data arrays, threshold lines, scatter points) is removed or blanked, with other components retained.

This pipeline creates approximately 13,000 high-quality $(I, q, V_{\mathrm{pres}}, V_{\mathrm{abl}})$ tuples in the chart domain, facilitating robust evidence-based supervision.
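
As an illustration of the programmatic-editing step, the sketch below renders one hypothetical chart three times with Matplotlib: the original figure, an evidence-preserving variant with a distractor series deleted, and an evidence-ablated variant with the decisive series removed. The question, series names, and editing choices are invented for this example; in BiPS these edits are made by an LLM arbitrator operating on the chart's source code.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Hypothetical question: "In which quarter does Product A peak?"
quarters  = ["Q1", "Q2", "Q3", "Q4"]
product_a = [3, 7, 5, 4]   # decisive evidence for the answer ("Q2")
product_b = [2, 2, 6, 3]   # distractor series, irrelevant to the question

def render(path, include_a=True, include_b=True):
    """Render the chart, optionally dropping series (the code-level 'masking')."""
    fig, ax = plt.subplots()
    if include_a:
        ax.plot(quarters, product_a, marker="o", label="Product A")
    if include_b:
        ax.plot(quarters, product_b, marker="s", label="Product B")
    ax.set_xlabel("Quarter")
    ax.set_ylabel("Sales")
    ax.legend()  # layout elements (axes, legend) are kept in every view
    fig.savefig(path)
    plt.close(fig)

render("original.png")                 # full image I
render("v_pres.png", include_b=False)  # V_pres: distractor removed, evidence kept
render("v_abl.png", include_a=False)   # V_abl: decisive series removed, answer unrecoverable
```

Because the edits are applied to the rendering code rather than to pixels, the resulting masks are exact and the surrounding chart layout stays intact.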

3. Loss Formulation and Optimization

BiPS augments the standard Group Relative Policy Optimization (GRPO) training objective, which aligns the VLM with the correct multiple-choice answer, by adding two KL-based terms:

  • KL-consistency ($\mathcal{L}_{\mathrm{consistency}}$):

$$\mathcal{L}_{\mathrm{consistency}} = \mathrm{clip}_{c_{\mathrm{cons}}}\, D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot\mid I,q)\;\|\;\mathrm{stopgrad}[\tilde\pi_{\theta}(\cdot\mid V_{\mathrm{pres}},q)]\bigr)$$

This term attracts the model's output on the full image $I$ towards its output on the evidence-preserving view, encouraging coverage of all supporting pixels.

  • KL-separation ($\mathcal{L}_{\mathrm{separation}}$):

$$\mathcal{L}_{\mathrm{separation}} = -\mathrm{clip}_{c_{\mathrm{sep}}}\, D_{\mathrm{KL}}\bigl(\pi_{\theta}(\cdot\mid I,q)\;\|\;\mathrm{stopgrad}[\tilde\pi_{\theta}(\cdot\mid V_{\mathrm{abl}},q)]\bigr)$$

This term repels the model's output away from that of the evidence-ablated view, discouraging reliance on incomplete or misleading visual information.

The combined BiPS training loss is:

$$\mathcal{L}_{\mathrm{BiPS}} = \mathcal{L}_{\mathrm{task}} + \lambda_{c}\,\mathcal{L}_{\mathrm{consistency}} + \lambda_{s}\,\mathcal{L}_{\mathrm{separation}}$$

where $\lambda_{c}$ and $\lambda_{s}$ are balancing coefficients ($\lambda_{c}=0.01$, $\lambda_{s}=0.02$), and $c_{\mathrm{cons}}$, $c_{\mathrm{sep}}$ denote clipping thresholds for numerical stability.
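
A minimal sketch of how the two clipped KL terms could be computed from answer log-probabilities is shown below, assuming PyTorch tensors over a vocabulary dimension. The helper names, clipping values, and reduction choices are illustrative assumptions, not the paper's implementation.

```python
import torch

def clipped_kl(logp_p: torch.Tensor, logp_q: torch.Tensor, clip: float) -> torch.Tensor:
    """KL(p || q) from log-probs over the last (vocab) dim, clipped for stability."""
    kl = (logp_p.exp() * (logp_p - logp_q)).sum(dim=-1)  # per position
    return kl.clamp(max=clip).mean()

def bips_regularizers(logp_full, logp_pres, logp_abl,
                      lam_c=0.01, lam_s=0.02, c_cons=10.0, c_sep=10.0):
    """Attractive/repulsive shaping terms added to the GRPO task loss.

    logp_full: log pi_theta(. | I, q)       -- gradients flow through this
    logp_pres: log pi_theta(. | V_pres, q)  -- fixed target (stop-grad)
    logp_abl:  log pi_theta(. | V_abl, q)   -- fixed target (stop-grad)
    Clipping thresholds c_cons / c_sep are assumed values, not from the paper.
    """
    l_cons =  clipped_kl(logp_full, logp_pres.detach(), c_cons)  # attract toward V_pres
    l_sep  = -clipped_kl(logp_full, logp_abl.detach(),  c_sep)   # repel from V_abl
    return lam_c * l_cons + lam_s * l_sep

# Usage sketch: total_loss = grpo_task_loss + bips_regularizers(lp_I, lp_pres, lp_abl)
```

The sign convention makes the separation term contribute more to the loss when the full-image distribution stays close to the ablated-view distribution, which is what pushes the model away from evidence-free shortcuts.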

4. Training Procedure and Implementation

BiPS employs a curriculum with two primary stages, optionally followed by a mixed-domain fine-tuning phase:

  1. Stage 1 (consistency only): The model is trained with the task loss plus KL-consistency on $(I, q, V_{\mathrm{pres}})$ pairs:

    for epoch = 1 … E1:
        for each (I, q, V_pres):
            π_orig = forward(θ, I, q)
            π_pres = forward(θ, V_pres, q)          # stop-grad target, no gradient
            L = GRPO_loss(π_orig; reward)
                + λ_c * clip(KL(π_orig ‖ stopgrad(π_pres)))
            θ ← AdamW_step(θ, ∇_θ L)
  2. Stage 2 (add separation): Both KL terms are used on $(I, q, V_{\mathrm{pres}}, V_{\mathrm{abl}})$ tuples:

    for epoch = 1 … E2:
        for each (I, q, V_pres, V_abl):
            π_orig = forward(θ, I, q)
            π_pres = forward(θ, V_pres, q)          # stop-grad target
            π_abl  = forward(θ, V_abl, q)           # stop-grad target
            L = GRPO_loss(π_orig; reward)
                + λ_c * clip(KL(π_orig ‖ stopgrad(π_pres)))
                − λ_s * clip(KL(π_orig ‖ stopgrad(π_abl)))
            θ ← AdamW_step(θ, ∇_θ L)
  • Model architecture: Qwen2.5-VL-7B is used as the base—a standard vision–language transformer with a frozen vision tower and multimodal encoder–decoder. No architectural changes are needed; BiPS operates entirely at the loss function level, passing all three views through the model, with stop-gradient applied to the masked views.
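
To make the three-view, loss-level design concrete, the schematic below organizes one Stage-2 step around a single model whose parameters receive gradients only through the full-image pass; the masked views are evaluated under torch.no_grad(). The interface name vlm_forward, the batch fields, and the omission of KL clipping are assumptions for illustration, not the authors' training code.

```python
import torch

def bips_stage2_step(vlm_forward, batch, grpo_loss, lam_c=0.01, lam_s=0.02):
    """One BiPS Stage-2 step: full image gets gradients, masked views are frozen targets."""
    # Gradient path: answer log-probs on the full image I
    logp_full = vlm_forward(batch["image"], batch["question"])

    # Stop-gradient targets: the same parameters, evaluated on the two masked views
    with torch.no_grad():
        logp_pres = vlm_forward(batch["v_pres"], batch["question"])
        logp_abl  = vlm_forward(batch["v_abl"], batch["question"])

    # KL(p || q) from log-probs (clipping omitted here for brevity)
    kl = lambda p, q: (p.exp() * (p - q)).sum(-1).mean()

    loss = (grpo_loss(logp_full, batch["reward"])
            + lam_c * kl(logp_full, logp_pres)   # attract toward V_pres
            - lam_s * kl(logp_full, logp_abl))   # repel from V_abl
    return loss
```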

5. Empirical Results and Quantitative Evaluation

BiPS yields significant performance improvements over the Qwen2.5-VL-7B baseline across eight diverse benchmarks. Table 1 summarizes accuracy gains:

| Benchmark | Qwen2.5-VL-7B base (acc. %) | BiPS-General (acc. %) | Gain (pp) |
|---|---|---|---|
| ChartXiv | 42.5 | 50.6 | +8.1 |
| ChartQAPro | 36.6 | 51.8 | +15.2 |
| ChartMuseum | 26.8 | 34.0 | +7.2 |
| Evochart | 52.0 | 68.7 | +16.7 |
| MathVista | 68.2 | 75.0 | +6.8 |
| MathVision | 25.2 | 28.6 | +3.4 |
| MathVerse-VO | 41.1 | 45.3 | +4.2 |
| MMStar | 62.1 | 65.7 | +3.6 |

The average gain is +8.2 percentage points (Zhang et al., 26 Dec 2025).

Ablations on GRPO-only baselines, individual KL constraints, and full BiPS confirm that the combination of consistency and separation losses, the curriculum ordering (consistency before separation), and the programmatic code-based masking strategies are each essential for maximal performance. Hyperparameter sweeps indicate robust gains across $\lambda_{c}\in[0.005,0.02]$ and $\lambda_{s}\in[0.01,0.04]$, with best results at $(\lambda_{c},\lambda_{s})=(0.01,0.02)$.

6. Limitations and Prospects

Although BiPS demonstrates superior generalization from synthetic chart training to out-of-domain data—including natural-image VQA with complex visual primitives (e.g., object counting, polylines)—the approach has notable constraints. The method requires access to chart-rendering code to generate exact evidence masks, and extension to natural images would necessitate robust saliency or segmentation supervision. Potential future directions include unsupervised saliency map or Class Activation Map (CAM) strategies to approximate $V_{\mathrm{pres}}$ for general images; replacing code-based editing with learned masking networks (e.g., U-Nets) trained on pseudo-labels; and integration with latent visual-token methods for enhanced intermediate reasoning capabilities (Zhang et al., 26 Dec 2025).

7. Relationship to Broader Research

By explicitly shaping both the positive and negative spaces of evidence via bidirectional KL constraints, BiPS advances the field of multimodal reasoning. Compared to earlier methods based on external visual cues or latent token generation, BiPS enforces true visual grounding without modifying neural architectures or incurring inference-time overhead. The methodology's generalization across domains and significant empirical improvements establish it as a rigorous framework for perceptually aware VLM training (Zhang et al., 26 Dec 2025).

References

  • Zhang et al., 26 Dec 2025.