See Less, See Right: Bi-directional Perceptual Shaping For Multimodal Reasoning

Published 26 Dec 2025 in cs.CV (2512.22120v1)

Abstract: Large vision-language models (VLMs) often benefit from intermediate visual cues, either injected via external tools or generated as latent visual tokens during reasoning, but these mechanisms still overlook fine-grained visual evidence (e.g., polylines in charts), generalize poorly across domains, and incur high inference-time cost. In this paper, we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training. BiPS first applies a KL-consistency constraint between the original image and an evidence-preserving view that keeps only question-relevant regions, encouraging coarse but complete coverage of supporting pixels. It then applies a KL-separation constraint between the original and an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer, discouraging text-only shortcuts (i.e., answering from text alone) and enforcing fine-grained visual reliance. Across eight benchmarks, BiPS boosts Qwen2.5-VL-7B by 8.2% on average and shows strong out-of-domain generalization to unseen datasets and image types.

Summary

  • The paper introduces BiPS, a training framework that enhances multimodal reasoning by enforcing bidirectional KL divergence on evidence-preserving and evidence-ablated views.
  • The approach uses a two-stage curriculum—consistency then separation—to overcome reliance on text shortcuts, resulting in up to 8.2% accuracy improvement on diverse chart and math tasks.
  • The method exhibits strong cross-domain generalization and data efficiency, offering a promising paradigm for robust, visually grounded VLMs without additional test-time computation.

Bi-directional Perceptual Shaping for Data-Efficient Multimodal Reasoning

Introduction

This work presents Bi-directional Perceptual Shaping (BiPS), a training framework for Vision-Language Models (VLMs) designed to enhance multimodal reasoning performance by explicitly modeling and incentivizing the visual grounding of answers. The motivation stems from pervasive visual perception bottlenecks in VLM-based visual question answering (VQA), where models often rely on text cues, coarse region selection, or domain-specific heuristics at inference, which impair generalization, incur high test-time costs, and lead to superficial or hallucinated responses. BiPS distinguishes itself by converting precise, question-specific visual cues (evidence-preserving and evidence-ablated views) into training-time signals, enforcing coarse coverage of all relevant regions and robust resistance to text-only shortcut learning, without requiring any custom modules or extra computation at test time (Figure 1).

Figure 1: Training and inference paradigms; prior VLM approaches use rigid evidence cues at inference, whereas BiPS internalizes fine-grained perceptual signals exclusively at training.

Methodology

BiPS introduces a training curriculum based on bidirectional KL-divergence constraints within the GRPO framework. The method decomposes into two complementary stages (a minimal code sketch of both objectives follows the list):

  • Consistency Stage: For each input (image, question) pair, a programmatically synthesized evidence-preserving view is constructed by removing, at the source-code level, all content irrelevant to the question. Training minimizes the KL-divergence between the model's answer distribution on the original image and that on the preserved view, encouraging complete coverage of the supporting evidence and discouraging unnecessary attention to distractors.
  • Separation Stage: An evidence-ablated view is generated by removing just the critical content such that the original answer is no longer supported. Training maximizes the KL-divergence between answer distributions on the original and ablated views, which penalizes predictions that are insensitive to missing evidence, thus blocking reliance on shortcuts and enforcing fine-grained visual reliance.
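
A minimal PyTorch-style sketch of the two objectives, assuming per-option answer logits are available for each view (the function and variable names, the weights `alpha`/`beta`, and the combined return value are illustrative, not from the paper):

```python
import torch
import torch.nn.functional as F

def bips_kl_terms(logits_orig: torch.Tensor,
                  logits_pres: torch.Tensor,
                  logits_abl: torch.Tensor,
                  alpha: float = 1.0,
                  beta: float = 0.1) -> torch.Tensor:
    """Illustrative bidirectional KL shaping terms.

    Each tensor holds answer logits of shape (batch, num_options) for the
    original, evidence-preserving, and evidence-ablated views.
    """
    log_p_orig = F.log_softmax(logits_orig, dim=-1)

    # Consistency: pull the original-view answer distribution toward the
    # evidence-preserving view. Detaching the preserved branch makes it a
    # fixed target (the paper's stop-gradient, sg[.]).
    p_pres = F.softmax(logits_pres, dim=-1).detach()
    l_cons = F.kl_div(log_p_orig, p_pres, reduction="batchmean")

    # Separation: push the original-view distribution away from the
    # evidence-ablated view; the sign flip means minimizing this term
    # maximizes the divergence, penalizing answers that survive evidence
    # removal (text-only shortcuts). A real implementation would likely
    # clip this unbounded term (the paper mentions clipping thresholds).
    p_abl = F.softmax(logits_abl, dim=-1).detach()
    l_sep = -F.kl_div(log_p_orig, p_abl, reduction="batchmean")

    # BiPS applies the two terms sequentially (consistency stage, then
    # separation stage); summing them here is only for compactness.
    return alpha * l_cons + beta * l_sep
```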

This bi-directional shaping is performed sequentially in a coarse-to-fine curriculum, shown to outperform joint or reversed-stage alternatives (Figure 2).

Figure 2: The BiPS framework applies KL-based constraints for answer consistency with evidence-preserving views and separation from evidence-ablated views in a two-stage curriculum.

Programmatic data generation leverages the ECD chart corpus, where chart instances are paired with executable code so that code-based edits yield perfectly aligned evidence-preserving and evidence-ablated images. An LLM-based arbitrator rewrites open-ended questions into multiple-choice format, filters out trivial instances, and controls evidence manipulations at the code level, resulting in 13K high-quality training tuples (Figure 3).

Figure 3: Data generation pipeline for producing chart-question paired evidence-preserving and -ablated views from chart code.
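
To make the code-level editing concrete, here is a hypothetical matplotlib example in the spirit of the pipeline (the chart, question, and file names are invented for illustration): given a question about one plotted series, re-rendering only that series yields the evidence-preserving view, and re-rendering everything except it yields the evidence-ablated view.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Toy chart source: two series; suppose the question asks about "sales".
series = {
    "sales":   [3, 5, 4, 7],  # question-relevant evidence
    "returns": [1, 2, 2, 3],  # distractor content
}

def render(path, keep):
    """Re-execute the chart code, drawing only the series named in `keep`."""
    fig, ax = plt.subplots()
    for name, ys in series.items():
        if name in keep:
            ax.plot(range(len(ys)), ys, marker="o", label=name)
    ax.set_xlabel("quarter")
    ax.set_ylabel("count")
    ax.legend(loc="upper left")
    fig.savefig(path)
    plt.close(fig)

render("original.png",  keep={"sales", "returns"})  # full image
render("preserved.png", keep={"sales"})             # evidence-preserving view
render("ablated.png",   keep={"returns"})           # evidence-ablated view
```

Because the edit happens in source code rather than pixel space, thin lines and overlapping marks are removed cleanly, which a bounding-box mask could not guarantee.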

Experimental Results

BiPS was benchmarked on eight diverse chart and general VQA tasks, including CharXiv, ChartQAPro, ChartMuseum, Evochart, MathVista, MathVision, MathVerse-VO, and MMStar. Fine-tuning Qwen2.5-VL-7B solely on the 13K programmatically generated chart samples yields a 7.3% average accuracy improvement—outperforming chart-specialized baselines trained on orders-of-magnitude more data. Incorporating an additional 39K math-specific samples via the GRPO objective further boosts average gains to 8.2%. Notably, BiPS demonstrates strong cross-domain generalization, achieving substantial improvements on both chart-centric and out-of-domain mathematical reasoning tasks.

Ablation studies validate the efficacy of both KL-based constraints: either the consistency or the separation loss alone yields clear improvements, with their combination offering the best overall performance. The curriculum order is critical: coarse-to-fine (consistency then separation) consistently outperforms joint optimization or fine-to-coarse schedules. Furthermore, the programmatic evidence-view generation strategy significantly surpasses random masking alternatives in grounding accuracy (Figure 4).

Figure 4: Sensitivity analysis of the consistency ($\alpha$) and separation ($\beta$) constraint coefficients on CharXiv performance.

Case studies on CharXiv reveal that standard VLMs frequently hallucinate plausible but visually unsupported answers or exploit dataset-specific statistical priors. BiPS-trained models instead produce responses tightly grounded in visual cues, precisely tracing plotted structures and numerical relationships (Figure 5).

Figure 5: Case study illustrating more visually grounded answers by BiPS-Chart compared to Qwen2.5-VL-7B on real chart questions.

Cross-domain transfer experiments further show improved performance on visual counting, with BiPS correctly tracking and subtracting objects, outperforming baselines that miss object-level reasoning (Figure 6).

Figure 6: BiPS correctly solves cross-domain visual counting tasks by robust object-level reasoning, unlike the baseline.

Implications and Future Directions

The theoretical implication of BiPS is the formalization of multimodal reasoning as a distributional alignment problem over fine-grained visual evidence, rather than a pipeline of intermediate tool calls or latent token inference. Practically, BiPS demonstrates that precise, code-based evidence view manipulation—used solely at training—enables efficient, generalizable, and robust perceptual reasoning without test-time overhead. The programmatic pipeline is tightly coupled to charts; however, the underlying principle could be extrapolated to other structured domains with accessible symbolic provenance (e.g., medical images, geometry diagrams) or potentially synthesized for natural images via advanced segmentation or generative editing.

The demonstrated data efficiency and out-of-domain generalization of BiPS suggest a broad utility for resource-constrained multimodal training and domain adaptation. Future extensions may include more granular compositional reasoning supervision, adaptation of the pipeline to heterogeneous image types, and integration with downstream reasoning protocols that exploit the disentangled perceptual grounding established by BiPS.

Conclusion

BiPS presents a robust framework for enhancing visual grounding in multimodal reasoning, converting programmatically generated, question-specific visual cues into bi-directional KL constraints during training. The method yields strong data efficiency, fine-grained perceptual capabilities, and cross-domain generalization, with no test-time computational overhead. BiPS provides an effective paradigm for future research in perceptually grounded VLM training, particularly in domains characterized by complex visual structure and the need for precise evidence localization.

Explain it Like I'm 14

Overview

This paper introduces a new way to train AI models that look at pictures and answer questions about them (called vision-language models, or VLMs). The idea is simple: help the model learn where to look in an image by showing it two special versions of the picture, one that keeps only the parts that matter and one that removes those parts. The method is called Bi-directional Perceptual Shaping (BiPS), and it helps the model “see less” but “see right,” so its answers are based on the correct visual evidence.

What questions did the researchers ask?

The paper asks:

  • How can we teach AI models to pay attention to the exact parts of an image that matter for a question (like thin lines or small symbols in charts), instead of guessing from text or big, simple shapes?
  • Can we train models using strong visual hints during training so they don’t need extra tools or steps during real use?
  • Will this approach work well not just on charts but also on other kinds of images and question types?

How did they do it? (Methods)

The researchers created a training strategy that gives the model smart “views” of the image based on the question:

  • Evidence-preserving view: a version of the image that keeps only the pieces needed to answer the question (like just the relevant curve in a chart).
  • Evidence-ablated view: a version of the image where the key evidence is removed, so the image no longer supports the original answer.

They use these views to shape the model’s behavior in two steps:

Turning “helper views” into training signals

Instead of using boxes or masks while the model is answering questions (which can be slow and miss fine details), they generate perfect training views and use them to teach the model what to rely on. During training only:

  • The model is pulled toward giving answers that match the evidence-preserving view (so it focuses on the right content).
  • The model is pushed away from giving answers that match the evidence-ablated view (so it won’t rely on text or shortcuts).

Think of it like practicing with two versions of a homework problem: one with the key hints highlighted and one with the hints erased, so you learn exactly which clues matter.

Two training steps: consistency and separation

  • Consistency step: The model’s predictions on the full image should be similar to its predictions on the “important parts only” view. This teaches “look here.”
  • Separation step: The model’s predictions on the full image should be different from its predictions on the “important parts removed” view. This teaches “don’t ignore the picture.”

To measure “similar” or “different,” they use a tool called KL divergence. You can think of it like a score for how far two sets of guesses are from each other—the lower the score, the more similar; the higher the score, the more different.
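
For intuition, a toy calculation with made-up numbers: if the model's probabilities over answer options (A, B, C) are similar on two views, the KL score is near zero; if they differ, it grows.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions (natural log)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

full      = [0.70, 0.20, 0.10]  # answer probs on the full image
preserved = [0.60, 0.30, 0.10]  # similar -> small KL (consistency is happy)
ablated   = [0.34, 0.33, 0.33]  # differs -> larger KL (separation is happy)

print(round(kl(full, preserved), 2))  # 0.03: nearly the same guesses
print(round(kl(full, ablated), 2))    # 0.29: clearly different guesses
```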

They train this inside a reinforcement learning setup called GRPO (Group Relative Policy Optimization), which improves the model step by step with rewards while keeping updates stable.
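
A rough sketch of the "group-relative" idea (illustrative only; the real objective also includes PPO-style clipped policy updates): rewards from several attempts at the same question are normalized within that group, so each attempt is scored relative to its siblings.

```python
import torch

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rollout rewards within one group (one question).

    rewards: tensor of shape (num_rollouts,), e.g. 8 rollouts per question.
    Returns zero-mean advantages used to weight the policy update.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: 8 attempts, reward 1 for a correct answer, 0 otherwise.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0])
print(group_relative_advantages(rewards))  # correct attempts score positive
```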

Where did the training data come from?

They needed very precise training views, especially for charts (which have fine lines, layered marks, axes, legends, etc.). So they:

  • Used a dataset of computer-generated charts paired with the code that draws them.
  • Programmatically edited the chart code to create exact evidence-preserving and evidence-ablated views—no human labeling needed.
  • Built 13,000 high-quality training examples this way.
  • Later added 39,000 math-related samples to further boost general reasoning.

What did they find?

  • Strong improvements: On eight different benchmarks, their method boosted the base model (Qwen2.5-VL-7B) by about 7–8% on average.
  • Better generalization: Training only on charts helped not just with charts (like CharXiv, ChartQAPro, and Evochart) but also with general visual reasoning tasks (like MathVista and MMStar).
  • Data efficiency: With just 13,000 chart examples, they beat or matched models trained on hundreds of thousands or even millions of samples.
  • No extra steps at test time: Because the “where to look” learning happens during training, the model doesn’t need special tools or extra images when answering questions later. It’s faster and less error-prone.

Some example numbers:

  • Average improvement: +7.3% using only chart-based training, and up to +8.2% after adding math-focused data.
  • Big gains on specific tasks (e.g., Evochart saw a jump of over +16 points).

Why it matters

  • More reliable answers: The model learns to ground its reasoning in the right visual evidence, avoiding guesswork or “text-only shortcuts.”
  • Handles fine details: It can catch thin lines, small symbols, and irregular shapes that simple boxes or crops often miss.
  • Works across domains: Even though training was chart-based, the method helped on other image-and-question tasks—showing strong generalization.
  • Faster in real use: No extra visual steps or tools are needed at inference time, which saves time and reduces errors.

Simple takeaway

BiPS teaches models where to look by training them with two smart versions of each image: one that keeps the important parts and one that removes them. This two-step “pull toward the right evidence, push away from wrong shortcuts” approach makes the model more accurate, more trustworthy, and more efficient—without needing extra help when it’s actually answering questions.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

Below is a concise list of what remains missing, uncertain, or unexplored in the paper, formulated to guide future research.

  • Domain generalization of paired-view construction:
    • The method relies on programmatic chart code to create evidence-preserving ($I_{\mathrm{pres}}$) and evidence-ablated ($I_{\mathrm{abl}}$) views. It remains unclear how to generate semantically faithful paired views for natural images, documents, diagrams, medical images, videos, or multi-image tasks where executable renderers are unavailable.
  • Fidelity of evidence ablation:
    • The paper assumes $I_{\mathrm{abl}}$ “no longer supports the original answer,” but does not quantify how often ablation truly invalidates the answer (flip rate), whether residual cues leak, or how partial ablations affect learning. A systematic evaluation of counterpart correctness is missing.
  • Reliability and bias in LLM-driven code editing:
    • The pipeline uses an LLM arbitrator to reformulate questions and edit chart code. The robustness, error rate, and bias of these edits—and their downstream impact on training—are not measured. How failures (e.g., wrong removal, unintended side effects) propagate into the KL training remains unknown.
  • Multiple-choice conversion effects:
    • Reformulating open-ended questions into multiple-choice may change task difficulty and answer distributions. The paper does not examine whether performance gains transfer to free-form answering or to tasks without discrete options.
  • Scope of sensitivity analysis:
    • Hyperparameter sensitivity (e.g., $\alpha$, $\beta$, $c_{\text{cons}}$, $c_{\text{sep}}$) is only shown on CharXiv. Cross-benchmark and cross-domain robustness to these coefficients and clipping thresholds is untested.
  • Training-time overhead and scalability:
    • While inference involves no extra tools, the compute and wall-clock costs of generating and verifying $I_{\mathrm{pres}}$/$I_{\mathrm{abl}}$ (LLM calls, code execution) are not reported. Scalability to larger datasets and more complex figures is an open question.
  • Data scaling laws and sample efficiency:
    • The method shows gains with 13K chart samples, but there is no systematic scaling analysis (more/less data, diversity of chart types) or characterization of diminishing returns.
  • Generality across base architectures:
    • BiPS is only evaluated on Qwen2.5-VL-7B. Whether the bidirectional KL shaping consistently benefits other VLMs (e.g., InternVL, LLaVA, GPT-4o-mini-like models) of varying sizes and vision backbones is not explored.
  • Theoretical understanding of KL objectives:
    • The paper provides empirical evidence for the coarse-to-fine curriculum but lacks a theoretical analysis of when forward-KL consistency and separation are optimal, how they interact with GRPO updates, and what failure modes (e.g., mode shrinkage or instability) can arise.
  • Alternative divergences and formulations:
    • Only forward KL is considered. The impact of reverse KL, symmetric KL, JS divergence, or contrastive objectives (e.g., InfoNCE) for consistency/separation is unstudied.
  • Curriculum design space:
    • The chosen two-stage schedule (consistency then separation) outperforms baselines, but the paper does not examine adaptive schedules (e.g., annealing $\alpha$/$\beta$, alternation per batch/episode) or task-conditional curricula.
  • Metrics for visual grounding:
    • Improvements are shown on aggregate accuracy, but there are no direct grounding diagnostics (e.g., attention maps, saliency agreement with $I_{\mathrm{pres}}$, answer robustness under controlled occlusions) to validate reduced text-only shortcuts or enhanced fine-grained perception.
  • Error taxonomy and failure analysis:
    • Beyond a small case study, there is no systematic analysis of error types (e.g., numerical misreadings, mislocalization, curve-following errors), nor measurement of hallucination rates under image perturbations or blanked inputs.
  • Robustness to imperfect counterparts:
    • The method beats random masking, but it does not test graded counterpart quality (noisy $I_{\mathrm{pres}}$/$I_{\mathrm{abl}}$, partial evidence removal) to assess tolerance to imperfect view generation.
  • Transfer beyond charts:
    • Although out-of-domain gains are reported on math/VO benchmarks, there is no evaluation on other structured image domains (documents, GUIs, floor plans) or natural scenes requiring irregular evidence localization (e.g., fine edges, textures).
  • Multi-image and video reasoning:
    • BiPS is designed for single-image inputs. How to construct paired views and apply bidirectional shaping to temporal or cross-image dependencies remains unexplored.
  • Interaction with other learning paradigms:
    • The paper pairs BiPS with GRPO; comparisons to DPO/MDPO, policy gradient variants, self-play, or supervised fine-tuning with counterpart-aware losses are absent.
  • Calibration and confidence effects:
    • The impact of consistency/separation on model calibration, uncertainty estimates, and abstention behavior (e.g., predicting “unanswerable” on $I_{\mathrm{abl}}$) is not measured.
  • Annotation-free view generation at inference:
    • While BiPS avoids test-time tools, an open question is whether learned internal policies can emit implicit evidence indicators (e.g., soft masks, attention heatmaps) that correlate with $I_{\mathrm{pres}}$ without explicit generation.
  • Counterpart generation quality control:
    • There is no automated verification pipeline (unit tests on code edits, render-time consistency checks, visual diffs) reported to guarantee that $I_{\mathrm{pres}}$ contains all necessary evidence and $I_{\mathrm{abl}}$ truly excludes it.
  • Impact on reasoning vs. perception disentanglement:
    • The method aims to improve perception, but whether reasoning chains or numerical manipulation steps become more reliable (e.g., fewer arithmetic slips) is not analyzed separately from perception gains.
  • Benchmarks and coverage:
    • ECD-Bench appears only in ablations; comprehensive inclusion and analysis across all reported benchmarks (with per-category breakdowns) is incomplete. No evaluation on OCR-heavy or text-dense charts is presented.
  • Reproducibility and resource requirements:
    • Full release status of the paired-view dataset, LLM prompts for code editing, and arbitrator configurations is unclear. Without these, reproducing the pipeline and its results may be challenging.
  • Potential negative side effects:
    • Separation could encourage changes in the answer distribution even when ablation preserves enough context to answer; the paper does not quantify any induced brittleness (e.g., increased sensitivity to minor visual perturbations) or trade-offs.
  • Extension to continuous-valued outputs:
    • Many chart tasks require precise numeric extraction; the paper reports accuracy but does not evaluate calibration of numeric predictions, tolerance to high-frequency curves, or value error distributions (e.g., RMSE/MAE on reading tasks).
  • Handling ambiguous or multi-evidence questions:
    • How BiPS deals with questions requiring multiple disjoint evidence regions (and how to construct $I_{\mathrm{pres}}$/$I_{\mathrm{abl}}$ in such cases) is not specified or evaluated.

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that can be built with the paper’s Bi-directional Perceptual Shaping (BiPS) method and data pipeline as-is or with minor adaptation. Each item includes sectors, potential tools/products/workflows, and key assumptions.

Industry

  • Evidence-grounded analytics assistants for BI dashboards
    • Sectors: software, finance, operations, marketing
    • What: VLM copilots that answer questions about embedded charts in BI tools (e.g., “Which region had the steepest Q3 drop?”) while relying on fine-grained evidence (lines, intersections, small markers) and avoiding text-only shortcuts.
    • Tools/products/workflows: BiPS training recipe integrated into model fine-tuning; “evidence-preserving vs. ablated” unit tests for model validation; plugin for Tableau/Power BI/Looker to expose chart QA.
    • Assumptions/dependencies: Access to underlying chart specifications or exportable SVG/metadata for robust counterpart generation; compute for GRPO fine-tuning; base VLM license compatibility.
  • Robust report QA and review bots for enterprises
    • Sectors: finance, consulting, insurance, manufacturing
    • What: Bots that validate claims in slide decks and PDF reports by cross-checking answers with evidence-preserving views and flagging any reliance on ablated views.
    • Tools/products/workflows: “Evidence-grounded review” CI step; auditors’ dashboard showing divergence under ablation; red-team test suites based on KL separation thresholds.
    • Assumptions/dependencies: Reliable PDF/figure parsing; permissioned access to internal reports; data privacy policies for model training/evaluation.
  • Chart-heavy customer support and RPA
    • Sectors: software, telecom, cloud operations
    • What: Assistants that explain monitoring charts (incidents, latency spikes), summarize multi-panel figures, and trigger workflows when evidence meets criteria (e.g., threshold crossings).
    • Tools/products/workflows: BiPS-fine-tuned models embedded into ticketing systems; alert explanation templates; automated “grounded summary” generators.
    • Assumptions/dependencies: Availability of chart images or code; alignment of chart semantics with task prompts; stable model APIs.
  • Evidence-grounded newsroom tools
    • Sectors: media, journalism
    • What: Plugins that produce “explain this chart” captions and quick fact checks against chart evidence to reduce misinterpretation in articles.
    • Tools/products/workflows: CMS-integrated captioner with ablation-based reliability score; editorial QA checklist referencing divergence metrics.
    • Assumptions/dependencies: Consent and licensing to process images; predictable chart templates (SVG preferred).

Academia

  • Scientific figure QA and auto-captioning
    • Sectors: academic publishing, research communication
    • What: Tools that check if conclusions match figure evidence (multi-panel, thin curves), and generate grounded figure captions.
    • Tools/products/workflows: LaTeX/Overleaf plugin; pre-submission figure QA; “evidence-ablation test” to flag potential over-claims.
    • Assumptions/dependencies: Access to figure images or vector sources; variable figure styles; limited gold labels for evaluation.
  • Benchmarking and training data synthesis for chart reasoning
    • Sectors: ML research, data curation
    • What: Use the paper’s programmatic code-editing pipeline to create preserve/ablate counterparts and MCQ conversions for new chart corpora, improving VLMs’ visual grounding.
    • Tools/products/workflows: BiPS data generator; public challenge sets with paired views; evaluation harness reporting KL-consistency/separation metrics.
    • Assumptions/dependencies: Executable chart code or structured provenance (e.g., Vega/Matplotlib); small LLM arbitrator for MCQ validation.

Policy and Governance

  • Evidence-grounding compliance checks for model deployments
    • Sectors: governance, risk, compliance
    • What: Standardized “ablation robustness” tests in model validation reports to demonstrate reduced shortcut reliance and lower hallucination risk for chart-related tasks.
    • Tools/products/workflows: Governance templates and checklists; dashboards showing divergence under ablated views; thresholds for acceptable evidence reliance.
    • Assumptions/dependencies: Agreement on metrics and thresholds; reproducible test datasets; documentation of training regimen.
  • Energy- and latency-conscious AI procurement
    • Sectors: public sector IT, enterprise IT
    • What: Favor solutions that internalize perception at training (no inference-time tool chains), reducing runtime overhead and energy use.
    • Tools/products/workflows: Procurement criteria rewarding “no test-time tool” pipelines; cost–carbon reporting comparing tool-heavy vs. BiPS-like approaches.
    • Assumptions/dependencies: Transparent vendor documentation; standardized energy measurement.

Daily Life

  • “Explain my chart” assistants for personal use
    • Sectors: consumer software, education
    • What: Apps that interpret fitness, finance, or weather charts on-device/cloud, giving grounded answers like “Your steps peaked on Tuesday.”
    • Tools/products/workflows: Mobile app feature using a BiPS-fine-tuned small VLM; optional “show evidence” overlay.
    • Assumptions/dependencies: Consent to process images; variability of consumer chart styles.
  • Homework and study helpers for math/physics with figures
    • Sectors: education
    • What: Tutors that read problem diagrams/charts and give grounded hints without hallucinating structure.
    • Tools/products/workflows: LMS plugins; “evidence-preserving” mini-views as pedagogical aids.
    • Assumptions/dependencies: School privacy constraints; need for stronger domain coverage beyond charts for some curricula.

Long-Term Applications

Below are higher-impact directions that require additional research, scaling, or domain-specific data/programmatic view generation pipelines.

Industry

  • Multi-domain document intelligence with fine-grained grounding
    • Sectors: legal, pharma, energy, manufacturing
    • What: End-to-end systems that parse diverse technical figures (process diagrams, CAD schematics, geo maps) and answer grounded questions for audits, safety, and operations.
    • Tools/products/workflows: Domain-specific preserve/ablate generators built from source artifacts (CAD, GIS layers); cross-modal provenance tracing.
    • Assumptions/dependencies: Access to structured source layers (CAD/GIS/diagram objects) rather than raw pixels; robust parsing; custom data agreements.
  • Financial risk and compliance analytics over complex figures
    • Sectors: finance, fintech, audit
    • What: Evidence-grounded analysis of prospectuses and regulatory filings that contain intricate charts, automatically surfacing discrepancies and supporting/contradicting evidence.
    • Tools/products/workflows: Filing ingestors; ediscovery-like search with ablation divergence scoring; auditor co-pilots.
    • Assumptions/dependencies: High-fidelity vector extraction; reliable OCR/structure alignment; regulatory acceptance of evaluation methodology.

Healthcare and Scientific Imaging

  • Medical imaging assistants with fine structure grounding
    • Sectors: healthcare, medical devices
    • What: Extend BiPS-style shaping to enforce reliance on lesion contours and subtle structures (e.g., microcalcifications), reducing text-only bias in report generation and QA.
    • Tools/products/workflows: Programmatic preserve/ablate view generation via segmentation masks; radiology report copilot with “show supporting pixels.”
    • Assumptions/dependencies: High-quality, expert-validated masks; regulatory approval; domain-shift mitigation beyond charts.
  • Laboratory automation and instrument UIs
    • Sectors: life sciences, materials science
    • What: Grounded reasoning over plots from instruments (chromatograms, spectra) for anomaly detection and parameter recommendation.
    • Tools/products/workflows: Instrument SDKs exposing peak/line provenance to synthesize counterparts; lab assistant copilots.
    • Assumptions/dependencies: Vendor cooperation for provenance access; rigorous validation on real data.

Robotics and Autonomy

  • Perception shaping for safety-critical HMI and dashboards
    • Sectors: automotive, aviation, industrial robotics
    • What: Assistants that interpret multi-panel telemetry displays and issue grounded, evidence-based alerts, minimizing shortcut-induced false alarms.
    • Tools/products/workflows: Simulator-based counterpart synthesis; ablation stress-tests; runtime “evidence required” gating for alerts.
    • Assumptions/dependencies: Realistic simulators; certification processes; failure mode analysis.
  • On-robot grounded visual reasoning with structured overlays
    • Sectors: robotics
    • What: Extend bidirectional shaping to environments where overlays/annotations (maps, trajectories) can be programmatically preserved/ablated for training robust policies.
    • Tools/products/workflows: Training pipelines coupling sensor data with programmatic masks; GRPO variants tuned for embodied settings.
    • Assumptions/dependencies: Accurate environment labels; sim-to-real transfer; compute budget on-device or edge.

Education and Accessibility

  • Universal figure accessibility with verifiable grounding
    • Sectors: accessibility tech, education
    • What: Auto-generate alt text and interactive explanations for complex figures across textbooks and the web, with ablation-backed evidence confidence.
    • Tools/products/workflows: Browser and e-reader extensions; “confidence under ablation” scores exposed to users.
    • Assumptions/dependencies: Diverse figure formats; fairness evaluation; multilingual support.

Policy and Standards

  • Evidence-grounded AI evaluation standards
    • Sectors: standardization bodies, regulators
    • What: Formalize “evidence-preserving consistency” and “evidence-ablated separation” as audit metrics for multimodal models used in high-stakes settings.
    • Tools/products/workflows: Reference datasets with paired counterparts; guidance docs; certification checklists and thresholds.
    • Assumptions/dependencies: Cross-stakeholder consensus; reproducibility; avoiding perverse incentives (e.g., overfitting to ablation styles).
  • Provenance-first figure ecosystems
    • Sectors: publishing, software tooling
    • What: Encourage publishing pipelines to retain executable/chart provenance (e.g., Vega/Matplotlib notebooks) to enable programmatic counterpart synthesis for trustworthy AI assistants.
    • Tools/products/workflows: “Provenance-required” submission policies; repositories storing code + figure; viewer apps supporting evidence overlays.
    • Assumptions/dependencies: Cultural and workflow changes; IP/privacy concerns; tooling maturity.

Cross-Cutting Tools That Could Emerge

  • BiPS Trainer SDK
    • What: Open-source library to add consistency/separation KL terms to GRPO/PPO pipelines, plus utilities for ablation-based evaluation.
    • Assumptions/dependencies: Stable RL training infra; base model access.
  • Counterpart Synthesizer Suite
    • What: Adapters that generate evidence-preserving/ablated views from different sources: chart code (today), SVG layers, segmentation masks, CAD/GIS layers (future).
    • Assumptions/dependencies: Domain-specific parsers and provenance.
  • Grounding QA Dashboard
    • What: Continuous evaluation with divergence-under-ablation metrics, stress tests, and failure analytics for MLOps (a minimal metric sketch follows this list).
    • Assumptions/dependencies: Agreement on metrics; dataset curation.
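
As a minimal illustration of a divergence-under-ablation metric (everything here is hypothetical: the `model` callable stands in for any VLM inference API that returns answer-option probabilities), such a dashboard might flag items whose answers barely change when the evidence is removed:

```python
import math

def kl(p, q, eps=1e-8):
    """KL(p || q) between two {option: prob} dictionaries."""
    return sum(pv * math.log(pv / max(q.get(k, 0.0), eps))
               for k, pv in p.items() if pv > 0)

def flag_shortcut_risk(model, items, min_divergence=0.5):
    """Flag items whose answers barely change under evidence ablation.

    model: callable (image_path, question) -> {option: prob}; a stand-in
           for any VLM inference API.
    items: dicts with 'original' and 'ablated' image paths and a 'question'.
    Low KL(original || ablated) suggests the model may be answering from
    text priors rather than from the (now removed) visual evidence.
    """
    flagged = []
    for item in items:
        p = model(item["original"], item["question"])
        q = model(item["ablated"], item["question"])
        if kl(p, q) < min_divergence:
            flagged.append(item)
    return flagged
```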

Notes on feasibility and risk across applications:

  • Domain transfer beyond charts requires structured provenance or high-quality masks to avoid brittle, coarse ablations.
  • The LLM arbitrator used for question reformulation introduces a dependency; weak or biased arbitrators can degrade training signal.
  • GRPO and KL terms require careful coefficient tuning to avoid destabilizing optimization; compute cost for training can be significant, though inference is efficient.
  • Governance acceptance depends on transparent documentation of counterpart generation and evaluation protocols.

Glossary

  • Ablation study: A controlled analysis that removes or varies components to measure their contribution to performance. "Ablation study on the components of BiPS."
  • Arbitrator (LLM arbitrator): An auxiliary LLM used to reformulate, validate, and edit supervision signals or code. "we employ an auxiliary LLM arbitrator, GPT5-mini, to convert the original questions into a multiple-choice format."
  • Bi-directional KL constraints: A pair of KL-based objectives that pull predictions toward an evidence-preserving view and push them away from an evidence-ablated view. "Our core methodology comprises two complementary KL constraints operating on the paired views constructed in Section~\ref{sec:data_construction}."
  • Bi-directional Perceptual Shaping (BiPS): The proposed training-time framework that uses paired views to shape a model’s visual grounding via KL-based consistency and separation. "we propose Bi-directional Perceptual Shaping (BiPS), which transforms question-conditioned masked views into bidirectional where-to-look signals that shape perception during training."
  • Coarse-to-fine training curriculum: A training schedule that first teaches broad evidence focus (consistency) and then enforces fine-grained grounding (separation). "This coarse-to-fine curriculum first applies the positive signal ($\mathcal{L}_{\text{cons}}$) and then the regularizer ($\mathcal{L}_{\text{sep}}$) to ensure the learned policy is both accurate and grounded."
  • Evidence-ablated view: A modified image where critical visual evidence is removed so the original answer is no longer supported. "an evidence-ablated view where critical pixels are masked so the image no longer supports the original answer"
  • Evidence-preserving view: A modified image that retains only the regions necessary to answer the question. "an evidence-preserving view that keeps only question-relevant regions"
  • Executable reasoning: Reasoning that operates over code or programs which can be run to verify or derive answers. "convert charts into structured symbolic programs or code for executable reasoning"
  • Executable rendering code: Source code that deterministically generates the chart images used for precise edits and supervision. "paired with executable rendering code."
  • Explicit provenance: Metadata linking visual elements back to their source objects, enabling exact edits and traceability. "every object (marks, layers, axes, legends) has explicit provenance, which enables exact edits to synthesize the two complementary views."
  • Group Relative Policy Optimization (GRPO): A PPO-style RL method that normalizes rewards within rollout groups to stabilize optimization. "Group Relative Policy Optimization (GRPO) framework"
  • Group-relative advantage: An advantage estimate normalized within a rollout group for stable policy updates. "Here, $A_t$ denotes the group-relative advantage,"
  • Kullback–Leibler (KL) divergence: A measure of dissimilarity between probability distributions used to enforce consistency or separation. "applies a consistency constraint based on the Kullback–Leibler (KL) divergence"
  • KL-consistency constraint: A KL-based objective that aligns predictions on the full image with those on an evidence-preserving view. "applies a KL-consistency constraint between the original image and an evidence-preserving view"
  • KL-separation constraint: A KL-based objective that forces predictions on the full image to diverge from those on an evidence-ablated view. "It then applies a KL-separation constraint between the original and an evidence-ablated view"
  • Latent visual tokens: Internal visual representations produced during reasoning instead of explicit intermediate images or tools. "generated as latent visual tokens during reasoning"
  • Out-of-domain generalization (OOD): The ability to transfer to unseen datasets or image types beyond the training distribution. "shows strong out-of-domain generalization to unseen datasets and image types."
  • Programmatic code-editing: Automated modification of chart-generating code to create precise preserved/ablated views. "We compare our programmatic code-editing strategy against a random masking baseline."
  • Programmatic data pipeline: An automated process that constructs paired supervision by editing and rendering charts from code. "We therefore build a programmatic data pipeline for chart data that generates the required evidence-preserving and evidence-ablated views."
  • Proximal Policy Optimization (PPO): A popular on-policy RL algorithm using clipped objective updates; GRPO extends it. "extends Proximal Policy Optimization (PPO) by normalizing rewards across rollouts within the same group"
  • Random masking: A baseline that hides random image patches to create alternative inputs for training or evaluation. "a random masking baseline"
  • Reinforcement Learning from Verifiable Rewards (RLVR): An RL setup where rewards are programmatically checkable to ensure correctness. "Following a typical RLVR setting, we refine this data"
  • Rollouts: Multiple sampled trajectories (model attempts) used to estimate advantages and stabilize RL training. "We perform 8 rollouts for each validated question."
  • Stop-gradient: An operation that prevents gradient flow through a computation branch so it serves as a fixed target. "$\operatorname{sg}[\cdot]$ indicates stop-gradient so that the $I_{\text{pres}}$ branch serves as a fixed target;"
  • Text-only shortcuts: Failure modes where the model answers from text priors without using visual evidence. "discouraging text-only shortcuts (i.e., answering from text alone)"
  • Visual chain-of-thought: A training or inference process that uses intermediate visual steps (e.g., boxes, masks) analogous to textual CoT. "“visual chain-of-thought” traces"
  • Visual hallucinations: Model outputs asserting unsupported visual content or interpretations. "to avoid visual hallucinations and text-only shortcuts"
  • Vision-language model (VLM): A model jointly processing visual and textual inputs for multimodal reasoning. "Large vision-language models (VLMs) are increasingly serving as a unified interface for both visual and language-based reasoning"
