Drawing-Grounded Document QA
- Drawing-grounded document QA is a paradigm that integrates text answering with explicit spatial evidence localization using bounding boxes in complex documents.
- It leverages specialized methodologies such as plug-and-play regression heads and segment–judge–generate pipelines, utilizing benchmarks like BoundingDocs and BBox DocVQA.
- This approach enhances interpretability and reliability in regulatory, legal, and scientific domains by precisely identifying evidence locations for each answer.
Drawing-grounded document question answering (QA) refers to the suite of methods, benchmarks, and evaluation protocols that tightly couple answer generation with explicit spatial evidence localization—most systematically, with bounding-box supervision—within visually complex documents. This paradigm addresses the critical shortcoming of conventional text-based or even multimodal document QA: the inability to reliably indicate not just what the answer is, but precisely where in the document the supporting evidence resides. Drawing-grounded QA is thus foundational for robust, interpretable systems in regulatory, legal, and scientific domains, where transparent reasoning and evidence traceability are essential.
1. Formulation and Objectives
Drawing-grounded document QA tasks operationalize the requirement that for every question–answer (QA) pair over a document, not only must the answer be textually correct or semantically consistent, but the model must also predict one or more bounding boxes (or analogous spatial pointers) on the source page. These bounding boxes must demarcate the precise visual region(s) that encode or justify the answer, such as a numerical entry in a table, a labeled figure component, or a paragraph with salient details.
The dominant motivation lies in separating textual understanding from spatial localization, enabling evaluation of spatial semantic alignment. This is formalized in datasets including BBox DocVQA (Yu et al., 19 Nov 2025), BoundingDocs (Chen et al., 12 Sep 2025), and JDocQA (Onami et al., 2024), all of which provide human- or model-curated bounding boxes per QA instance, often alongside explicit page and document references.
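The per-instance structure these benchmarks share can be sketched as a small data model. This is an illustrative schema only, with hypothetical field names, not the actual annotation format of BoundingDocs, JDocQA, or BBox DocVQA:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class BBox:
    # Normalized coordinates in [0, 1], a common convention for grounded-QA boxes.
    x0: float
    y0: float
    x1: float
    y1: float

@dataclass
class GroundedQA:
    question: str
    answer: str
    doc_id: str
    page: int
    # One or more evidence regions; BoundingDocs uses exactly one,
    # JDocQA and BBox DocVQA allow several.
    evidence: List[BBox] = field(default_factory=list)

qa = GroundedQA(
    question="What is the invoice total?",
    answer="$1,200",
    doc_id="doc-001",
    page=2,
    evidence=[BBox(0.12, 0.40, 0.55, 0.47)],
)
```

The key design point is that the evidence boxes are first-class annotations tied to the QA pair, so spatial grounding can be evaluated independently of answer-string correctness.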
2. Benchmark Datasets and Taxonomy
Three flagship benchmarks exemplify and drive progress in drawing-grounded document QA:
| Dataset | Scale (Docs/QA) | Annot. Box Types | Notable Features |
|---|---|---|---|
| BoundingDocs v2.0 | 48,151 / 249,016 | Single GT bbox per QA, 8 languages | Robust ANLS/IoU metrics |
| JDocQA | 5,504 / 11,600 | Multi-box, human-drawn (JP) | Unanswerable Qs, visual focus |
| BBox DocVQA | 3,751 / 32,403 | Multi-box; SPSBB/SPMBB/MPMBB settings | Scientific (arXiv) docs, fine-grained localization |
BoundingDocs v2.0 (Chen et al., 12 Sep 2025) presents a large, multilingual corpus (invoices, contracts, receipts) with exactly one box per QA, and a standardized metric suite. JDocQA (Onami et al., 2024) targets Japanese documents, emphasizing multi-box, multi-modality, and includes 1,000 unanswerable QA pairs. BBox DocVQA (Yu et al., 19 Nov 2025) covers scientific (arXiv) literature, stratifying QA into single-/multi-page and single-/multi-bbox settings (SPSBB, SPMBB, MPMBB), and features a validated benchmark for fine-grained evidence localization.
3. Methodological Foundations
Drawing-grounded QA systems are characterized by explicit architectural and pipeline design to disentangle answer content from spatial grounding:
- Plug-and-Play Regression Heads: DocExplainerV0 (Chen et al., 12 Sep 2025) introduces a modular bounding-box regressor trained atop a frozen vision–language encoder (e.g., SigLiP2), consuming both visual embeddings of the document image and text embeddings of the model’s answer string. Spatial and textual branches are fused via dual linear layers, with normalized box coordinates predicted by a regression head. Only the fusion and regression layers are trained on paired (answer, box) data, with all upstream parameters frozen.
- Automated Region Proposal: BBox DocVQA (Yu et al., 19 Nov 2025) employs a Segment–Judge–Generate pipeline: the Segment Anything Model (SAM) first proposes regions; a large VLM (Qwen2.5-VL-72B) semantically validates and types each candidate; overlapping boxes are deduplicated by content type, and GPT-5 generates constrained QA pairs, strictly tied to crop content.
- Annotation Protocols: JDocQA (Onami et al., 2024) uses human annotators to draw visual regions central to QA, supporting direct evaluation of table/chart reading and grounding. Text is extracted via PDF parsing or OCR, with rasterizations at standardized resolutions guiding box placement.
- Decoupled Losses: Training regimes often optimize textual answering and spatial grounding under distinct objectives:
  - Autoregressive cross-entropy (text loss) for answer sequence generation,
  - Huber (Smooth L₁) or similar regression loss for bounding-box coordinates, e.g.
    $$\mathcal{L}_{\text{box}} = \sum_{i \in \{x_0, y_0, x_1, y_1\}} \begin{cases} \tfrac{1}{2}(\hat{b}_i - b_i)^2 / \beta & \text{if } |\hat{b}_i - b_i| < \beta \\ |\hat{b}_i - b_i| - \tfrac{1}{2}\beta & \text{otherwise,} \end{cases}$$
  with the text loss held fixed, except in joint multimodal fine-tuning scenarios.
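A minimal NumPy sketch of the plug-and-play design can make this concrete. This is not the actual DocExplainerV0 implementation: embedding dimensions, activations, and the fusion scheme here are placeholders, and the frozen-encoder outputs are stand-in random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for frozen encoder outputs (hypothetical dimensions).
d_vis, d_txt, d_fuse = 64, 32, 48
vis_emb = rng.normal(size=d_vis)   # document-image embedding (frozen upstream)
txt_emb = rng.normal(size=d_txt)   # answer-string embedding (frozen upstream)

# Only these parameters are trainable: one linear branch per modality,
# fused, then a small regression head predicting (x0, y0, x1, y1).
W_vis = rng.normal(size=(d_fuse, d_vis)) * 0.1
W_txt = rng.normal(size=(d_fuse, d_txt)) * 0.1
W_box = rng.normal(size=(4, d_fuse)) * 0.1

def predict_box(vis, txt):
    fused = np.tanh(W_vis @ vis + W_txt @ txt)       # dual-branch fusion
    return 1.0 / (1.0 + np.exp(-(W_box @ fused)))    # sigmoid -> normalized coords

def smooth_l1(pred, target, beta=1.0):
    """Huber / Smooth-L1 regression loss over box coordinates."""
    diff = np.abs(pred - target)
    return np.where(diff < beta, 0.5 * diff**2 / beta, diff - 0.5 * beta).mean()

pred = predict_box(vis_emb, txt_emb)
gt = np.array([0.12, 0.40, 0.55, 0.47])  # ground-truth normalized box
loss = smooth_l1(pred, gt)
```

Because gradients only flow through the fusion and regression weights, the upstream VLM stays untouched, which is what makes the module attachable to proprietary or frozen models.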
4. Evaluation Metrics and Protocols
Effectiveness in drawing-grounded QA hinges on evaluating both answer fidelity and localization accuracy:
- Textual Answering: Metrics include Average Normalized Levenshtein Similarity (ANLS) (Chen et al., 12 Sep 2025), Exact Match (EM), and F1 (token overlap). These are robust to minor formatting or spelling deviations.
- Spatial Grounding: The primary measure is Intersection over Union (IoU) between predicted and annotated boxes, $\mathrm{IoU} = \frac{|B_{\text{pred}} \cap B_{\text{gt}}|}{|B_{\text{pred}} \cup B_{\text{gt}}|}$. IoU-EM@τ (the proportion of samples with IoU ≥ τ) and mean Average Precision (mAP) at varying thresholds capture localization robustness across tasks (Yu et al., 19 Nov 2025). Recall@k and Precision@k are used when multiple predicted boxes per QA are allowed.
- Joint Scoring: Simultaneous accuracy in both textual and spatial domains is emphasized in benchmarks (e.g., BBox DocVQA), revealing cases where high answer string accuracy is not matched by evidence localization, and vice versa.
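The core metrics above are simple to implement from their definitions. The sketch below follows the common conventions for these measures (the τ = 0.5 cut-off in ANLS follows the usual DocVQA convention); it is a reference implementation for illustration, not the benchmarks' official scoring code:

```python
def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    m, n = len(a), len(b)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (a[i - 1] != b[j - 1]))
        prev = cur
    return prev[n]

def anls(pred: str, gold: str, tau: float = 0.5) -> float:
    # Normalized Levenshtein similarity, zeroed below the threshold.
    if not pred and not gold:
        return 1.0
    s = 1.0 - levenshtein(pred.lower(), gold.lower()) / max(len(pred), len(gold))
    return s if s >= tau else 0.0

def iou(a, b) -> float:
    # Boxes as (x0, y0, x1, y1); intersection area over union area.
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union else 0.0

def iou_em_at(preds, gts, tau: float = 0.5) -> float:
    # IoU-EM@tau: fraction of samples whose predicted box hits IoU >= tau.
    hits = sum(iou(p, g) >= tau for p, g in zip(preds, gts))
    return hits / len(preds)
```

Joint scoring then amounts to requiring both a high ANLS on the answer string and a passing IoU on the predicted box for the same sample.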
5. Performance Benchmarks and Limitations
Baseline experiments across the principal datasets indicate strong text answering from state-of-the-art VLMs, but persistent shortcomings in spatial grounding:
| Model / Setting | Text Metric | Spatial Metric |
|---|---|---|
| SmolVLM (zero-shot) | ANLS 0.527 | MeanIoU 0.011 |
| QwenVL (anchors) | ANLS 0.694 | MeanIoU 0.051 |
| Qwen2.5-VL-72B (BBox DocVQA) | – | MeanIoU 0.352 |
| DocExplainerV0 (SmolVLM) | ANLS 0.572 | MeanIoU 0.175 |
| OCR-lookup upper bound | ANLS 0.556–0.690 | MeanIoU 0.405–0.494 |
Even under advanced prompting (anchor-based, chain-of-thought), leading generative VLMs remain essentially agnostic to spatial location, as reflected by negligible MeanIoU scores (<0.06 on BoundingDocs v2.0 (Chen et al., 12 Sep 2025)). Plug-in regressors such as DocExplainerV0 lift bounding-box accuracy by 3–5× (to ∼0.18), but fall short of upper bounds obtained by direct OCR-match strategies (∼0.5). Notably, VLMs tuned on BBox DocVQA data see 10–20 point gains in IoU and 5–10 point gains in EM for initially weak models (Yu et al., 19 Nov 2025).
A key finding is that abstract or non-verbatim reasoning questions—where the answer is generated, synthesized, or obliquely restated—expose the hardest cases for bounding-box recoverability, as the simple OCR-lookup strategy fails and the regressor is forced to bridge semantic matches.
6. Implications for Interpretability and System Design
Drawing-grounded QA frameworks offer a transparent rationale for every prediction, surfacing the exact document regions drawn upon during inference—whether in regulatory compliance, contract validation, or data extraction (Chen et al., 12 Sep 2025). This explicit spatial attribution facilitates human auditing, error analysis, and model trustworthiness. Furthermore, bounding-box or cropped-region supervision is instrumental in moderating hallucination in generative LLMs, as evidenced by ablations in JDocQA: incorporation of explicitly unanswerable instances leads models to abstain or reply “not mentioned in the text” rather than hallucinating content (Onami et al., 2024).
The architectural modularity of plug-and-play regressors (e.g., DocExplainerV0) enables attachment to proprietary or frozen VLMs, extending grounding capability without expensive joint retraining (Chen et al., 12 Sep 2025). Segment–Judge–Generate pipelines in BBox DocVQA enable scalable QA generation, human-in-the-loop verification, and systematic coverage of reading comprehension and spatial reasoning cases (Yu et al., 19 Nov 2025).
7. Open Challenges and Future Directions
Key limitations and research frontiers include:
- Spatial Awareness Deficit: State-of-the-art VLMs—even with anchor-based or chain-of-thought prompting—remain largely “blind” to layout until explicit grounding modules or fine-tuning are applied (Chen et al., 12 Sep 2025).
- Beyond Single-Box Grounding: Scaling to multi-box, hierarchical, or cross-page answers (MPMBB settings in BBox DocVQA) is error-prone, with substantial accuracy drops, especially in cross-page spatial reasoning (Yu et al., 19 Nov 2025).
- Insufficient Region Proposal: Reliance on static crop annotation (human or model-based) risks overlooking latent relevant regions; end-to-end or retrieval-augmented architectures are proposed as remedies.
- Extension to Heatmap Attribution: Moving beyond bounding boxes to heatmap-based or attention-mapped grounding could enable finer-grained interpretability and new evaluation metrics.
- Layout- and Multilinguality-Aware Architectures: Integration of layout-aware transformers (e.g., LayoutLMv2, UDOP) and expanded multilingual resources are highlighted as promising advances (Onami et al., 2024).
Proposed future directions include joint multimodal fine-tuning (balancing the text and box losses under an optimized weighting), expansion to non-extractive answers, and automated or weakly supervised region proposal for generalization beyond curated datasets.
References
- "Towards Reliable and Interpretable Document Question Answering via VLMs" (Chen et al., 12 Sep 2025)
- "BBox DocVQA: A Large Scale Bounding Box Grounded Dataset for Enhancing Reasoning in Document Visual Question Answer" (Yu et al., 19 Nov 2025)
- "JDocQA: Japanese Document Question Answering Dataset for Generative LLMs" (Onami et al., 2024)