
PDF-VQA: Multimodal Document Q&A

Updated 22 February 2026
  • PDF-VQA is a domain of methods and benchmarks that enable answering natural language questions grounded in rich, multimodal content of PDF documents.
  • It leverages multimodal analysis by integrating text, visual, and layout features to navigate complex multi-page structures with hierarchical and cross-page reasoning.
  • Advanced retrieval pipelines and model architectures, including graph-based and sparse sampling techniques, drive improved entity localization and answer precision.

PDF-VQA (PDF Visual Question Answering) denotes the class of methods, datasets, and benchmarks dedicated to answering natural-language questions grounded in the rich, hierarchical, and multimodal content of Portable Document Format (PDF) documents. Unlike standard VQA tasks focused on natural images or single-page document images, PDF-VQA targets the unique challenges of scholarly articles, financial reports, and technical documents: variable layout, multi-page span, interleaved figures/tables and text blocks, and semantic hierarchies. Systems must combine advances in vision-language modeling, hierarchical retrieval, graph-based reasoning, and efficient context management to enable accurate, interpretable, and language-agnostic question answering for real-world PDF workflows.

1. Task Definition and Unique Challenges

The PDF-VQA task is defined as follows: given a multi-page PDF document D and a natural-language query Q (possibly referencing a section, figure, or cross-page concept), the system must locate the correct content (text span, table cell, figure, or structural entity) providing the answer, and optionally generate a precise, context-sensitive response (Ding et al., 2023, Ding et al., 2024, Xie et al., 2024). Core challenges include:

  • Heterogeneous multimodality: PDF documents encode text, vector graphics, raster figures, mathematical notation, and variable page layout. Entities such as tables and captions may span multiple pages or columns.
  • Hierarchical and cross-page structure: Long documents (10–100+ pages) have semantic hierarchies (sections, subsections, lists, references), cross-references, and non-linear reading order.
  • Scale and input length: Transformer architectures are typically limited to 4k–32k token windows, while journal articles easily reach 50k+ tokens, requiring evidence selection or hierarchical modeling (Ding et al., 2024, Xie et al., 2024).
  • Grounded evidence retrieval: Answers must be localized to document entities and not merely generated in isolation; grounded responses prevent hallucinations and support interpretability.

2. Representative Datasets and Annotation Strategies

PDF-VQA research leverages several benchmark datasets, each engineered to stress distinct aspects of document understanding:

| Dataset | Documents / Pages | QA Pairs | Targeted Capabilities |
|---|---|---|---|
| PDFVQA (Ding et al., 2023) | 1,000 / 9,000 | 60,000 | Existence, counting, layout, hierarchy |
| PDF-MVQA (Ding et al., 2024) | 3,146 / 30,239 | 262,928 | Hierarchical, entity-level retrieval |
| PaperPDF (Xie et al., 2024) | 89,000 / — | 1.1M | Multi-evidence retrieval, cross-modality |

Annotation protocols emphasize (i) explicit entity markup—bounding boxes, structural tags, page numbers, (ii) QA pairs referencing multi-level relations (existence, counting, parent/child, reference resolution), and (iii) human verification for subset quality control (acceptance >92% in PDFVQA (Ding et al., 2023)).
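To make the markup concrete, a single annotation record under protocols (i)–(ii) might look like the following. This is a hypothetical schema for illustration only; the field names are not the datasets' actual format.

```python
# Hypothetical QA annotation record: entity markup (bbox, page, type)
# plus a relation-typed question. Field names are illustrative.
qa_record = {
    "question": "How many tables appear in Section 3?",
    "relation": "counting",  # existence / counting / parent-child / reference
    "answer_entities": [
        {"id": "e17", "type": "table", "page": 4,
         "bbox": [72, 310, 540, 520]},  # x0, y0, x1, y1 in page coordinates
    ],
    "parent": {"id": "e12", "type": "section", "title": "3. Method"},
}
```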

PDF-MVQA annotates every paragraph, table, and figure via PDF-to-XML alignment and categories (title, section, paragraph, figure, table, captions), supporting comprehensive evaluation by entity type and semantic section (Ding et al., 2024). PaperPDF (Xie et al., 2024) uses automatic evidence annotation during LLM-powered QA pair generation, enabling scalable training and evaluation at multilingual scale.

3. Core Model Architectures and Retrieval Pipelines

Current PDF-VQA systems are unified by a retrieval-first, multimodal reasoning paradigm, but differ in architectural specifics:

a) Entity-level Multimodal Retrieval (Ding et al., 2024): The document is decomposed into a set S_E = {E_1, ..., E_n} of semantic entities (paragraphs, tables, figures), each embedded by concatenating its text, visual (RoI-pooled), layout (bounding box), positional, and one-hot label features. Questions are jointly encoded and cross-attended with entity embeddings in a Transformer encoder. Candidate entities are ranked by classifier heads over the contextualized fusion output.
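The per-entity feature concatenation can be sketched as follows. The dimensions and field layout here are illustrative assumptions, not the paper's exact configuration:

```python
import numpy as np

def embed_entity(text_emb, visual_emb, bbox, label_id, num_labels, page_idx):
    """Build one entity vector by concatenating the modality features
    described above: text, RoI visual, layout (bbox), position, one-hot label."""
    label = np.zeros(num_labels)
    label[label_id] = 1.0                         # one-hot entity-type label
    layout = np.asarray(bbox, dtype=float)        # normalized [x0, y0, x1, y1]
    position = np.array([page_idx], dtype=float)  # page / reading-order index
    return np.concatenate([text_emb, visual_emb, layout, position, label])

# Illustrative sizes: 768-d text encoding, 2048-d RoI-pooled visual feature
entity_vec = embed_entity(np.zeros(768), np.zeros(2048),
                          [0.1, 0.2, 0.9, 0.4], label_id=2,
                          num_labels=6, page_idx=3)
```

In the actual architecture these vectors are then cross-attended with the encoded question; here only the fusion input is shown.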

b) Hierarchical Graph-Based Reasoning (Ding et al., 2023): Entities become nodes in a heterogeneous graph, edges encode spatial adjacency or tree-based hierarchy, and message-passing GNNs propagate multimodal signals. Node embeddings combine visual (region features), text (BERT encodings), and layout. This structure is particularly effective for parent/child and cross-page reference tasks.
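A minimal mean-aggregation message-passing step over such an entity graph might look like this. This is a generic GNN sketch under simplifying assumptions (unweighted edges, mean aggregation), not the paper's exact architecture:

```python
import numpy as np

def message_pass(node_feats, edges, steps=2):
    """Mean-aggregation message passing over an entity graph.
    node_feats: (N, d) array of fused entity embeddings.
    edges: list of (src, dst) pairs encoding spatial or hierarchy links."""
    n, _ = node_feats.shape
    adj = np.zeros((n, n))
    for s, t in edges:
        adj[t, s] = 1.0                      # message flows src -> dst
    deg = adj.sum(axis=1, keepdims=True)
    deg[deg == 0] = 1.0                      # avoid division by zero
    h = node_feats
    for _ in range(steps):
        h = 0.5 * h + 0.5 * (adj @ h) / deg  # mix self state with neighbor mean
    return h

# Toy graph: node 0 (e.g. a section) sends its features to node 1 (a paragraph)
h = message_pass(np.eye(3), [(0, 1)], steps=1)
```

Real graph-VQA models replace the fixed 0.5 mixing with learned per-edge-type weights, but the propagation pattern is the same.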

c) Sparse Sampling/Chunking for Long Sequences (Xie et al., 2024): PDF-WuKong introduces an end-to-end sparse sampler: text and image chunks (T_i, I_j) are embedded once; queries are matched against these with cosine similarity to select only the top-k relevant chunks for LLM ingestion. This drastically reduces sequence length and computational cost, and integrates with a joint contrastive-and-generation loss L_total = L_rep + L_QA, where L_QA is the standard answer-token cross-entropy and L_rep optimizes retrieval alignment.
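The chunk-selection step reduces to a cosine-similarity top-k search over pre-computed embeddings; a minimal sketch (embedding dimensions and k values are illustrative):

```python
import numpy as np

def select_chunks(query_emb, chunk_embs, k=5):
    """Rank pre-computed chunk embeddings by cosine similarity to the
    query and return indices of the top-k chunks to feed the LLM."""
    q = query_emb / np.linalg.norm(query_emb)
    c = chunk_embs / np.linalg.norm(chunk_embs, axis=1, keepdims=True)
    sims = c @ q                    # cosine similarity per chunk
    return np.argsort(-sims)[:k]    # indices, most similar first

# Toy 2-d example: query aligned with chunk 1, partially with chunk 2
top = select_chunks(np.array([1.0, 0.0]),
                    np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]), k=2)
```

Only the selected chunks enter the LLM context, which is where the sequence-length savings come from.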

d) Concatenation-Based Feature Fusion (Long et al., 2023): Jaeger concatenates embeddings from multiple pretrained encoder branches (RoBERTa-large, GPT2-xl), projects them into a shared space, and fuses them with document/text/image features in a lightweight multi-transformer. This design achieves both lower compute (∼37% fewer FLOPs) and higher accuracy than single-encoder baselines.
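The concatenate-then-project pattern can be sketched as follows; the random matrix is a stand-in for a learned linear projection, and the branch dimensions are assumptions (1024-d for RoBERTa-large, 1600-d for GPT2-xl hidden states):

```python
import numpy as np

def fuse_branches(branch_embs, proj_dim=512, seed=0):
    """Concatenate embeddings from multiple frozen encoder branches and
    project into a shared space. The random matrix stands in for a
    learned projection layer."""
    x = np.concatenate(branch_embs)
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((proj_dim, x.size)) / np.sqrt(x.size)
    return W @ x

# e.g. one RoBERTa-large feature and one GPT2-xl feature per document region
shared = fuse_branches([np.ones(1024), np.ones(1600)])
```

Because the branch encoders are frozen and only the projection and light fusion layers are trained, this design stays cheap relative to jointly fine-tuned multi-encoder stacks.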

4. Training Objectives, Loss Functions, and Evaluation Metrics

Training regimes align with the model architecture and retrieval formulation: retrieval-oriented systems pair a contrastive or ranking objective with answer-token cross-entropy (as in PDF-WuKong's joint L_rep + L_QA loss), while entity-retrieval models train classifier heads over candidate entities directly.

Metrics include Exact Match (EM), F1 token overlap, Partial Match (PM), multi-label recall (MR), and for long-document models, Average Normalized Levenshtein Similarity (ANLS), ROUGE, and latency/tokens processed (Long et al., 2023, Ding et al., 2024, Xie et al., 2024).
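Of these metrics, ANLS is the least self-explanatory; a standard implementation (threshold τ = 0.5, case-insensitive, best match over the gold answers) is:

```python
def levenshtein(a, b):
    """Edit distance via single-row dynamic programming."""
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                       # deletion
                        dp[j - 1] + 1,                   # insertion
                        prev + (a[i - 1] != b[j - 1]))   # substitution / match
            prev = cur
    return dp[n]

def anls(pred, golds, tau=0.5):
    """Normalized Levenshtein similarity vs. the closest gold answer;
    scores whose normalized distance exceeds tau are zeroed out."""
    scores = []
    for g in golds:
        nl = levenshtein(pred.lower(), g.lower()) / max(len(pred), len(g), 1)
        scores.append(1 - nl if nl < tau else 0.0)
    return max(scores)
```

The benchmark-level ANLS is then the mean of this per-question score over the evaluation set.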

5. Empirical Results and Comparative Analysis

On the PDFVQA Task C benchmark, which requires hierarchical, cross-page reasoning, model performance is as follows (Long et al., 2023, Ding et al., 2023):

| Model | Task C Test EMA (%) |
|---|---|
| VisualBERT | 18.52 |
| ViLT | 9.87 |
| LXMERT | 14.41 |
| LoSpa | 28.99 |
| Jaeger | 33.63 |
| Graph-VQA | 68.3 (all question types; Ding et al., 2023) |

Graph-based models yield the highest overall accuracy on relation-heavy tasks (parent-child, reference resolution), especially benefiting from explicit spatial and hierarchical encoding—removing spatial edges degrades performance by 3–5% and removing hierarchy by almost 5% (Ding et al., 2023).

Sparse-sampling models (PDF-WuKong) achieve state-of-the-art F1 and cost savings for ultra-long PDFs, surpassing RAG and closed-source competitors by 8.6% absolute F1 and reducing latency 3× (Xie et al., 2024).

6. Advanced Reasoning and Multilingual Scenarios

Recent work extends PDF-VQA beyond English and flat retrieval to deeper, more robust semantic processing:

  • Hierarchical sub-question decomposition (Zhou et al., 22 Aug 2025): Complex multiple-choice questions are decomposed into explicit sub-questions; the LLM answers each, cross-verifies with extracted reference text, then aggregates to select the final answer. Bilingual prompting (English/Japanese) counteracts English-only pretraining bias. Across the Japanese PDF-VQA (JDocQA) benchmark, this yields 0.52–0.54 accuracy, with further improvement via ensemble voting.
  • ColQwen2 retrieval (Zhou et al., 22 Aug 2025): Late-interaction scoring and contrastive training on synthetic/real QA pairs refine page selection, narrowing long documents to 1–3 high-relevance candidates for the LLM, a critical step where major performance bottlenecks remain.
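Late-interaction scoring in the ColBERT family reduces to a MaxSim aggregation over token-level similarities. The sketch below assumes L2-normalised multi-vector embeddings for the query tokens and each page's patches; it illustrates the scoring rule, not ColQwen2's actual encoders:

```python
import numpy as np

def maxsim_score(query_tokens, page_tokens):
    """Late-interaction relevance: for each query token embedding, take
    its maximum similarity over the page's token/patch embeddings, then
    sum over query tokens. Inputs are assumed L2-normalised."""
    sims = query_tokens @ page_tokens.T  # (q, p) token-level similarities
    return sims.max(axis=1).sum()

def rank_pages(query_tokens, pages, top=3):
    """Return indices of the top pages by MaxSim score."""
    scores = [maxsim_score(query_tokens, p) for p in pages]
    return sorted(range(len(pages)), key=lambda i: -scores[i])[:top]

# Toy example: page 1 covers both query tokens, page 0 covers only one
q = np.eye(2)
pages = [np.array([[1.0, 0.0]]), np.eye(2)]
best = rank_pages(q, pages, top=1)
```

Because each query token is matched independently, MaxSim rewards pages that cover all aspects of the query, which is why it works well for narrowing long documents to a few candidates.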

A plausible implication is that, while deep vision–language transformers can reason well given the right evidence, the dominant limiting factor is precise, recall-oriented retrieval.

7. Future Directions and Open Problems

PDF-VQA remains an open research area with several unresolved challenges:

  • Cross-domain transfer: Approaches validated on academic papers must be extended to legal, medical, financial, and open-domain PDFs, necessitating robust parsing and domain adaptation (Xie et al., 2024, Ding et al., 2024).
  • Multi-hop and cross-document reasoning: Many real-world scenarios demand answers that combine evidence across disparate sections, tables, or even documents—not yet fully explored.
  • Dynamic chunk selection and lifelong learning: Optimizing k in sparse sampling per query, or adopting reinforcement/supervised signals for retrieval, is largely unsolved.
  • Grounded generation and explainability: Providing answer rationales via evidence indices or explicit reference to supporting content is crucial for human trust and downstream automation (Xie et al., 2024).
  • Joint vision–language–layout pretraining: Further gains may depend on pretraining models on large, unlabeled PDF corpora using graph-based masked element or masked layout modeling (Ding et al., 2023).

The PDF-VQA task is thus the crucible for advancing scalable, interpretable document AI—requiring sophisticated retrieval, robust multi-modal fusion, explicit hierarchical reasoning, and modular extensibility. It constitutes an essential benchmark for next-generation multimodal language agents operating in scholarly, industrial, and educational environments.
