Visually-Rich Documents (VrDs)
- Visually-Rich Documents are semi-structured documents where meaning emerges from the fusion of text, spatial layout, and visual cues.
- They combine elements such as tables, charts, and figures, so accurate key-value extraction and document understanding depend on methods that account for all of them.
- Modern approaches use transformer-based and graph-based models to jointly model language, structure, and visual features for robust information retrieval.
Visually-Rich Documents (VrDs) are document artifacts whose semantic content emerges not just from the surface sequence of text, but from the co-organization of language, layout, and diverse visual elements such as tables, figures, charts, and stylized graphics. They are ubiquitous across domains (forms, invoices, receipts, scientific papers, reports, and infographics) and pose distinct challenges for automated understanding and information extraction. Interpreting VrDs requires joint modeling of content and structure, including two-dimensional arrangements, hierarchical nesting, and multimodal cues, often under highly variable templates and domain-specific conventions.
1. Core Definition and Multimodal Characteristics
A Visually-Rich Document is defined as a semi-structured or unstructured page or collection of pages for which crucial semantic information is encoded in the spatial arrangement, font and style, key–value grouping, and visual regions, in addition to the textual signal. Typical document types include PDFs, scanned pages, digital/handwritten forms, scientific journal articles, infographics, and multi-page reports (Zhang, 16 Dec 2025, Ding et al., 2024, Ding et al., 2 Jun 2025, Ding et al., 2024). Distinguishing characteristics of VrDs encompass:
- Textual content: plain and stylized paragraphs, lists, field values, tokens extracted by OCR.
- Layout/structural cues: bounding boxes, tables, columns, hierarchies, flow lines, reading order.
- Visual elements: charts, figures, tables, logos, colored highlights, graphical separators.
- Semantic entities: logical fields (invoice numbers, menu items), groupings, and cross-field relations.
Crucially, meaning in VrDs is inseparable from spatial configuration. For example, in an invoice, the header "Amount" above a column of numbers marks those numbers as monetary values, while bold section headers in a resume partition the content into sections. Critical tasks thus require alignment between text, spatial position (e.g., bounding boxes), and visual attributes.
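The alignment of text, position, and style described above can be sketched as a small data structure. This is a minimal illustration, not a method from the cited works: the names `OcrToken`, `normalize_box`, and `column_under` are hypothetical, and the 0 to 1000 coordinate scale follows the convention used by LayoutLM-style models.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), origin at top-left


@dataclass(frozen=True)
class OcrToken:
    """One OCR token with its text, pixel bounding box, and a style flag."""
    text: str
    box: Box
    bold: bool = False


def normalize_box(box: Box, page_width: int, page_height: int, scale: int = 1000) -> Box:
    """Rescale pixel coordinates to a page-size-independent 0..scale grid."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )


def column_under(header: OcrToken, tokens: List[OcrToken]) -> List[OcrToken]:
    """Return tokens below `header` that overlap it horizontally, top to bottom."""
    hx0, _, hx1, hy1 = header.box
    below = [
        t for t in tokens
        if t is not header
        and t.box[1] >= hy1                                  # starts below the header
        and min(t.box[2], hx1) > max(t.box[0], hx0)          # horizontal overlap
    ]
    return sorted(below, key=lambda t: t.box[1])


# Hypothetical invoice fragment: an "Amount" header above two values.
tokens = [
    OcrToken("Amount", (400, 100, 480, 120), bold=True),
    OcrToken("12.50", (410, 140, 470, 160)),
    OcrToken("7.00", (420, 180, 470, 200)),
    OcrToken("Widget", (50, 140, 120, 160)),
]
print([t.text for t in column_under(tokens[0], tokens)])
```

The text alone ("12.50", "7.00") carries no field label; the label comes only from the geometric relation to the header, which is why the boxes are kept alongside the tokens.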
2. Modeling Strategies and Document Representation
Contemporary VrD modeling falls into three broad paradigms, each leveraging recent advances in deep learning and multimodal representation (Ding et al., 2024):
A. Sequence-Based Models
- Treat VrDs as serial token streams, embedding each token’s text, 1D/2D positional encodings, and sometimes coarse visual cues.
- Apply BiLSTM-CRF or CNN+LSTM architectures for BIO tagging.
- Limitations: Brittle on multi-column, irregular, or cross-modal layouts; weak at long-range and region-based reasoning (Ding et al., 2024, Liu et al., 2019).
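To make the sequence-based view concrete, the sketch below decodes per-token BIO tags into labeled fields. The tagger itself (e.g., a BiLSTM-CRF) is out of scope; the function name `bio_to_spans` and the tag names `KEY` and `VALUE` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group tokens into (label, text) spans according to their BIO tags.

    A "B-X" tag opens a span of type X, "I-X" continues an open span of the
    same type, and anything else (including a stray "I-X") closes the span.
    """
    spans: List[Tuple[str, str]] = []
    label, words = None, []

    def flush() -> None:
        if label is not None:
            spans.append((label, " ".join(words)))

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()
            label, words = tag[2:], [token]
        elif tag.startswith("I-") and label == tag[2:]:
            words.append(token)
        else:
            flush()
            label, words = None, []
    flush()
    return spans


tokens = ["Invoice", "No", ":", "INV-001", "Total", "42.00"]
tags = ["B-KEY", "I-KEY", "O", "B-VALUE", "B-KEY", "B-VALUE"]
print(bio_to_spans(tokens, tags))
```

The decoder only sees the serialized order of tokens, which illustrates the limitation noted above: if OCR reads across two columns, tokens of one field are interleaved with another and the spans break.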
B. Graph-Based Models
- Represent each text segment or entity as a node, with graph edges encoding spatial relations, adjacency, or logical dependencies.
- Use Graph Convolutional Networks (GCNs) or self-attention graph transformers to fuse context, enabling layout- and logic-aware entity extraction or relation prediction (Liu et al., 2019, Zhang et al., 2021).
- Strengths: Explicit relational modeling; robustness to varied layouts, spatially-linked fields, and templates.
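A minimal sketch of the graph construction step, assuming a k-nearest-neighbor rule over box centers. The cited works define their own edge rules and features; the function name `build_knn_graph` and the choice of (dx, dy) offsets as edge features are illustrative assumptions.

```python
import math
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def center(box: Box) -> Tuple[float, float]:
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def build_knn_graph(boxes: List[Box], k: int = 2) -> Dict[int, List[Tuple[int, float, float]]]:
    """Connect each text segment to its k spatially nearest segments.

    Returns {node: [(neighbor, dx, dy), ...]}, where (dx, dy) is the offset
    from the node's center to the neighbor's center. A GCN or graph
    transformer would consume these offsets as edge features.
    """
    centers = [center(b) for b in boxes]
    graph: Dict[int, List[Tuple[int, float, float]]] = {}
    for i, (xi, yi) in enumerate(centers):
        by_distance = sorted(
            (math.hypot(xj - xi, yj - yi), j)
            for j, (xj, yj) in enumerate(centers)
            if j != i
        )
        graph[i] = [
            (j, centers[j][0] - xi, centers[j][1] - yi)
            for _, j in by_distance[:k]
        ]
    return graph


# Hypothetical segments: a key, its value to the right, and a distant footer.
boxes = [(0, 0, 10, 10), (20, 0, 30, 10), (200, 200, 210, 210)]
graph = build_knn_graph(boxes, k=1)
print(graph)
```

Because edges depend on relative geometry rather than reading order, the key and its value stay linked even if the template moves them elsewhere on the page.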
C. Transformer-Based Multimodal Models
- Pretrain vision–language transformers (e.g., LayoutLM, LayoutLMv3, DocFormer) on millions of pages with tasks that blend masked language modeling, document layout, and visual patch masking.
- Fuse token, layout (2D coords), and visual (ROI/patch) features via joint attention layers. Extendable to multi-page, cross-modality tasks (Ding et al., 2024, Ding et al., 2024).
- Achieve state-of-the-art performance on key information extraction (KIE), document classification, reading order detection, and VQA.
Advanced models (e.g., DAViD (Ding et al., 2024)) employ a joint-grained architecture that incorporates token-level encoders (LayoutLMv3), entity-level vision–language encoders (LXMERT), and hierarchical transformers to fuse sequence and entity streams for both fine-grained and coarse-grained tasks.
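The fusion of token and layout features can be illustrated at the input-embedding level. The sketch below sums a word embedding with 2D position embeddings for the four box coordinates, in the spirit of LayoutLM-style inputs. It is an assumption-laden toy: random tables stand in for learned parameters, visual patch features and attention layers are omitted, and the class name `LayoutEmbedding` is hypothetical.

```python
import random
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1) on a 0..1000 grid


def make_table(rows: int, dim: int, rng: random.Random) -> List[List[float]]:
    """Random lookup table standing in for a learned embedding matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(rows)]


class LayoutEmbedding:
    """Token embedding = word vector + x/y position vectors of the box."""

    def __init__(self, vocab_size: int, dim: int, max_coord: int = 1001, seed: int = 0):
        rng = random.Random(seed)
        self.dim = dim
        self.word = make_table(vocab_size, dim, rng)
        self.x = make_table(max_coord, dim, rng)  # shared by x0 and x1
        self.y = make_table(max_coord, dim, rng)  # shared by y0 and y1

    def embed(self, token_id: int, box: Box) -> List[float]:
        x0, y0, x1, y1 = box
        parts = [self.word[token_id], self.x[x0], self.y[y0], self.x[x1], self.y[y1]]
        # Element-wise sum keeps the dimensionality fixed regardless of modality count.
        return [sum(values) for values in zip(*parts)]


emb = LayoutEmbedding(vocab_size=100, dim=8)
left = emb.embed(token_id=3, box=(0, 0, 10, 10))
right = emb.embed(token_id=3, box=(500, 0, 510, 10))
print(len(left), left != right)
```

The same token placed at two positions receives two different vectors, so downstream attention layers can distinguish, for example, a number in a header from the same number in a table cell.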
3. Key Information Extraction, Relational Understanding, and Reasoning
Information extraction on VrDs frequently targets:
- Token/sequence-level tagging: Assigning field classes, BIO labels, or slot tags per token, robust to OCR and layout noise (Nguyen et al., 2021, Ding et al., 2024).
- Entity labeling and relation extraction: Predicting directed key–value or structural links via dependency parsing or graph neural networks (Zhang et al., 2021, Li et al., 2022).
- Span and block extraction: Framing fields as spans or semantically self-contained blocks, enabling modular reasoning, multi-entity grouping, or value-absent inference (e.g., counting line items) (Nguyen et al., 2021, Bhattacharyya et al., 18 May 2025).
- Relation-centric pretraining: Direct pretraining of relation matrices (DocReL (Li et al., 2022)) or path predictions to improve downstream tasks (e.g., reading order, entity linking, structure recognition).
Recent work in layout- and spatial-aware few-shot learning incorporates variational modeling of 2D context and rectified prototypical memory to extract relations from previously unseen document types with minimal annotation (Wang et al., 2024).
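For orientation, the sketch below shows plain prototype-based few-shot classification of candidate entity pairs. It is the generic prototypical-network idea only; it does not implement the variational 2D context modeling or the rectified prototypical memory of the cited work. The labels `key_value` and `no_link` and the two-dimensional pair embeddings are made-up illustrative inputs.

```python
from typing import Dict, List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def squared_distance(u: Vector, v: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(u, v))


def build_prototypes(support: Dict[str, List[Vector]]) -> Dict[str, Vector]:
    """One prototype per relation class: the mean of its few support embeddings."""
    return {label: mean_vector(examples) for label, examples in support.items()}


def classify(query: Vector, prototypes: Dict[str, Vector]) -> str:
    """Assign the query pair to the class with the nearest prototype."""
    return min(prototypes, key=lambda label: squared_distance(query, prototypes[label]))


# Two annotated examples per class from a previously unseen document type.
support = {
    "key_value": [[1.0, 0.0], [0.8, 0.2]],
    "no_link": [[0.0, 1.0], [0.2, 0.8]],
}
prototypes = build_prototypes(support)
print(classify([0.7, 0.1], prototypes))
```

In practice the pair embeddings would come from a layout-aware encoder, so that spatial context shapes which prototype a candidate pair falls near.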
4. Benchmarks, Metrics, and Evaluation
The field has established a robust suite of benchmarks and metrics tailored to VrDs:
- Datasets: FUNSD (forms), CORD (receipts), XFUND (multilingual forms), PDF-MVQA (multi-page scientific articles), DocTrack (eye-tracked layouts), Form-NLU (multi-format forms), among others (Ding et al., 2024, Wang et al., 2023, Ding et al., 2 Jun 2025).
- Metrics: Token/entity-level exact match, Levenshtein distance, F1, Average Normalized Levenshtein Similarity (ANLS) (DeGange et al., 2022, Ding et al., 2024).
- Geometric and hierarchical criteria: Grouped/constituent IoU (bounding box overlap), Hierarchical Edit Distance (HED/UHED) quantifying fidelity of nested/grouped outputs (DeGange et al., 2022).
- Relational consistency: Relation-based F1 (entity–relation links), graph-based evaluation, BLEU for reading order.
DI-Metrics formalizes a multi-faceted, open-source evaluation suite for assessing text, geometry, and structure (DeGange et al., 2022).
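Two of the metrics listed above are simple enough to sketch directly. The code below is a minimal stand-alone version, not the DI-Metrics implementation: ANLS follows the commonly used definition with a 0.5 threshold on normalized edit distance, the lowercasing and whitespace stripping are an assumed normalization, and the function names are chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def anls(predictions: List[str], references: List[List[str]], threshold: float = 0.5) -> float:
    """Average Normalized Levenshtein Similarity over a list of questions.

    Each prediction is scored against its best-matching reference; similarity
    below 1 - threshold counts as 0.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        p = pred.strip().lower()
        for ref in refs:
            r = ref.strip().lower()
            longest = max(len(p), len(r))
            nl = levenshtein(p, r) / longest if longest else 0.0
            best = max(best, 1.0 - nl if nl < threshold else 0.0)
        total += best
    return total / len(predictions) if predictions else 0.0


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union else 0.0


print(anls(["total 42"], [["total 43"]]))
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))
```

ANLS tolerates small OCR errors in an extracted value while still penalizing wrong answers, whereas IoU checks that the predicted region is where the annotated one is; the two measure different failure modes.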
5. Retrieval, RAG, and Multimodal LLMs
Retrieval-augmented generation (RAG) over VrDs requires joint modeling of layout-dependent semantics, fine-grained grounding, and efficient multi-hop retrieval (Zhang, 16 Dec 2025, Sourati et al., 8 Oct 2025):
- MLLM Roles in Retrieval:
  - Modality-Unifying Captioners: Convert page or region images to text surrogates for text-based IR (Donut, Pix2Struct).
  - Multimodal Embedders: Encode visual + text content into joint dense spaces for cross-modal similarity search (CLIP-style).
  - End-to-End Representers: Produce holistic embeddings from raw images, often via page-level bi-encoders or patch-level late interaction models (Zhang, 16 Dec 2025).
- Dynamic, layout-aware RAG: Symbolic document graphs (LAD-RAG (Sourati et al., 8 Oct 2025)), agentic retrieval, and hybrid neural–symbolic indices are required for high-recall, low-latency multi-page QA.
- Emerging robustness benchmarks: VRD-UQA evaluates the resilience of visual LLMs (VLLMs) to questions made unanswerable by document-specific corruptions, emphasizing the need for abstention mechanisms and nuanced multimodal alignment (Napolitano et al., 14 Nov 2025).
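The patch-level late interaction mentioned above can be sketched with the MaxSim scoring rule: each query vector is matched to its most similar page patch, and the per-query maxima are summed. This is a minimal illustration with hand-written vectors; real systems obtain `query_vecs` and patch embeddings from a trained multimodal encoder, and the function names here are assumptions.

```python
import math
from typing import Dict, List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def normalize(v: Vector) -> Vector:
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v] if norm else list(v)


def maxsim_score(query_vecs: List[Vector], patch_vecs: List[Vector]) -> float:
    """Late interaction: sum over query vectors of the best cosine match on the page."""
    queries = [normalize(q) for q in query_vecs]
    patches = [normalize(p) for p in patch_vecs]
    return sum(max(dot(q, p) for p in patches) for q in queries)


def rank_pages(query_vecs: List[Vector], pages: Dict[str, List[Vector]]) -> List[str]:
    """Return page ids ordered from highest to lowest MaxSim score."""
    scored = [(maxsim_score(query_vecs, patches), page_id) for page_id, patches in pages.items()]
    return [page_id for _, page_id in sorted(scored, reverse=True)]


# Toy example: the query has two aspects; only page "p1" covers both.
query = [[1.0, 0.0], [0.0, 1.0]]
pages = {
    "p1": [[1.0, 0.0], [0.0, 1.0]],
    "p2": [[1.0, 0.0], [1.0, 0.0]],
}
print(rank_pages(query, pages))
```

Compared with a single page-level embedding, this keeps one vector per patch, so a small region such as a table cell can still dominate the match for the query term it answers, at the cost of a larger index.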
6. Practical Implications, Applications, and Future Challenges
VrD understanding drives automation in finance, healthcare, scientific publishing, law, and business operations:
- Applications: KIE for receipts/invoices, document classification, contract analysis, scientific paper QA, cross-lingual entity extraction (Ding et al., 2 Jun 2025, Nguyen et al., 2024).
- Adaptation to real-world, low-resource, and layout-variable domains: Synthetic annotation (DAViD (Ding et al., 2024)), programmatic rule synthesis (VRDSynth (Nguyen et al., 2024)), and structured relation learning enable rapid adaptation and annotation minimization.
- Efficiency and scalability: Knowledge distillation (DistilDoc (Landeghem et al., 2024)), dynamic RAG, and low-FLOP models enable deployment on resource-constrained hardware.
- Open problems: Multi-page reasoning, domain adaptation with minimal labels, OCR-free multimodal pretraining, fine-grained grounding of answers, and robust handling of noisy, highly variable layouts (Ding et al., 2024, Zhang, 16 Dec 2025).
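As a reference point for the distillation item above, the sketch below computes the standard temperature-scaled distillation loss between teacher and student logits. It is the generic formulation (KL divergence between softened distributions, scaled by the squared temperature), not the specific recipe of DistilDoc; the function names and example logits are illustrative.

```python
import math
from typing import List


def softmax(logits: List[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    student_logits: List[float],
    teacher_logits: List[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student) if p > 0)
    return temperature ** 2 * kl


teacher_logits = [4.0, 1.0, 0.5]   # e.g., a large document model's field-class scores
student_logits = [2.0, 1.5, 0.5]   # a smaller model's scores for the same token
print(distillation_loss(student_logits, teacher_logits))
```

A higher temperature flattens both distributions so the student also learns from the teacher's relative preferences among non-top classes; in training this term is usually combined with the ordinary supervised loss.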
7. Current Trends and Research Directions
- Continual and few-shot learning: Recursive span extraction, query-based modularity, and prototypical rectification are enabling VrD models to support new fields and templates with minimal additional supervision (Nguyen et al., 2021, Wang et al., 2024).
- Holistic, multi-granular retrieval: HKRAG argues that both salient and fine-print knowledge must be explicitly retrieved and fused, using hybrid masking and an uncertainty-guided agentic generation framework (Tong et al., 25 Nov 2025).
- Joint neural-symbolic architectures: Graph-based symbolic indices, adaptive retrieval over layout-structured graphs, and hybrid embedding approaches are central to reliable multi-page and cross-modal reasoning (Sourati et al., 8 Oct 2025).
- Relational and structure-aware pretraining: Self-supervised objectives that go beyond contextual modeling, such as Relational Consistency Modeling, produce general, task-agnostic relational encoders for table structure, reading order, and key–value extraction (Li et al., 2022).
- Evaluation and robustness: New datasets capture human reading order (DocTrack (Wang et al., 2023)) and test unanswerable-query detection in VLLMs (Napolitano et al., 14 Nov 2025), encouraging research in cognitive alignment and exception handling.
Visually-Rich Documents, as a research domain, synthesize advances in multimodal neural architectures, relational modeling, information extraction, retrieval, knowledge distillation, and learning from partial, noisy, or synthetically generated supervision. The field continues to evolve toward robust, efficient, and continually adaptive document intelligence systems.