DoCo Pretraining for Visual Document Understanding
- The paper proposes an object-level contrastive pretraining framework to tackle feature collapse in text-rich visual documents.
- It leverages an auxiliary multimodal encoder and ROI aggregation to fuse visual, textual, and layout cues from OCR-detected regions.
- Empirical evaluations on standard VDU benchmarks demonstrate approximately a +2% accuracy improvement with zero added inference cost.
Document Object Contrastive (DoCo) Pretraining is a pretraining framework designed to enhance visual document understanding (VDU) in large vision-language models (LVLMs) by introducing explicit object-level contrastive learning. DoCo addresses the fine-grained feature collapse observed in LVLMs when they are applied to text-rich visual inputs such as business reports, scientific papers, and invoices. By leveraging an auxiliary multimodal encoder and a novel object-level contrastive loss, DoCo improves the ability of standard LVLMs to represent and reason about the structural and textual complexity inherent in documents, while introducing no additional inference cost after pretraining (Li et al., 2024).
1. Motivation and Challenges in Visual Document Understanding
VDU tasks require models to comprehend visual scenes densely populated with textual and structured graphical elements. LVLMs, such as those initialized with CLIP-style vision transformers, typically excel at aligning global semantics between entire images and their associated captions. However, this instance discrimination paradigm is prone to fine-grained feature collapse: the global image–text alignment washes out smaller textual or graphical elements, impairing localized content comprehension and thereby hindering performance on text-rich document tasks.
Conventional LVLM pretraining lacks object-level supervision. Models are not exposed to the dense, interwoven layout of text, tables, form fields, charts, and diagrams characteristic of documents. Consequently, the vision encoder cannot acquire the localized cues necessary for tasks demanding text localization and structured reasoning.
2. DoCo Architecture and Module Design
The DoCo framework augments standard LVLM pretraining by introducing an auxiliary multimodal encoder and a region-of-interest (ROI) aggregation module to produce object-specific embeddings. Its architecture components are as follows:
- Vision Encoder: The backbone of the LVLM (e.g., a ViT) receives the full document image (e.g., 448×448 pixels), producing a sequence of visual patch embeddings $V \in \mathbb{R}^{N \times d}$, where $N$ is the number of patches and $d$ the embedding dimension.
- Auxiliary Multimodal Encoder: A document-pretrained encoder (e.g., LayoutLMv3) leverages OCR-detected bounding boxes and recognized text, encoding each box with visual, textual, and 2D positional (layout) embeddings. The output is a set of fused embeddings $\{m_0, m_1, \dots, m_K\}$, where $K$ is the number of OCR boxes and the extra entry $m_0$ corresponds to a global box covering the whole page.
- ROI Aggregation Module: For each OCR box $k$, an attention bias mask restricts attention to the document patches inside the object region. A class token is prepended, and masked self-attention aggregates the in-region visual features into an object-specific embedding $f_k$. A small MLP then projects each multimodal embedding $m_k$ into the visual space, yielding aligned pairs $(f_k, m_k)$ for contrastive learning.
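The masked attention pooling at the heart of ROI aggregation can be sketched in a few lines. The following is a minimal NumPy illustration, not the paper's implementation: the function name `roi_aggregate` and the single-query attention form (one class-token query per box) are assumptions for clarity.

```python
import numpy as np

def roi_aggregate(patches, box_masks, cls_token):
    """Masked attention pooling: for each OCR box, a class-token query attends
    only to the image patches inside that box (attention bias mask).

    patches:   (N, d) visual patch embeddings from the vision encoder
    box_masks: (K, N) boolean, True where patch n lies inside box k
               (each box is assumed to cover at least one patch)
    cls_token: (d,)   learnable query used for aggregation
    returns:   (K, d) object-specific visual embeddings
    """
    d = patches.shape[1]
    scores = patches @ cls_token / np.sqrt(d)   # (N,) attention logits
    out = np.empty((box_masks.shape[0], d))
    for k, mask in enumerate(box_masks):
        s = np.where(mask, scores, -np.inf)     # bias: block patches outside box k
        w = np.exp(s - s[mask].max())           # stable softmax numerator
        w = w / w.sum()                         # normalize over in-box patches
        out[k] = w @ patches                    # attention-weighted pooling
    return out
```

In the full module, the projection MLP that maps each multimodal embedding into this visual space would be applied afterward, so that the two branches produce comparable object vectors.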
The following table summarizes core architectural components:
| Component | Input Type | Output |
|---|---|---|
| Vision Encoder | Full image (patch sequence) | Patch embeddings $V \in \mathbb{R}^{N \times d}$ |
| Multimodal Encoder | OCR boxes + text + layout | Fused object embeddings $\{m_k\}_{k=0}^{K}$ |
| ROI Aggregation Module | Patch embeddings, ROI masks | Object-specific embeddings $\{f_k\}_{k=0}^{K}$ |
3. Object-Level Contrastive Learning Objective
DoCo formalizes object alignment via two InfoNCE-based contrastive objectives:
- Intra-Document Object Contrastive Loss (Intra-DoCo): Within a single document image, each aligned pair $(f_k, m_k)$ of visual and multimodal object embeddings is treated as a positive, with all non-matching object pairs acting as negatives. This is computed as:

$$\mathcal{L}_{\text{intra}} = -\frac{1}{K+1} \sum_{k=0}^{K} \log \frac{\exp\!\left(\mathrm{sim}(f_k, m_k)/\tau\right)}{\sum_{j=0}^{K} \exp\!\left(\mathrm{sim}(f_k, m_j)/\tau\right)}$$

where $\tau$ is a temperature parameter and $\mathrm{sim}(\cdot,\cdot)$ denotes cosine similarity.
- Inter-Document Object Contrastive Loss (Inter-DoCo): Over a batch of images, each image’s object embeddings are averaged. The resulting global representations are contrasted across the batch, treating same-image pairs as positives and cross-image pairs as negatives.
- Total DoCo Loss: The sum of the intra- and inter-document contrastive losses: $\mathcal{L}_{\text{DoCo}} = \mathcal{L}_{\text{intra}} + \mathcal{L}_{\text{inter}}$.
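The two objectives can be illustrated with a minimal NumPy sketch. The names `_info_nce` and `doco_loss` are illustrative, and the paper's exact normalization and batching may differ; the sketch assumes each image contributes the same number of object pairs.

```python
import numpy as np

def _info_nce(a, b, tau=0.07):
    """InfoNCE over rows: row i of `a` is positive with row i of `b`,
    every other row of `b` is a negative; similarity is cosine, scaled by tau."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / tau                                # pairwise cosine / temperature
    logits -= logits.max(axis=1, keepdims=True)           # numerical stability
    log_p = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_p))                       # -log p(positive), row-averaged

def doco_loss(vis_objs, mm_objs, tau=0.07):
    """vis_objs, mm_objs: (B, K, d) per-image object embeddings from the
    ROI-aggregated visual branch and the (projected) multimodal branch."""
    # Intra-DoCo: contrast object pairs within each image, then average.
    intra = np.mean([_info_nce(v, m, tau) for v, m in zip(vis_objs, mm_objs)])
    # Inter-DoCo: average each image's objects into a global vector,
    # then contrast those global representations across the batch.
    inter = _info_nce(vis_objs.mean(axis=1), mm_objs.mean(axis=1), tau)
    return intra + inter
```

Note that perfectly aligned branches (identical object embeddings in matching order) yield a lower loss than the same embeddings with positives mismatched, which is exactly the gradient signal that pulls each visual object toward its multimodal counterpart.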
4. Pretraining Procedure and Plug-and-Play Integration
DoCo pretraining involves the following workflow:
- Data Pipeline: 1.0 million image–text pairs from CC3M and LAION are processed using PaddleOCR to extract bounding boxes and text tokens.
- Multimodal Branch: During pretraining, an auxiliary multimodal branch generates object-specific features. The vision encoder and the small MLP are optimized; the multimodal encoder and the LVLM language head remain frozen.
- Joint Objectives: The standard LVLM pretraining losses—such as image–text matching and next-token prediction—are retained. The DoCo loss is added to specifically encourage fine-grained object-level alignment.
During inference or downstream fine-tuning, the entire multimodal branch (OCR, LayoutLMv3, ROI Aggregation) is removed. Only the enhanced vision encoder is retained, ensuring zero additional inference complexity. DoCo requires no changes to the downstream LVLM architecture and can be applied to any backbone. Empirical results show consistent performance improvements on both Qwen-VL-Chat and mPLUG-Owl backbones, illustrating its backbone-agnostic design (Li et al., 2024).
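The plug-and-play contract described above can be summarized as a small sketch of which modules are present and trainable in each phase. Module names here are illustrative, not the authors' identifiers.

```python
# Pretraining phase: the auxiliary branch exists only to supply the DoCo signal.
PRETRAIN_MODULES = {
    "vision_encoder":     {"present": True, "trainable": True},   # enhanced by DoCo
    "roi_aggregation":    {"present": True, "trainable": True},   # incl. projection MLP
    "multimodal_encoder": {"present": True, "trainable": False},  # frozen (e.g. LayoutLMv3)
    "ocr_pipeline":       {"present": True, "trainable": False},  # PaddleOCR, no gradients
}

def inference_modules(pretrain_modules):
    """After pretraining, the auxiliary branch is dropped entirely:
    only the enhanced vision encoder feeds the downstream LVLM,
    so inference-time compute is unchanged."""
    keep = {"vision_encoder"}
    return {name: cfg for name, cfg in pretrain_modules.items() if name in keep}
```

This is what "zero additional inference complexity" means in practice: the OCR pipeline, multimodal encoder, and ROI aggregation exist only during pretraining, and the downstream LVLM sees an ordinary vision encoder.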
5. Empirical Evaluation and Ablation Analysis
Experiments on standard VDU benchmarks, including DocVQA, TextVQA, OCRVQA, ChartQA, InfoVQA, Key-List-Cell, WTQ, and TextCaps, demonstrate the effectiveness of DoCo pretraining. Specific quantitative gains include:
- On DocVQA, Qwen-VL-Chat improves from 62.2% to 64.8% accuracy; mPLUG-Owl from 61.8% to 63.6%.
- Average improvement across eight VDU tasks is approximately +2%.
Ablation studies reveal the contribution of architectural components and loss terms:
- Intra-DoCo alone provides a +1.7% increase on DocVQA over CLIP pretraining.
- Adding Inter-DoCo yields an additional +0.9%.
- ROI Aggregation outperforms average pooling by +0.6%.
- Integration of text, layout, and image modalities provides the largest gain (+2.6% vs. no DoCo).
Qualitative analysis shows improved attention focus on text regions and more robust token generation, particularly in cases of occlusion or faint text.
6. Limitations and Prospects for Future Research
While DoCo addresses fine-grained feature collapse and enhances localized representation in LVLMs for document-centric tasks, it does not tackle higher-level reasoning such as mathematical computation, complex table-based aggregation, or commonsense inference over documents. Future research directions include the exploration of symbolic reasoning modules and advanced object co-attention mechanisms to extend capabilities toward comprehensive document understanding (Li et al., 2024).
7. Significance and Adoption Implications
DoCo constitutes a lightweight, plug-and-play framework for injecting fine-grained object-level supervision into LVLM pretraining, specifically tailored for VDU tasks. The absence of inference overhead and the demonstrated backbone-agnostic improvements position DoCo as a practical approach for advancing document analysis, information extraction, and text-centric visual reasoning within the LVLM paradigm.