High-resolution DocCompressor
- High-resolution DocCompressor is a visual-token reduction module that converts each high-resolution document page into a fixed 324-token representation.
- It employs shared ViT encoding and cross-attention guided by global low-res features to drastically reduce tokens while preserving structural information.
- Experimental results show an over 50% speedup in first-token latency and improved GPU memory efficiency with near-baseline accuracy in document Q&A tasks.
The High-resolution DocCompressor is a visual-token reduction module developed in the context of the mPLUG-DocOwl2 framework for multimodal LLMs (MLLMs) targeting efficient, OCR-free multi-page document understanding. Its principal objective is to compress high-resolution document images—consisting of tens of sub-images—into a fixed, minimal set of 324 tokens per page, leveraging global low-resolution visual features for guidance. This compression yields dramatically improved GPU memory efficiency and inference latency while maintaining high performance in question answering and structural reasoning tasks over multi-page documents (Hu et al., 2024).
1. System Placement and Functional Overview
The High-resolution DocCompressor operates between the vision encoder (a Vision Transformer, ViT) combined with a vision-to-text alignment adapter (H-Reducer) and the downstream LLM. The input comprises a global low-resolution image and a set of high-resolution sub-images spatially partitioned across an R×C grid determined by shape-adaptive cropping. The pipeline is as follows:
- Shape-adaptive cropping: Documents are cropped into sub-images plus a resized global image.
- Shared ViT encoding: A single ViT produces feature maps Vg (global view) and Vs (sub-images).
- H-Reducer: Applies a 2D convolution and fully-connected layer to align visual features to the LLM’s hidden dimension, forming the reduced global and sub-image features.
- DocCompressor: Utilizes two layers of grouped cross-attention, compressing the high-resolution sub-image features into a compact token set aligned with the global feature grid.
- Flattening and concatenation: Produces exactly 324 tokens per page, which are then prepended with page-id tokens for LLM input.
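The token arithmetic behind the fixed 324-token budget can be sketched in a few lines. The 504-pixel base resolution and 14-pixel patch size are assumptions consistent with the 324-token output, not values stated in this list:

```python
# Token arithmetic for one page (base size and patch size are assumptions).
patch = 14                     # assumed ViT patch size
base = 504                     # assumed global-image side length
h = w = base // patch          # 36x36 ViT feature grid per image
w_reduced = w // 4             # H-Reducer (1,4)-stride conv: 36 -> 9
tokens_per_page = h * w_reduced
print(tokens_per_page)         # 324
```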
2. Architectural and Mathematical Details
The DocCompressor module applies a detailed multi-stage architecture:
- Cropping and Encoding: Page images (resized to a 504×504 global view) are cropped such that each sub-image and the global view are fed into a shared ViT, yielding h×w feature maps (36×36 patch features per image).
- H-Reducer: A (1,4)-kernel, (1,4)-stride convolution reduces the width dimension by 4×, followed by an FC layer to match the LLM’s hidden size, yielding features of shape h×(w/4)×d, i.e., 36×9 per image.
- Grouped Cross-Attention in DocCompressor: Sub-image features are stacked, grouped with respect to each position in the global map, and processed by cross-attention layers. For each global position (i, j), the global feature v^g_{i,j} serves as the query and its grouped sub-image features V^s_{i,j} as keys and values.
The attention update is:

  v̂^g_{i,j} = softmax( (W_q v^g_{i,j})(W_k V^s_{i,j})^T / sqrt(d_k) ) · (W_v V^s_{i,j}) + v^g_{i,j}

All updated features v̂^g_{i,j} are stacked and flattened to produce exactly 324 tokens (36 × 9). No quantization or gating is applied; all projections are end-to-end learned.
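Because a (1,4) convolution with stride (1,4) followed by an FC layer is a purely linear map, the H-Reducer step can be sketched as a reshape that concatenates 4 width-adjacent features plus one matrix multiply. The weights below are random stand-ins, not the trained ones:

```python
import numpy as np

def h_reducer(feats, W):
    """Merge 4 width-adjacent ViT features, then project to the LLM dim."""
    h, w, d = feats.shape
    merged = feats.reshape(h, w // 4, 4 * d)   # concat 4 neighbors along width
    return merged @ W                          # [h, w/4, d_llm]

rng = np.random.default_rng(0)
feats = rng.standard_normal((36, 36, 1024))    # one crop's ViT feature map
W = rng.standard_normal((4 * 1024, 1024)) * 0.02
out = h_reducer(feats, W)
print(out.shape)   # (36, 9, 1024)
```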
3. Guidance Mechanism and Positional Grouping
Compression efficiency is achieved via a guidance mechanism:
- The global low-res features (Vg) serve as anchor queries for the cross-attention heads and also as residuals in the cross-attention output, ensuring compressed features remain closely tied to global structure.
- Positional grouping assigns each global token’s compression window, spatially matching sub-image tokens with their corresponding global context.
- This design enables effective token reduction while preserving the cross-page and intra-page structure needed for downstream comprehension.
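One plausible reading of the positional grouping, assuming the R×C crops' reduced feature maps are stitched back into a single (R·h)×(C·w/4) grid following the page layout, is that each global position (i, j) gathers the contiguous R×C block of high-resolution tokens covering the same page region. The helper name `select_group` matches the pseudocode in Section 7, but this body is an assumption:

```python
import numpy as np

def select_group(Vs, i, j, R, C):
    """Return the R*C high-res tokens spatially aligned with global (i, j).

    Vs is assumed to stitch the R x C crops' reduced feature maps into one
    [R*h, C*(w/4), d] grid following the page layout.
    """
    block = Vs[i * R:(i + 1) * R, j * C:(j + 1) * C]   # [R, C, d]
    return block.reshape(R * C, Vs.shape[-1])

# Toy example: 2x3 crops, each with a 2x2 reduced feature map, d = 5.
Vs = np.arange(4 * 6 * 5, dtype=float).reshape(4, 6, 5)
group = select_group(Vs, i=1, j=1, R=2, C=3)
print(group.shape)   # (6, 5)
```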
4. Hyperparameters and Design Rationale
Key hyperparameters and design choices include:
- Image base size: 504×504 pixels
- Maximum crops: 12 sub-images per page
- ViT patch size: 14 pixels (yielding a 36×36 feature grid per 504×504 image)
- Conv stride: (1,4) (reducing each 36×36 feature map to 36×9)
- Token count: 324 tokens per A4-sized page (roughly a 5× reduction vs. the prior ~1,700-token approach)
- Hidden dimension: Aligned to LLM (e.g., 1,024)
- Number of cross-attention layers: Two (empirically, ablation study indicates 1–4 layers yield similar results)
- Performance–efficiency trade-off: With 324 tokens, first-token latency is reduced by more than 50% and question-answering accuracy is preserved at 98% relative to uncompressed baselines.
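For reference, these settings can be bundled into a small configuration object. Class and field names are illustrative, not taken from the original implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocCompressorConfig:
    """Hyperparameters as listed above; names are illustrative."""
    max_crops: int = 12            # sub-images per page
    conv_stride: int = 4           # H-Reducer width reduction factor
    tokens_per_page: int = 324
    hidden_dim: int = 1024         # aligned to the LLM hidden size
    cross_attn_layers: int = 2

cfg = DocCompressorConfig()
# A 10-page document therefore costs a fixed 3,240 visual tokens.
print(10 * cfg.tokens_per_page)   # 3240
```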
5. Training Framework
mPLUG-DocOwl2 adopts a three-stage training recipe for optimal document-understanding performance:
- Single-Image Pretraining: Leveraging DocStruct4M (4M annotated pages) for unified structure generation. Task: generate parse JSON via cross-entropy on sequence output.
- Multi-Image Continue-Pretraining: Includes MP-DocStruct1M (1.1M multi-page documents from PixParse) plus 0.5M additional single-page data. Tasks involve multi-page text parsing and lookup.
- Multi-task Fine-tuning: Combines single-page and multi-page reasoning datasets (DocDownstream-1.0, MP-DocVQA, DUDE, NewsVideoQA, DocGenome12K). Losses incorporate answer tokens and explanation/evidence tags.
6. Empirical Evaluation
Experimental results demonstrate substantial improvements:
- Token count: Reduced from ~1,700 to 324 per page (~80% fewer)
- GPU memory usage: Decreased by 80%
- First-token latency (DocVQA/ChartQA/TextVQA): Improved from 0.58/0.53/0.56s to 0.26/0.21/0.23s (>50% speedup)
- Question answering accuracy: DocVQA (80.7%, 98% relative to baseline), ChartQA (70.0%), TextVQA (66.7%, ~97% relative)
- Multi-page MP-DocVQA: Latency 2.13s→0.95s; accuracy 67.15%→69.42%.
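The per-benchmark latency figures above imply speedups of roughly 55–60%, which can be checked directly:

```python
# First-token latencies (seconds) reported above, before vs. after compression.
before = {"DocVQA": 0.58, "ChartQA": 0.53, "TextVQA": 0.56}
after  = {"DocVQA": 0.26, "ChartQA": 0.21, "TextVQA": 0.23}

for name in before:
    speedup = 1 - after[name] / before[name]
    print(f"{name}: {speedup:.0%} faster")   # each exceeds 50%
```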
7. Module Pseudocode
The compression mechanism is succinctly expressed in the following procedural pseudocode:
```
function DOCCOMPRESSOR(Vg: [h × w/4 × d], Vs: [R·h × C·(w/4) × d]):
    // Group sub-image features to match each global location
    for i in 1..h, j in 1..(w/4):
        group_s = select_group(Vs, i, j, R, C)   // shape [RC, d]
        vg = Vg[i, j]                            // shape [d]
        // Cross-attention update
        q = Wq * vg                              // [d_k]
        K = Wk * group_s                         // [RC, d_k]
        V = Wv * group_s                         // [RC, d_v]
        A = softmax(q * K.T / sqrt(d_k))         // [1, RC]
        v_new[i, j] = A * V + vg                 // [d]
    end
    return v_new   // shape [h × (w/4) × d]; flatten → 324 tokens
```
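The same procedure can be written as a runnable NumPy sketch (single head, single layer; the weights are random stand-ins, and stitching the crops into one grid for the grouping is an assumption):

```python
import numpy as np

def doc_compressor(Vg, Vs, Wq, Wk, Wv, R, C):
    """Single-layer grouped cross-attention over grouped sub-image tokens."""
    h, w4, d = Vg.shape
    out = np.empty_like(Vg)
    for i in range(h):
        for j in range(w4):
            # High-res tokens covering the same page region as Vg[i, j]
            group = Vs[i*R:(i+1)*R, j*C:(j+1)*C].reshape(R * C, d)
            q = Vg[i, j] @ Wq                      # [d_k]
            K = group @ Wk                         # [R*C, d_k]
            V = group @ Wv                         # [R*C, d]
            a = q @ K.T / np.sqrt(K.shape[1])      # [R*C] attention logits
            a = np.exp(a - a.max()); a /= a.sum()  # softmax
            out[i, j] = a @ V + Vg[i, j]           # residual connection
    return out.reshape(h * w4, d)                  # flatten -> h*(w/4) tokens

rng = np.random.default_rng(0)
h, w4, d, R, C = 36, 9, 64, 3, 4                   # toy dims; 3x4 = 12 crops
Vg = rng.standard_normal((h, w4, d))
Vs = rng.standard_normal((R * h, C * w4, d))
Wq = rng.standard_normal((d, d)) * 0.05
Wk = rng.standard_normal((d, d)) * 0.05
Wv = rng.standard_normal((d, d)) * 0.05
tokens = doc_compressor(Vg, Vs, Wq, Wk, Wv, R, C)
print(tokens.shape)   # (324, 64)
```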
A plausible implication is that fixed-token compression—anchored by global embeddings—enables standard LLM architectures to scale to multi-document and multi-page tasks without exceeding hardware constraints, while preserving essential semantic and structural information (Hu et al., 2024).