High-resolution DocCompressor
- High-resolution DocCompressor is a visual-token reduction module that converts each high-resolution document page into a fixed 324-token representation.
- It employs shared ViT encoding and cross-attention guided by global low-res features to drastically reduce tokens while preserving structural information.
- Experimental results show an over 50% speedup in first-token latency and improved GPU memory efficiency with near-baseline accuracy in document Q&A tasks.
The High-resolution DocCompressor is a visual-token reduction module developed in the context of the mPLUG-DocOwl2 framework for multimodal LLMs (MLLMs) targeting efficient, OCR-free multi-page document understanding. Its principal objective is to compress high-resolution document images—consisting of tens of sub-images—into a fixed, minimal set of 324 tokens per page, leveraging global low-resolution visual features for guidance. This compression yields dramatically improved GPU memory efficiency and inference latency while maintaining high performance in question answering and structural reasoning tasks over multi-page documents (Hu et al., 2024).
1. System Placement and Functional Overview
The High-resolution DocCompressor operates between the vision encoder (a Vision Transformer, ViT) combined with a vision-to-text alignment adapter (H-Reducer) and the downstream LLM. The input comprises a global low-resolution image and a set of high-resolution sub-images spatially partitioned across an R×C grid determined by shape-adaptive cropping. The pipeline is as follows:
- Shape-adaptive cropping: Documents are cropped into sub-images plus a resized global image.
- Shared ViT encoding: A single ViT produces feature maps Vg (global view) and Vs (sub-images).
- H-Reducer: Applies a 2D convolution and fully-connected layer to align visual features to the LLM’s hidden dimension, forming the reduced global and sub-image features.
- DocCompressor: Utilizes two layers of grouped cross-attention, compressing the high-resolution sub-image features into a compact token set aligned with the global feature grid.
- Flattening and concatenation: Produces exactly 324 tokens per page, which are then prepended with page-id tokens for LLM input.
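The token arithmetic behind the fixed 324-token budget can be sketched in a few lines. The 504-pixel base resolution and 14-pixel patch size are assumptions consistent with the 324-token output, not values stated in this list:

```python
# Token arithmetic for one page (base size and patch size are assumptions).
patch = 14                     # assumed ViT patch size
base = 504                     # assumed global-image side length
h = w = base // patch          # 36x36 ViT feature grid per image
w_reduced = w // 4             # H-Reducer (1,4)-stride conv: 36 -> 9
tokens_per_page = h * w_reduced
print(tokens_per_page)         # 324
```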
2. Architectural and Mathematical Details
The DocCompressor module applies a detailed multi-stage architecture:
- Cropping and Encoding: Page images (resized to a 504×504 global view) are cropped such that each sub-image and the global view are fed into a shared ViT, yielding h×w feature maps (36×36 patch features per image).
- H-Reducer: A (1,4)-kernel, (1,4)-stride convolution reduces the width dimension by 4×, followed by an FC layer to match the LLM’s hidden size, yielding features of shape h×(w/4)×d, i.e., 36×9 per image.
- Grouped Cross-Attention in DocCompressor: Sub-image features are stacked, grouped with respect to each position in the global map, and processed by cross-attention layers. For each global position (i, j), the global feature v^g_{i,j} serves as the query and its grouped sub-image features V^s_{i,j} as keys and values.
The attention update is:

  v̂^g_{i,j} = softmax( (W_q v^g_{i,j})(W_k V^s_{i,j})^T / sqrt(d_k) ) · (W_v V^s_{i,j}) + v^g_{i,j}

All updated features v̂^g_{i,j} are stacked and flattened to produce exactly 324 tokens (36 × 9). No quantization or gating is applied; all projections are end-to-end learned.
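Because a (1,4) convolution with stride (1,4) followed by an FC layer is a purely linear map, the H-Reducer step can be sketched as a reshape that concatenates 4 width-adjacent features plus one matrix multiply. The weights below are random stand-ins, not the trained ones:

```python
import numpy as np

def h_reducer(feats, W):
    """Merge 4 width-adjacent ViT features, then project to the LLM dim."""
    h, w, d = feats.shape
    merged = feats.reshape(h, w // 4, 4 * d)   # concat 4 neighbors along width
    return merged @ W                          # [h, w/4, d_llm]

rng = np.random.default_rng(0)
feats = rng.standard_normal((36, 36, 1024))    # one crop's ViT feature map
W = rng.standard_normal((4 * 1024, 1024)) * 0.02
out = h_reducer(feats, W)
print(out.shape)   # (36, 9, 1024)
```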
3. Guidance Mechanism and Positional Grouping
Compression efficiency is achieved via a guidance mechanism:
- The global low-res features (Vg) serve as anchor queries for the cross-attention heads and also as residuals in the cross-attention output, ensuring compressed features remain closely tied to global structure.
- Positional grouping assigns each global token’s compression window, spatially matching sub-image tokens with their corresponding global context.
- This design enables effective token reduction while preserving the cross-page and intra-page structure needed for downstream comprehension.
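One plausible reading of the positional grouping, assuming the R×C crops' reduced feature maps are stitched back into a single (R·h)×(C·w/4) grid following the page layout, is that each global position (i, j) gathers the contiguous R×C block of high-resolution tokens covering the same page region. The helper name `select_group` matches the pseudocode in Section 7, but this body is an assumption:

```python
import numpy as np

def select_group(Vs, i, j, R, C):
    """Return the R*C high-res tokens spatially aligned with global (i, j).

    Vs is assumed to stitch the R x C crops' reduced feature maps into one
    [R*h, C*(w/4), d] grid following the page layout.
    """
    block = Vs[i * R:(i + 1) * R, j * C:(j + 1) * C]   # [R, C, d]
    return block.reshape(R * C, Vs.shape[-1])

# Toy example: 2x3 crops, each with a 2x2 reduced feature map, d = 5.
Vs = np.arange(4 * 6 * 5, dtype=float).reshape(4, 6, 5)
group = select_group(Vs, i=1, j=1, R=2, C=3)
print(group.shape)   # (6, 5)
```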
4. Hyperparameters and Design Rationale
Key hyperparameters and design choices include:
- Image base size: 504×504 pixels
- Maximum crops: 12 sub-images per page
- ViT patch size: 14 pixels (yielding a 36×36 feature grid per 504×504 image)
- Conv stride: (1,4) (reducing each 36×36 feature map to 36×9)
- Token count: 324 tokens per A4-sized page (roughly a 5× reduction vs. the prior ~1,700-token approach)
- Hidden dimension: Aligned to LLM (e.g., 1,024)
- Number of cross-attention layers: Two (empirically, ablation study indicates 1–4 layers yield similar results)
- Performance–efficiency trade-off: With 324 tokens, first-token latency is reduced by more than 50% and question-answering accuracy is preserved at 98% relative to uncompressed baselines.
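For reference, these settings can be bundled into a small configuration object. Class and field names are illustrative, not taken from the original implementation:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DocCompressorConfig:
    """Hyperparameters as listed above; names are illustrative."""
    max_crops: int = 12            # sub-images per page
    conv_stride: int = 4           # H-Reducer width reduction factor
    tokens_per_page: int = 324
    hidden_dim: int = 1024         # aligned to the LLM hidden size
    cross_attn_layers: int = 2

cfg = DocCompressorConfig()
# A 10-page document therefore costs a fixed 3,240 visual tokens.
print(10 * cfg.tokens_per_page)   # 3240
```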
5. Training Framework
mPLUG-DocOwl2 adopts a three-stage training recipe for optimal document-understanding performance:
- Single-Image Pretraining: Leveraging DocStruct4M (4M annotated pages) for unified structure generation. Task: generate parse JSON via cross-entropy on sequence output.
- Multi-Image Continue-Pretraining: Includes MP-DocStruct1M (1.1M multi-page documents from PixParse) plus 0.5M additional single-page data. Tasks involve multi-page text parsing and lookup.
- Multi-task Fine-tuning: Combines single-page and multi-page reasoning datasets (DocDownstream-1.0, MP-DocVQA, DUDE, NewsVideoQA, DocGenome12K). Losses incorporate answer tokens and explanation/evidence tags.
6. Empirical Evaluation
Experimental results demonstrate substantial improvements:
- Token count: Reduced from ~1,700 to 324 per page (~80% fewer)
- GPU memory usage: Decreased by 80%
- First-token latency (DocVQA/ChartQA/TextVQA): Improved from 0.58/0.53/0.56s to 0.26/0.21/0.23s (>50% speedup)
- Question answering accuracy: DocVQA (80.7%, 98% relative to baseline), ChartQA (70.0%), TextVQA (66.7%, ~97% relative)
- Multi-page MP-DocVQA: Latency 2.13s→0.95s; accuracy 67.15%→69.42%.
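The per-benchmark latency figures above imply speedups of roughly 55–60%, which can be checked directly:

```python
# First-token latencies (seconds) reported above, before vs. after compression.
before = {"DocVQA": 0.58, "ChartQA": 0.53, "TextVQA": 0.56}
after  = {"DocVQA": 0.26, "ChartQA": 0.21, "TextVQA": 0.23}

for name in before:
    speedup = 1 - after[name] / before[name]
    print(f"{name}: {speedup:.0%} faster")   # each exceeds 50%
```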
7. Module Pseudocode
The compression mechanism is succinctly expressed in the following procedural pseudocode:
```
function DOCCOMPRESSOR(Vg: [h × w/4 × d], Vs: [R·h × C·(w/4) × d]):
    // Group sub-image features to match each global location
    for i in 1..h, j in 1..(w/4):
        group_s = select_group(Vs, i, j, R, C)   // shape [RC, d]
        vg = Vg[i, j]                            // shape [d]
        // Cross-attention update
        q = Wq * vg                              // [d_k]
        K = Wk * group_s                         // [RC, d_k]
        V = Wv * group_s                         // [RC, d_v]
        A = softmax(q * K.T / sqrt(d_k))         // [1, RC]
        v_new[i, j] = A * V + vg                 // [d]
    end
    return v_new   // shape [h × (w/4) × d]; flatten → 324 tokens
```
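The same procedure can be written as a runnable NumPy sketch (single head, single layer; the weights are random stand-ins, and stitching the crops into one grid for the grouping is an assumption):

```python
import numpy as np

def doc_compressor(Vg, Vs, Wq, Wk, Wv, R, C):
    """Single-layer grouped cross-attention over grouped sub-image tokens."""
    h, w4, d = Vg.shape
    out = np.empty_like(Vg)
    for i in range(h):
        for j in range(w4):
            # High-res tokens covering the same page region as Vg[i, j]
            group = Vs[i*R:(i+1)*R, j*C:(j+1)*C].reshape(R * C, d)
            q = Vg[i, j] @ Wq                      # [d_k]
            K = group @ Wk                         # [R*C, d_k]
            V = group @ Wv                         # [R*C, d]
            a = q @ K.T / np.sqrt(K.shape[1])      # [R*C] attention logits
            a = np.exp(a - a.max()); a /= a.sum()  # softmax
            out[i, j] = a @ V + Vg[i, j]           # residual connection
    return out.reshape(h * w4, d)                  # flatten -> h*(w/4) tokens

rng = np.random.default_rng(0)
h, w4, d, R, C = 36, 9, 64, 3, 4                   # toy dims; 3x4 = 12 crops
Vg = rng.standard_normal((h, w4, d))
Vs = rng.standard_normal((R * h, C * w4, d))
Wq = rng.standard_normal((d, d)) * 0.05
Wk = rng.standard_normal((d, d)) * 0.05
Wv = rng.standard_normal((d, d)) * 0.05
tokens = doc_compressor(Vg, Vs, Wq, Wk, Wv, R, C)
print(tokens.shape)   # (324, 64)
```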
A plausible implication is that fixed-token compression—anchored by global embeddings—enables standard LLM architectures to scale to multi-document and multi-page tasks without exceeding hardware constraints, while preserving essential semantic and structural information (Hu et al., 2024).