
High-resolution DocCompressor

Updated 6 January 2026
  • High-resolution DocCompressor is a visual-token reduction module that compresses each page of a multi-page document into a fixed 324-token representation.
  • It employs shared ViT encoding and cross-attention guided by global low-resolution features to drastically reduce the token count while preserving structural information.
  • Experimental results show a more than 50% reduction in first-token latency and improved GPU memory efficiency, with near-baseline accuracy on document Q&A tasks.

The High-resolution DocCompressor is a visual-token reduction module developed in the context of the mPLUG-DocOwl2 framework for multimodal LLMs (MLLMs) targeting efficient, OCR-free multi-page document understanding. Its principal objective is to compress high-resolution document images—consisting of tens of sub-images—into a fixed, minimal set of 324 tokens per page, leveraging global low-resolution visual features for guidance. This compression yields dramatically improved GPU memory efficiency and inference latency while maintaining high performance in question answering and structural reasoning tasks over multi-page documents (Hu et al., 2024).

1. System Placement and Functional Overview

The High-resolution DocCompressor operates between the vision encoder (a Vision Transformer, ViT) combined with a vision-to-text alignment adapter (H-Reducer) and the downstream LLM. The input comprises a global low-resolution image $I^g \in \mathbb{R}^{H \times W \times 3}$ and a grid of high-resolution sub-images $\{I^s_{x,y}\}$ spatially partitioned across an $R \times C$ grid. The pipeline is as follows:

  1. Shape-adaptive cropping: Documents are cropped into $R \times C$ sub-images plus a resized global image.
  2. Shared ViT encoding: Produces features $V^g$ and $V^s_{x,y}$ for both global and local crops.
  3. H-Reducer: Applies a 2D convolution and fully-connected layer to align visual features to the LLM's hidden dimension, forming $\hat V^g$ and $\hat V^s_{x,y}$.
  4. DocCompressor: Utilizes two layers of grouped cross-attention, compressing the high-resolution features into a compact token set $\bar V$.
  5. Flattening and concatenation: Produces exactly 324 tokens per page, which are then prepended with page-id tokens for LLM input.
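The shape bookkeeping of these steps can be traced in a short sketch (NumPy; dimensions follow from a 504×504 base size, patch size 14, and H-Reducer stride 4, and all variable names are illustrative rather than taken from the released code):

```python
import numpy as np

# Sketch of tensor shapes through the pipeline (illustrative names;
# dims assume a 504x504 base image, ViT patch 14, H-Reducer stride 4).
H = W = 504
P = 14                 # ViT patch size
R, C = 3, 4            # crop grid: up to 12 sub-images per page
d = 1024               # LLM hidden size after the H-Reducer

h = w = H // P                              # 36x36 ViT feature map
V_hat_g = np.zeros((h, w // 4, d))          # global features after H-Reducer
V_hat_s = np.zeros((R, C, h, w // 4, d))    # sub-image features after H-Reducer

# DocCompressor keeps exactly one token per global spatial position:
V_bar = np.zeros((h, w // 4, d))
tokens = V_bar.reshape(-1, d)               # flatten -> (324, d)
assert tokens.shape == (324, 1024)
```

Note that the output size depends only on the global feature map, not on how many sub-images the cropper produces.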

2. Architectural and Mathematical Details

The DocCompressor module applies a detailed multi-stage architecture:

  • Cropping and Encoding: Page images ($504 \times 504$ pixels) are cropped such that each patch and the global view are fed into a shared ViT, yielding feature maps with $h = w = \frac{504}{14} = 36$.
  • H-Reducer: A (1,4)-kernel, (1,4)-stride convolution reduces the width dimension, followed by an FC layer to match the LLM's hidden size $\hat d$, yielding shape $\mathbb{R}^{h \times (w/4) \times \hat d}$.
  • Grouped Cross-Attention in DocCompressor: Sub-image features are stacked, grouped with respect to each position in the global map, and processed by cross-attention layers. For each global position $(i,j)$:

$$q_{ij} = W^q\,\hat v^g_{ij}, \quad K_{ij} = W^k\,\hat V^s_{ij}, \quad V_{ij} = W^v\,\hat V^s_{ij}$$

The attention update is:

$$A_{ij} = \mathrm{softmax}\!\left(q_{ij} K_{ij}^\top / \sqrt{d_k}\right), \quad \bar v_{ij} = A_{ij} V_{ij} + \hat v^g_{ij}$$

All $\bar v_{ij}$ are stacked and flattened to produce exactly 324 tokens ($N = h \times (w/4) = 36 \times 9 = 324$). No quantization or gating is applied; all projections are learned end-to-end.
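A minimal NumPy rendering of this grouped cross-attention (single head, one layer, toy hidden size; the projection matrices and random inputs are illustrative, not the released weights):

```python
import numpy as np

rng = np.random.default_rng(0)

# Single-head grouped cross-attention following the equations above.
# d = d_k = d_v for simplicity; the real module stacks two such layers.
h, w4, d = 36, 9, 64        # global grid and a toy hidden size
RC = 12                     # sub-image tokens grouped per global position

Wq = rng.standard_normal((d, d)) / np.sqrt(d)
Wk = rng.standard_normal((d, d)) / np.sqrt(d)
Wv = rng.standard_normal((d, d)) / np.sqrt(d)

V_hat_g = rng.standard_normal((h, w4, d))       # global queries
V_hat_s = rng.standard_normal((h, w4, RC, d))   # grouped sub-image features

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

q = V_hat_g @ Wq.T                              # (h, w4, d)
K = V_hat_s @ Wk.T                              # (h, w4, RC, d)
V = V_hat_s @ Wv.T                              # (h, w4, RC, d)

# One attention row per global position over its RC grouped tokens
A = softmax(np.einsum('ijd,ijkd->ijk', q, K) / np.sqrt(d))   # (h, w4, RC)
V_bar = np.einsum('ijk,ijkd->ijd', A, V) + V_hat_g           # residual add

tokens = V_bar.reshape(-1, d)
assert tokens.shape == (324, d)
```

The residual connection means each output token defaults to its global feature and only borrows high-resolution detail through the attention weights.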

3. Guidance Mechanism and Positional Grouping

Compression efficiency is achieved via a guidance mechanism:

  • The global low-resolution features $\hat V^g$ serve as anchor queries for the cross-attention heads and also as residuals in the cross-attention output, ensuring compressed features remain closely tied to the global structure.
  • Positional grouping assigns each global token’s compression window, spatially matching sub-image tokens with their corresponding global context.
  • This design empowers effective token reduction while preserving necessary cross-page and intra-page structure for downstream comprehension.
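One plausible way to realize the positional grouping, assuming the stitched high-resolution map has $R \cdot h$ rows and $C \cdot (w/4)$ columns so that each global token spatially overlaps exactly $R \cdot C$ high-resolution tokens (`group_indices` is a hypothetical helper, not an identifier from the paper):

```python
# Hypothetical index computation for positional grouping: the R*C
# high-resolution tokens that spatially overlap global position (i, j),
# assuming the stitched high-res map has shape (R*h, C*(w/4)).
def group_indices(i: int, j: int, R: int, C: int) -> list[tuple[int, int]]:
    rows = range(i * R, (i + 1) * R)
    cols = range(j * C, (j + 1) * C)
    return [(r, c) for r in rows for c in cols]

# Each global token anchors exactly R*C = 12 sub-image tokens:
idx = group_indices(2, 3, R=3, C=4)
assert len(idx) == 12
assert idx[0] == (6, 12)
```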

4. Hyperparameters and Design Rationale

Key hyperparameters and design choices include:

  • Image base size: $504 \times 504$ pixels
  • Maximum crops: 12 sub-images per page
  • ViT patch size: 14 pixels (yielding a $36 \times 36$ feature map)
  • Conv stride: 4 (reducing $w$ to $w/4$)
  • Token count: 324 tokens per A4-sized page ($\sim$80% reduction vs. a prior $\sim$1,700-token approach)
  • Hidden dimension: Aligned to the LLM (e.g., 1,024)
  • Number of cross-attention layers: Two (ablations indicate 1–4 layers yield similar results)
  • Performance–efficiency trade-off: With 324 tokens, first-token latency is reduced by more than 50% and question-answering accuracy is preserved at 98% relative to uncompressed baselines.
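The token budget follows directly from these hyperparameters; a short arithmetic check:

```python
# Deriving the 324-token budget from the hyperparameters above.
H, patch, stride = 504, 14, 4
h = H // patch                 # 36 feature rows from the ViT
w4 = (H // patch) // stride    # 9 feature columns after the (1,4)-stride conv
tokens = h * w4
assert tokens == 324
# Roughly 81% fewer tokens than a ~1,700-token-per-page baseline:
assert round(1 - tokens / 1700, 2) == 0.81
```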

5. Training Framework

mPLUG-DocOwl2 adopts a three-stage training recipe for optimal document-understanding performance:

  1. Single-Image Pretraining: Leveraging DocStruct4M (4M annotated pages) for unified structure generation. Task: generating structured parses, trained with cross-entropy on the output sequence.
  2. Multi-Image Continue-Pretraining: Includes MP-DocStruct1M (1.1M multi-page documents from PixParse) plus 0.5M additional single-page data. Tasks involve multi-page text parsing and lookup.
  3. Multi-task Fine-tuning: Combines single-page and multi-page reasoning datasets (DocDownstream-1.0, MP-DocVQA, DUDE, NewsVideoQA, DocGenome12K). Losses incorporate answer tokens and explanation/evidence tags.
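The fine-tuning losses over answer and explanation/evidence tokens amount to prompt-masked cross-entropy; a toy sketch (vocabulary size, logits, targets, and the mask are all made up for illustration):

```python
import numpy as np

# Toy prompt-masked cross-entropy: only positions flagged as answer/evidence
# tokens contribute to the loss (all values here are illustrative).
rng = np.random.default_rng(0)
vocab, seq = 10, 5
logits = rng.standard_normal((seq, vocab))
targets = np.array([3, 7, 1, 4, 2])
mask = np.array([0, 0, 1, 1, 1])       # 1 = answer/evidence, 0 = prompt

logp = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
nll = -logp[np.arange(seq), targets]   # per-token negative log-likelihood
loss = (nll * mask).sum() / mask.sum() # average over supervised tokens only
assert loss > 0.0
```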

6. Empirical Evaluation

Experimental results demonstrate substantial improvements:

  • Token count: Reduced from $\sim$1,700 to 324 per page ($\sim$80% fewer)
  • GPU memory usage: Decreased by 80%
  • First-token latency (DocVQA/ChartQA/TextVQA): Improved from 0.58/0.53/0.56s to 0.26/0.21/0.23s (>50% speedup)
  • Question answering accuracy: DocVQA (80.7%, 98% relative to baseline), ChartQA (70.0%), TextVQA (66.7%, ~97% relative)
  • Multi-page MP-DocVQA: Latency 2.13s→0.95s; accuracy 67.15%→69.42%.

7. Module Pseudocode

The compression mechanism is succinctly expressed in the following procedural pseudocode:

function DOCCOMPRESSOR(Vg: [h × (w/4) × d], Vs: [R·h × C·(w/4) × d]):
  # Group the R·C high-res tokens that overlap each global location
  for i in 1..h, j in 1..(w/4):
    group_s = select_group(Vs, i, j, R, C)   # shape [R·C, d]
    vg = Vg[i, j]                            # shape [d]
    # Cross-attention update with residual on the global query
    q = Wq · vg                              # [d_k]
    K = Wk · group_s                         # [R·C, d_k]
    V = Wv · group_s                         # [R·C, d_v]
    A = softmax(q · Kᵀ / sqrt(d_k))          # [1, R·C]
    v_new[i, j] = A · V + vg                 # [d]
  end
  return v_new   # shape [h × (w/4) × d]; flatten → 324 tokens

A plausible implication is that fixed-token compression—anchored by global embeddings—enables standard LLM architectures to scale to multi-document and multi-page tasks without exceeding hardware constraints, while preserving essential semantic and structural information (Hu et al., 2024).
