OmniDocBench v1.0 Benchmark

Updated 14 January 2026
  • OmniDocBench v1.0 is a benchmark designed to evaluate document parsing systems with extensive high-quality annotations across varied document types.
  • It features detailed evaluation protocols and metrics such as normalized edit distance, mAP, TEDS, and CDM to rigorously assess model performance.
  • The benchmark supports both pipeline and vision-language models through attribute-based evaluations that reveal system strengths and limitations.

OmniDocBench v1.0 is a comprehensive benchmark designed to evaluate document parsing systems by testing their ability to extract and structure content from a diverse range of real-world PDF documents. This benchmark emphasizes fair, fine-grained, and multi-level analysis by providing high-quality annotations across varied sources and extensive coverage of layout categories and attribute labels. Its release addresses critical gaps in prior benchmarks, particularly limited document diversity and unrealistic evaluation schemes, thus establishing a new standard for the field (Ouyang et al., 2024).

1. Dataset Composition

OmniDocBench v1.0 consists exclusively of a held-out evaluation set of 981 pages, with no training or validation split; this design keeps it strictly a benchmarking resource. The dataset integrates documents from nine distinct sources, each selected for structural and linguistic diversity, including challenging content such as handwritten notes and complex newspapers. The sources and their page counts are:

| Source | Pages |
|---|---|
| Books | 104 |
| Slides | 133 |
| Financial Reports | 81 |
| Textbooks | 96 |
| Exam Papers | 114 |
| Magazines | 97 |
| Academic Papers | 129 |
| Handwritten Notes | 116 |
| Newspapers | 111 |

This evaluation set is extensively annotated: over 100,000 region-level annotations are provided, covering 19 region types and reading order, with an additional 80,000+ span-level annotations for elements such as text runs, inline formulas, and footnotes.
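The per-source page counts above are a useful sanity check: they sum exactly to the 981-page total. A minimal verification in Python:

```python
# Per-source page counts from the table above; they should sum to 981.
PAGE_COUNTS = {
    "Books": 104,
    "Slides": 133,
    "Financial Reports": 81,
    "Textbooks": 96,
    "Exam Papers": 114,
    "Magazines": 97,
    "Academic Papers": 129,
    "Handwritten Notes": 116,
    "Newspapers": 111,
}

total = sum(PAGE_COUNTS.values())
print(total)  # 981
```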

2. Annotation Schema

OmniDocBench establishes a rigorous annotation framework, consisting of both layout categories and rich attribute tags.

2.1 Layout Categories

Nineteen region types are exhaustively labeled, covering:

  • Titles (main, chapter)
  • Text Blocks
  • Figures, Figure Captions, Figure Footnotes
  • Tables, Table Captions, Table Footnotes
  • Header, Footer, Page Number, Page Footnote
  • Code Block, Code Block Caption
  • Reference (academic context)
  • Text Span, Equation Inline (LaTeX-encoded)
  • Equation Ignore (simple expressions not requiring LaTeX)
  • Footnote Mark

2.2 Attribute Labels

Fourteen attributes are classified into three groups and systematically attached to block-level annotations:

A. Page-Level (5):

  • Language: English, Chinese, Mixed
  • Column Layout: Single, Double, Three, More-mixed, Complex
  • Fuzzy Scan: Yes/No
  • Watermark: Yes/No
  • Colorful Background: Yes/No

B. Text-Block (3):

  • Background Color: White, Single-color, Multi-color
  • Rotation: 0°, 90°, 270°, Horizontal
  • Language: English, Chinese, Mixed

C. Table (6):

  • Language: English, Chinese, Mixed
  • Frame Type: Full-frame, Omission-line, Three-line, No-frame
  • Merge-cells: Yes/No
  • Contains Formula: Yes/No
  • Colorful Background: Yes/No
  • Rotation: Yes/No

These attributes facilitate detailed stratified evaluation, allowing analysis by linguistic, structural, and visual properties.
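To make the schema concrete, here is a hypothetical shape for one annotated page carrying the attribute groups above. The field names are illustrative only, not the benchmark's actual JSON schema:

```python
# Illustrative record shape for one page; field names are invented for this
# sketch and do not reflect OmniDocBench's real annotation format.
page = {
    "page_attributes": {
        "language": "english",
        "column_layout": "double",
        "fuzzy_scan": False,
        "watermark": False,
        "colorful_background": False,
    },
    "blocks": [
        {
            "category": "table",
            "bbox": [72, 120, 540, 310],
            "attributes": {
                "language": "english",
                "frame_type": "three-line",
                "merge_cells": True,
                "contains_formula": False,
                "colorful_background": False,
                "rotation": False,
            },
        },
    ],
}

# Stratified evaluation then reduces to filtering blocks by attribute value:
tables_with_merged_cells = [
    b for b in page["blocks"]
    if b["category"] == "table" and b["attributes"].get("merge_cells")
]
print(len(tables_with_merged_cells))  # 1
```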

3. Evaluation Protocols and Metrics

OmniDocBench supports three principal evaluation modes, each tailored for precise assessment of various model capabilities.

3.1 End-to-End Evaluation

  • Input: PDF image → Model outputs Markdown.
  • Procedure: extraction and normalization (decorations are removed; tables, formulas, and code are extracted in a canonical sequence).
  • Matching: "Adjacency Search", a matching algorithm that merges and splits candidate units, then aligns model predictions with reference paragraphs by normalized edit distance.
  • Metrics: Computed on matched units.
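The alignment step above can be sketched as a greedy matcher over normalized edit distances. The benchmark's actual Adjacency Search also merges and splits units, so this is illustrative only, and all function names here are invented for the sketch:

```python
# Greedy prediction-to-reference alignment by normalized edit distance.
# A simplified stand-in for the benchmark's Adjacency Search matching.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ned(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical strings."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))

def greedy_match(preds, refs):
    """Pair each reference with its lowest-NED, not-yet-used prediction."""
    used, pairs = set(), []
    for r in refs:
        best = min((i for i in range(len(preds)) if i not in used),
                   key=lambda i: ned(preds[i], r), default=None)
        if best is not None:
            used.add(best)
            pairs.append((preds[best], r, ned(preds[best], r)))
    return pairs
```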

3.2 Task-Specific (Module) Evaluation

Independent evaluation tasks are provided for:

  • Layout Detection: bounding-box mean average precision (mAP)
  • OCR/Text Recognition: normalized edit distance
  • Table Recognition: TEDS, normalized edit distance
  • Formula Recognition: Correctness and Discrepancy Metric (CDM), normalized edit distance, BLEU
  • Reading-Order: normalized edit distance

3.3 Attribute-Based Analysis

All results can be filtered or stratified by any of the 14 attributes to identify strengths or failure points on, for example, multi-column layouts or “fuzzy scans.”
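Attribute-stratified scoring amounts to grouping per-page scores by an attribute value and averaging within each group. A minimal sketch, with illustrative record fields rather than the benchmark's actual output schema:

```python
# Group per-page NED scores by a chosen page-level attribute and average
# within each group. Record fields here are invented for illustration.
from collections import defaultdict

def stratify(results, attribute):
    """Average 'ned' per value of the given page-level attribute."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["attributes"][attribute]].append(r["ned"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

results = [
    {"attributes": {"column_layout": "single"}, "ned": 0.10},
    {"attributes": {"column_layout": "double"}, "ned": 0.30},
    {"attributes": {"column_layout": "double"}, "ned": 0.20},
]
print(stratify(results, "column_layout"))
```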

3.4 Core Metrics

  • Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
  • Recall: $\mathrm{Recall} = \frac{TP}{TP + FN}$
  • F1-score: $F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • IoU (Intersection-over-Union): $\mathrm{IoU}(A, B) = \frac{\mathrm{Area}(A \cap B)}{\mathrm{Area}(A \cup B)}$
  • Normalized Edit Distance: $\mathrm{NED}(s_1, s_2) = \frac{\mathrm{EditDistance}(s_1, s_2)}{\max(\mathrm{len}(s_1), \mathrm{len}(s_2))}$
  • TEDS: Tree-Edit Distance Similarity for table structure (see PubTabNet).
  • CDM: Correctness and Discrepancy Metric for formulas (see “CDM” paper).
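The count- and geometry-based metrics above transcribe directly into code. A short sketch (TEDS and CDM are omitted, as they require tree-edit and formula-specific machinery):

```python
# Direct transcriptions of the core metrics: precision/recall/F1 on counts,
# IoU on axis-aligned (x1, y1, x2, y2) boxes.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```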

4. Baseline Models and Evaluation Setups

All baselines are evaluated “out of the box” with no dataset-specific retraining. The main systems benchmarked are:

4.1 Pipeline-Based Tools

  • MinerU (v0.9.3): Employs DocLayout-YOLO (layout detection), PaddleOCR (text), UniMERNet (mathematics), and TablesGenerator→HTML (tables).
  • Marker (v0.2.17): Modular pipeline using custom layout, Tesseract/PaddleOCR, and pix2tex for formulas.
  • Mathpix: Proprietary OCR and LaTeX-OCR pipeline.

4.2 “Expert” Vision-LLMs

  • GOT-OCR 2.0: Unified end-to-end OCR model producing structured output.
  • Nougat (0.1.0-base): Encoder-decoder pre-trained for PDF to LaTeX generation.

4.3 General Vision-LLMs

  • GPT-4o, Qwen2-VL, InternVL2: general-purpose vision-language models evaluated without any document-specific fine-tuning (see Section 5).

5. Experimental Results and Systematic Findings

The experimental analysis reveals strengths and weaknesses of current document parsing methodologies across multiple axes.

5.1 Overall End-to-End Performance

  • Pipelines set the highest bar for English text (MinerU, edit distance 0.180) and for Chinese (Mathpix, edit distance 0.384).
  • General VLMs lag by ~20–30% on Chinese pages.

5.2 Document Type Sensitivity

  • Academic Papers: Pipeline methods excel (MinerU, edit distance 0.025) over general VLMs (~0.146).
  • Handwritten Notes: General VLMs (Qwen2-VL 0.298, InternVL2 0.226) significantly surpass pipelines (MinerU 0.984).
  • Slides and Textbooks: General VLMs generalize more effectively to these outlier formats.

5.3 Attribute Robustness

  • Fuzzy Scans, Watermarks, and Colorful Backgrounds: InternVL2 and Qwen2-VL show greatest robustness; MinerU is competitive.
  • Multi-Column Layouts: MinerU and Mathpix retain best reading-order performance, but overall accuracy drops in complex structures for all methods.

5.4 Module-Level Highlights

  • Layout Detection: DocLayout-YOLO (MinerU) achieves mean average precision of ~48.7, with reduced performance in non-academic domains.
  • Table Recognition: OCR-based RapidTable achieves ~82.5 TEDS; StructEqTable leads on no-frame tables; general VLMs score ~71–74 TEDS.
  • OCR/Text Recognition: PaddleOCR yields ~73.6% normalized edit distance; Tesseract performs best on Chinese.
  • Formula Recognition: GPT-4o, Mathpix, and UniMERNet all score ~86–87% CDM, with GPT-4o achieving the highest strict recall at 65.5%.

5.5 Observed Strengths and Limitations

  • Specialized pipelines dominate on well-formed, academic-layout pages but are brittle on scanned notes and atypical layouts.
  • General VLMs are less precise on tasks demanding fine-grained structure (e.g., table borders, strict LaTeX), but show superior robustness to visual noise and content format variability.
  • Reading-order and affiliation detection remain unsolved for all models on highly structured, multi-column, or cross-page content.

6. Benchmark Access and Reproducibility

The complete dataset and evaluation code are publicly available at https://github.com/opendatalab/OmniDocBench. The recommended environment comprises Python 3.8+, PyTorch 2.x, Faiss, Detectron2, and PaddleOCR, with layouts sourced from DocLayout-YOLO.

Setup instructions:

  1. Clone the repository and enter the directory.
  2. Create an environment: conda create -n odbench python=3.8 && conda activate odbench
  3. Install dependencies: pip install -r requirements.txt
  4. Download data: python scripts/download_evalset.py --out data/

Evaluation commands:

  • End-to-end:

    python eval/end2end_eval.py --gts data/gt.json --preds predictions.json --out results.csv

  • Module-level:

    python eval/layout_eval.py --gts data/layout_gt.json --pred data/layout_pred.json
    python eval/ocr_eval.py --gts data/ocr_gt.json --pred data/ocr_pred.json

With these resources, users can reproduce the results from Tables 1–8 of the reference, conduct attribute-stratified analyses, and evaluate novel parsing models against the OmniDocBench v1.0 benchmark (Ouyang et al., 2024).
