OmniDocBench v1.0 Benchmark

Updated 14 January 2026
  • OmniDocBench v1.0 is a benchmark designed to evaluate document parsing systems with extensive high-quality annotations across varied document types.
  • It features detailed evaluation protocols and metrics such as normalized edit distance, mAP, TEDS, and CDM to rigorously assess model performance.
  • The benchmark supports both pipeline and vision-language models through attribute-based evaluations that reveal system strengths and limitations.

OmniDocBench v1.0 is a comprehensive benchmark designed to evaluate document parsing systems by testing their ability to extract and structure content from a diverse range of real-world PDF documents. This benchmark emphasizes fair, fine-grained, and multi-level analysis by providing high-quality annotations across varied sources and extensive coverage of layout categories and attribute labels. Its release addresses critical gaps in prior benchmarks, particularly limited document diversity and unrealistic evaluation schemes, thus establishing a new standard for the field (Ouyang et al., 2024).

1. Dataset Composition

OmniDocBench v1.0 consists exclusively of a held-out evaluation set of 981 pages, with no training or validation split; this design keeps it strictly a benchmarking resource. The dataset integrates documents from nine distinct sources, each selected for structural and linguistic diversity, including challenging content such as handwritten notes and complex newspapers. The sources and their page counts are:

| Source | Pages |
|---|---|
| Books | 104 |
| Slides | 133 |
| Financial Reports | 81 |
| Textbooks | 96 |
| Exam Papers | 114 |
| Magazines | 97 |
| Academic Papers | 129 |
| Handwritten Notes | 116 |
| Newspapers | 111 |

This evaluation set is extensively annotated: over 100,000 region-level annotations are provided, covering 19 region types and reading order, with an additional 80,000+ span-level annotations for elements such as text runs, inline formulas, and footnotes.
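The per-source page counts above are a useful sanity check: they sum exactly to the 981-page total. A minimal verification in Python:

```python
# Per-source page counts from the table above; they should sum to 981.
PAGE_COUNTS = {
    "Books": 104,
    "Slides": 133,
    "Financial Reports": 81,
    "Textbooks": 96,
    "Exam Papers": 114,
    "Magazines": 97,
    "Academic Papers": 129,
    "Handwritten Notes": 116,
    "Newspapers": 111,
}

total = sum(PAGE_COUNTS.values())
print(total)  # 981
```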

2. Annotation Schema

OmniDocBench establishes a rigorous annotation framework, consisting of both layout categories and rich attribute tags.

2.1 Layout Categories

Nineteen region types are exhaustively labeled, covering:

  • Titles (main, chapter)
  • Text Blocks
  • Figures, Figure Captions, Figure Footnotes
  • Tables, Table Captions, Table Footnotes
  • Header, Footer, Page Number, Page Footnote
  • Code Block, Code Block Caption
  • Reference (academic context)
  • Text Span, Equation Inline (LaTeX-encoded)
  • Equation Ignore (simple expressions not requiring LaTeX)
  • Footnote Mark

2.2 Attribute Labels

Fourteen attributes are classified into three groups and systematically attached to block-level annotations:

A. Page-Level (5):

  • Language: English, Chinese, Mixed
  • Column Layout: Single, Double, Three, More-mixed, Complex
  • Fuzzy Scan: Yes/No
  • Watermark: Yes/No
  • Colorful Background: Yes/No

B. Text-Block (3):

  • Background Color: White, Single-color, Multi-color
  • Rotation: 0°, 90°, 270°, Horizontal
  • Language: English, Chinese, Mixed

C. Table (6):

  • Language: English, Chinese, Mixed
  • Frame Type: Full-frame, Omission-line, Three-line, No-frame
  • Merge-cells: Yes/No
  • Contains Formula: Yes/No
  • Colorful Background: Yes/No
  • Rotation: Yes/No

These attributes facilitate detailed stratified evaluation, allowing analysis by linguistic, structural, and visual properties.
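To make the schema concrete, here is a hypothetical shape for one annotated page carrying the attribute groups above. The field names are illustrative only, not the benchmark's actual JSON schema:

```python
# Illustrative record shape for one page; field names are invented for this
# sketch and do not reflect OmniDocBench's real annotation format.
page = {
    "page_attributes": {
        "language": "english",
        "column_layout": "double",
        "fuzzy_scan": False,
        "watermark": False,
        "colorful_background": False,
    },
    "blocks": [
        {
            "category": "table",
            "bbox": [72, 120, 540, 310],
            "attributes": {
                "language": "english",
                "frame_type": "three-line",
                "merge_cells": True,
                "contains_formula": False,
                "colorful_background": False,
                "rotation": False,
            },
        },
    ],
}

# Stratified evaluation then reduces to filtering blocks by attribute value:
tables_with_merged_cells = [
    b for b in page["blocks"]
    if b["category"] == "table" and b["attributes"].get("merge_cells")
]
print(len(tables_with_merged_cells))  # 1
```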

3. Evaluation Protocols and Metrics

OmniDocBench supports three principal evaluation modes, each tailored for precise assessment of various model capabilities.

3.1 End-to-End Evaluation

  • Input: PDF image → Model outputs Markdown.
  • Procedure: extraction and normalization (decorations are removed; tables, formulas, and code are extracted in a canonical sequence).
  • Matching: "Adjacency Search", a matching algorithm that merges and splits candidate units, then aligns model predictions with reference paragraphs by normalized edit distance.
  • Metrics: Computed on matched units.
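The alignment step above can be sketched as a greedy matcher over normalized edit distances. The benchmark's actual Adjacency Search also merges and splits units, so this is illustrative only, and all function names here are invented for the sketch:

```python
# Greedy prediction-to-reference alignment by normalized edit distance.
# A simplified stand-in for the benchmark's Adjacency Search matching.

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def ned(a: str, b: str) -> float:
    """Normalized edit distance in [0, 1]; 0 means identical strings."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))

def greedy_match(preds, refs):
    """Pair each reference with its lowest-NED, not-yet-used prediction."""
    used, pairs = set(), []
    for r in refs:
        best = min((i for i in range(len(preds)) if i not in used),
                   key=lambda i: ned(preds[i], r), default=None)
        if best is not None:
            used.add(best)
            pairs.append((preds[best], r, ned(preds[best], r)))
    return pairs
```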

3.2 Task-Specific (Module) Evaluation

Independent evaluation tasks are provided for:

  • Layout Detection: bounding-box mean average precision (mAP)
  • OCR/Text Recognition: normalized edit distance
  • Table Recognition: TEDS, normalized edit distance
  • Formula Recognition: Correctness and Discrepancy Metric (CDM), normalized edit distance, BLEU
  • Reading-Order: normalized edit distance

3.3 Attribute-Based Analysis

All results can be filtered or stratified by any of the 14 attributes to identify strengths or failure points on, for example, multi-column layouts or “fuzzy scans.”
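Attribute-stratified scoring amounts to grouping per-page scores by an attribute value and averaging within each group. A minimal sketch, with illustrative record fields rather than the benchmark's actual output schema:

```python
# Group per-page NED scores by a chosen page-level attribute and average
# within each group. Record fields here are invented for illustration.
from collections import defaultdict

def stratify(results, attribute):
    """Average 'ned' per value of the given page-level attribute."""
    buckets = defaultdict(list)
    for r in results:
        buckets[r["attributes"][attribute]].append(r["ned"])
    return {k: sum(v) / len(v) for k, v in buckets.items()}

results = [
    {"attributes": {"column_layout": "single"}, "ned": 0.10},
    {"attributes": {"column_layout": "double"}, "ned": 0.30},
    {"attributes": {"column_layout": "double"}, "ned": 0.20},
]
print(stratify(results, "column_layout"))
```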

3.4 Core Metrics

  • Precision: $\mathrm{Precision} = \frac{TP}{TP + FP}$
  • Recall: $\mathrm{Recall} = \frac{TP}{TP + FN}$
  • F1-score: $F_1 = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$
  • IoU (Intersection-over-Union): $\mathrm{IoU}(A, B) = \frac{\mathrm{Area}(A \cap B)}{\mathrm{Area}(A \cup B)}$
  • Normalized Edit Distance: $\mathrm{NED}(s_1, s_2) = \frac{\mathrm{EditDistance}(s_1, s_2)}{\max(\mathrm{len}(s_1), \mathrm{len}(s_2))}$
  • TEDS: Tree-Edit Distance Similarity for table structure (see PubTabNet).
  • CDM: Correctness and Discrepancy Metric for formulas (see “CDM” paper).
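The count- and geometry-based metrics above transcribe directly into code. A short sketch (TEDS and CDM are omitted, as they require tree-edit and formula-specific machinery):

```python
# Direct transcriptions of the core metrics: precision/recall/F1 on counts,
# IoU on axis-aligned (x1, y1, x2, y2) boxes.

def precision(tp: int, fp: int) -> float:
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    return tp / (tp + fn)

def f1(p: float, r: float) -> float:
    return 2 * p * r / (p + r)

def iou(box_a, box_b) -> float:
    """Intersection-over-Union of two axis-aligned boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union else 0.0
```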

4. Baseline Models and Evaluation Setups

All baselines are evaluated “out of the box” with no dataset-specific retraining. The main systems benchmarked are:

4.1 Pipeline-Based Tools

  • MinerU (v0.9.3): Employs DocLayout-YOLO (layout detection), PaddleOCR (text), UniMERNet (mathematics), and TablesGenerator→HTML (tables).
  • Marker (v0.2.17): Modular pipeline using custom layout, Tesseract/PaddleOCR, and pix2tex for formulas.
  • Mathpix: Proprietary OCR and LaTeX-OCR pipeline.

4.2 “Expert” Vision-LLMs

  • GOT-OCR 2.0: Unified end-to-end OCR model producing structured output.
  • Nougat (0.1.0-base): Encoder-decoder pre-trained for PDF to LaTeX generation.

4.3 General Vision-LLMs

  • GPT-4o, Qwen2-VL, InternVL2: general-purpose vision-language models evaluated without any document-specific fine-tuning (see Section 5).

5. Experimental Results and Systematic Findings

The experimental analysis reveals strengths and weaknesses of current document parsing methodologies across multiple axes.

5.1 Overall End-to-End Performance

  • Pipelines set the highest bar for English text (MinerU, edit distance 0.180) and for Chinese (Mathpix, edit distance 0.384).
  • General VLMs lag by ~20–30% on Chinese pages.

5.2 Document Type Sensitivity

  • Academic Papers: Pipeline methods excel (MinerU, edit distance 0.025) over general VLMs (~0.146).
  • Handwritten Notes: General VLMs (Qwen2-VL 0.298, InternVL2 0.226) significantly surpass pipelines (MinerU 0.984).
  • Slides and Textbooks: General VLMs generalize more effectively to these outlier formats.

5.3 Attribute Robustness

  • Fuzzy Scans, Watermarks, and Colorful Backgrounds: InternVL2 and Qwen2-VL show greatest robustness; MinerU is competitive.
  • Multi-Column Layouts: MinerU and Mathpix retain best reading-order performance, but overall accuracy drops in complex structures for all methods.

5.4 Module-Level Highlights

  • Layout Detection: DocLayout-YOLO (MinerU) achieves mean average precision of ~48.7, with reduced performance in non-academic domains.
  • Table Recognition: OCR-based RapidTable achieves ~82.5 TEDS; StructEqTable leads on no-frame tables; general VLMs score ~71–74 TEDS.
  • OCR/Text Recognition: PaddleOCR yields ~73.6% normalized edit distance; Tesseract performs best on Chinese.
  • Formula Recognition: GPT-4o, Mathpix, and UniMERNet all score ~86–87% CDM, with GPT-4o achieving the highest strict recall at 65.5%.

5.5 Observed Strengths and Limitations

  • Specialized pipelines dominate on well-formed, academic-layout pages but are brittle on scanned notes and atypical layouts.
  • General VLMs are less precise on tasks demanding fine-grained structure (e.g., table borders, strict LaTeX), but show superior robustness to visual noise and content format variability.
  • Reading-order and affiliation detection remain unsolved for all models on highly structured, multi-column, or cross-page content.

6. Benchmark Access and Reproducibility

The complete dataset and evaluation code are publicly available at https://github.com/opendatalab/OmniDocBench. The recommended environment comprises Python 3.8+, PyTorch 2.x, Faiss, Detectron2, and PaddleOCR, with layouts sourced from DocLayout-YOLO.

Setup instructions:

  1. Clone the repository and enter the directory.
  2. Create an environment: conda create -n odbench python=3.8 && conda activate odbench
  3. Install dependencies: pip install -r requirements.txt
  4. Download data: python scripts/download_evalset.py --out data/

Evaluation commands:

  • End-to-end:

    python eval/end2end_eval.py --gts data/gt.json --preds predictions.json --out results.csv

  • Module-level:

    python eval/layout_eval.py --gts data/layout_gt.json --pred data/layout_pred.json
    python eval/ocr_eval.py --gts data/ocr_gt.json --pred data/ocr_pred.json

With these resources, users can reproduce the results from Tables 1–8 of the reference, conduct attribute-stratified analyses, and evaluate novel parsing models against the OmniDocBench v1.0 benchmark (Ouyang et al., 2024).
